Sunday, 26 April 2009

Strongly Specified Parser Combinators

Following the approach Wouter Swierstra used for Hoare State Monad, we define a Parser monad with pre and post conditions that express soundness (but not completeness) of the parser.

The computational aspect of it is the same old parser monad, type Parser s a = [s] -> [(s, [a])], which takes a list of tokens [s] to a (possibly empty) list (here is where backtracking/nondeterminism comes into it) of parses paired with the rest of the text. It's a monad and also a monad plus.

On the specification end, we can put a precondition on the input string (for example, you might say the input length is <= some value, for ensuring termination of a recursive parser) and also a post condition comparing the input with the parsed value and the remaining output.


Definition Pre := list s -> Prop.
Definition Post (t : Set) := list s -> t -> list s -> Prop.

Program Definition Parser (pre : Pre) (t : Set) (post : Post t) : Set :=
forall i : { t : list s | pre t }, list ({ (x, r) : t * list s | post i x r }).


in Haskell we might define:

m >>= f = \i -> concatMap (uncurry f) (m i)


And in Coq, using Program we have roughly the same thing. Except that one has to apply a 'noncomputational_map' to fudge the proofs paired up with the list elements.


Program Definition Bind (a b : Set) P1 P2 Q1 Q2
(m : Parser P1 a Q1)
(f : (forall x : a, Parser (P2 x) b (Q2 x))) :
Parser (fun i => P1 i /\ forall x o, Q1 i x o -> P2 x o)
b
(fun i x' o' => exists x o, Q1 i x o /\ Q2 x o x' o') :=
fun i => @flat_map ({ (x, o) : a * list s | Q1 i x o }) _
(fun xo => match xo with (x,o) => noncomputational_map _ _ _ _ (f x o) end)
(m i).


Seeing as noncomputational map doesn't do anything, we prove a theorem expressing that as justification for extracting it out as the identity function (rather it being equivalent to map id traversing the whole list).


Theorem noncomputational_map_identity :
forall l,
map (@proj1_sig _ _) l = map (@proj1_sig _ _) (noncomputational_map l).

Extract Inlined Constant noncomputational_map => "id".


Another nice parser combinator is the fixed point of a parser:


Program Definition Fix t {P Q} :
(forall i : { t : list s | P t },
Parser (fun i' => length i' < length i /\ P i') t Q ->
list ({ (x, o) : t * list s | Q i x o })) ->
Parser P t Q :=
fun Rec =>
well_founded_induction
(well_founded_ltof ({ i : list s | P i }) (fun i => length i))
(fun i => list ({ (x, o) : t * list s | Q i x o }))
(fun x Rec' => Rec x (fun i => Rec' i _)).


It packages up well founded recursion on the size of the input string, so that any non-left recursive parsers should be easily defined.

Enough of the heavy machinary! A simple example of putting thing into work now:

 <par> ::= <epsilon> | '(' <par> ')' <par>


To parse this we define a type of tokens and abstract syntax of parsing derivations - then relate them with a function:


Inductive token := open | close.
Inductive par := epsilon : par | wrappend : par -> par -> par.

Fixpoint print (p : par) : list token :=
match p with
| epsilon => nil
| wrappend m n => (open::nil) ++ print m ++ (close::nil) ++ print n
end.


Now Program lets us define the par parser as the fixed point of the sum of epsilon and wrapped recursions:


Program Definition par_parser : Parser token (fun _ => True) par (fun x p y => x = print p ++ y /\ length y <= length x) :=
Fix token par (fun i parRec =>
Plus _ _ (fun i' => i = i') _
(fudge_pre_and_post_conditions _ _ _
(Return epsilon))
(fudge_pre_and_post_conditions _ _ _
(Symbol token eq_token_dec open >>= fun _ =>
parRec >>= fun m =>
Symbol token eq_token_dec close >>= fun _ =>
parRec >>= fun n =>
Return (wrappend m n)))
i).


Here are the actual scripts http://github.com/odge/parseq/tree/master