Recursive Descent Parsers

Parsing and grammars

A grammar describes how to generate a sequence of tokens that satisfies certain rules. For instance, the grammar for a basic calculator would be:

e = 'integer' [Value]
  | e '+' e   [Add]
  | e '*' e   [Mul]
  | '(' e ')'

Parsing is the process of obtaining a derivation that generates a sequence of tokens from a grammar. For instance, a token sequence such as:

2 + 3 * 4

Can be created from the above grammar using one of the derivations:

Mul (Add (Value 2, Value 3), Value 4)
Add(Value 2, Mul(Value 3, Value 4))

Of course, the grammar used as an example is ambiguous (there are several possible derivations), and in the real world one prefers non-ambigous grammars (so that the result of the parsing is always unique). An example of non-ambiguous grammar for the above is a grammar that takes into account the differences in priority for addition and substraction. So:

base = 'integer' [Value]
     | '(' expr ')' 

add = mul
    | mul '+' add [Add]

mul = base
    | base '*' mul [Mul]

expr = add

Where the first derivation (which was incorrect as far as priorities are concerned) is now impossible, because an Add can only appear inside a Mul if it’s inside brackets.

Recursive Descent Parsers

The simplest way to construct a parser from a grammar is to use a recursive descent parser. Assuming that the input is tokenized (every integer and operator appears as a single string within a list of strings), then a parser that effectively recognizes the expressions above is:

exception Syntax 

type r =
  | Value of int
  | Add of r * r | Mul of r * r

let rec rule_base = function
  | [] -> raise Syntax
  | "(" :: t -> let e, t = rule_expr t in
                ( match t with ")" :: t -> raise Syntax | l -> e, l )
  | i :: t -> try Value (int_of_string i), t with _ -> raise Syntax
and rule_mul t =
  let a, t = rule_base t in
  match t with
    | "*" :: t -> let b, t = rule_mul t in Mul (a,b), t
    | _ -> a, t
and rule_add t =
  let a, t = rule_mul t in
  match t with
    | "+" :: t -> let b, t = rule_add t in Add (a,b), t
    | _ -> a, t
and rule_expr t = rule_add t 

let _ = rule_expr ["2";"+";"3";"*";"4"]

The parser correctly answers Add (Value 2, Mul (Value 3, Value 4)).

Handling Errors

A syntax error appears when the provided sequence does not follow the grammar, and there is therefore no possible derivation for that sequence. Syntax errors have three possible causes : a necessary token was forgotten (consider the sequence (2 + 3 * 4 where a closing brace is missing), a token was inserted where it doesn’t belong (consider the sequence 2 ++ 3 * 4 where a surnumerary plus was added) or a correct token was replacedwth an incorrect token (consider @ + 3 * 4 where the 2 was mistyped into an @). Most of the time, when a token that doesn’t fit is encountered, it’s difficult to determine whether that token is the error, or whether a token was forgotten before it (should 2 ++ 3 be 2 + 3 or 2 + 1 + 3?).

Because of this, most parsers merely indicate the problematic token where a syntax error has happened (this is what Objective Caml does) while others also add the list of tokens that could have been accepted (see for instance PHP’s infamous Paamayim Nekudotayim error message).

The problem is that naive Recursive Descent Parsers are quite bad at handling or detecting syntax errors as soon as alternatives are involved. Consider the grammar built on top of the previous one:

def = 'let' 'n' '=' def 'in' def [Def1]
    | expr                       [Def2]

prog = 'eof'                   [Prog1]
    | 'let' 'n' '=' def prog [Prog2]
    | def ';;' pro           [Prog3]

Consider now the sequence of tokens let n = 10 %, where the final token is obviously an error. The parser will explore the possible derivations in this order:

  • Prog1 fails (‘let’ is not ‘eof’).
  • Prog2 runs and reaches ‘prog’:
    • Prog1 fails (‘%’ is not ‘eof’)
    • Prog2 fails (‘%’ is not ‘let’)
    • Prog3 reaches ‘def’:
      • Def1 fails (‘%’ is not ‘let’)
      • Def2 fails (‘%’ does not exist in ‘expr’)
    • … so ‘prog’ fails.
  • … so Prog2 fails.
  • Prog3 reaches ‘def’:
    • Def1 fails (‘%’ is not ‘in’)
    • Def2 fails (‘let’ does not exist in ‘expr’)
  • … so Prog3 fails.

The last failure encountered was that ‘let’ is not a valid initial token, since it cannot generate ‘expr’. Displaying this error is of course nonsensical : the actual error is the final token, not the first one.

The solution is to store the furthest token that generated an error (in terms of distance from the first token in the token stream) along with the list of tokens that it was expected to be. In this case, the token errors would be:

  • Token 0 (‘let’) : expected ‘eof’, ‘(‘ or ‘value’
  • Token 3 (’10′) : expected ‘let’ or ‘(‘
  • Token 4 (‘%’) : expected ‘eof’, ‘let’, ‘(‘, ‘value’ or ‘in’

And so the error provided would be along the line of “Syntax error on token ‘%’ after token ’10′, expected ‘let’, ‘(‘, ‘in’, a value or the end of file”.

Another key idea to keep in mind is the fact that one can remember opening braces, and when a closing brace is expected, signal the position of the opening brace.This would help for a missing ‘=’ after a ‘let’, for instance.

A sample implementation

The implementation can be found here.

Its interface header would look like this:

(** The description of a set of tokens. *)
module type TOKENS =
  sig
    (** The available tokens. *)
    type token 

    (** Errors generated when tokens are expected but
        not found. *)
    type error 

    (** Printing an error *)
    val string_of_error : error -> string 

    (** The end-of-file token. *)
    val eof : token
  end 

(** A recursive descent parser over a set of tokens. *)
module RecursiveDescentParser :
  functor (Tokens : TOKENS) ->
    sig 

(* ------------------------------------------------------ *) 

      (** Description of a failure *)
      type failure 

      (** Show a failure. Internally uses 
          Tokens.string_of_error on involved errors. *)
      val pretty_print : failure -> string 

      (** The exception thrown when a failure occurs. *)
      exception ParseError of failure 

(* ------------------------------------------------------ *)       

      (** An expectation function. Returns true if the token
          was expected *)
      type expect = Tokens.token -> bool 

(* ------------------------------------------------------ *) 

      (** A parsing context. This is what your parsing rules
          will be passing around. *)
      type ctx 

      (** A rule : receives a context, returns whatever was
          extracted from the context along with the new 
          context (includes any possible failures or removed
          tokens. *)
      type 'a rule = ctx -> 'a * ctx 

(* ------------------------------------------------------ *) 

      (** Get the next token in a context. Does not alter
          the context. Return Tokens.eof if no more tokens
          are available. *)
      val next : ctx -> Tokens.token 

      (** Get a context minus its first token. An empty 
          context remains an empty context. The original 
          context is not altered. *)
      val shift : ctx -> ctx 

      (** Create a new context from a list of tokens *)
      val stream_of_list : Tokens.token list -> ctx 

(* ------------------------------------------------------ *) 

      (** Have the context remember a certain failure 
          (to improve reporting) before trying again to 
          parse something else. *)
      val add_failure : failure -> ctx -> ctx 

      (** Raises a [ParseError] exception bound to the
          provided context and error. *)
      val fail : Tokens.error -> ctx -> 'a 

      (** Raises a [ParseError] exception.
          [fail_close err old ctx] works like 
          [fail err ctx], but the error is also bound 
          to a former context [old]. This serves to reference
          the opening brace tied to a missing closing brace,
          or similar constructs. *)
      val fail_close : Tokens.error -> ctx -> ctx -> 'a 

      (** Either the expectation is satisfied by the next
          token, or call [fail]. *)
      val expect : expect * Tokens.error -> ctx -> ctx 

      (** As [expect], but uses [fail_close]. *)
      val expect_close : expect * Tokens.error -> ctx -> ctx -> ctx 

(* ------------------------------------------------------ *) 

      (** Given a rule, creates a parenthesized version which
          [expect]s a first token, then reads the rule, then 
          [expect_close]s a second token, then returns the data
          returned by the rule. *)
      val between :
        expect * Tokens.error -> 'a rule -> expect * Tokens.error -> 'a rule 

      (** Given a list of rules, tries every one of them in
          order and returns the data of the first that did 
          not throw an exception. If all fail, also fails. *)
      val one_of : 'a rule list -> 'a rule 

      (** Reads as many instances of the rule from the input, until
          it fails, and returns the list in order. *)
      val many : 'a rule -> 'a list rule 

    end

The interface could certainly be improved (for instance, monadic manipulation of rules, or access to the internal representation) and so could the implementation (the manyfunction is not tail-recursive yet). Also, another of the classic issues of recursive descent parser (which is left-associative operators) is not solved.

0 Responses to “Recursive Descent Parsers”


  1. No Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>



1150 feed subscribers
(readers who polled a feed this week)