The traditional way to specify the syntax of a programming language is to write a Context Free Grammar (CFG) in Extended Backus-Naur Form (EBNF). Briefly, this form consists of a series of rules, each describing the form of some syntax element. For example, we might have a rule for postfix expression in C:
postfix-expression: primary-expression | postfix-expression '[' expression ']' | postfix-expression '(' argument-expression-list ')' | postfix-expression '.' identifier ⋮
This says that a postfix expression can be a primary expression (constant, identifier, or expression in parentheses) or it can be a postfix expression followed by some operator. An expression like
is thus parsed as
((primary-expression) '.' identifier) '.' identifier
The rule for
postfix-expression is known as left recursive because
postfix-expression appears as the leftmost symbol in some of the productions.
This causes a problem for most left-to-right parsers. When they attempt to parse
a postfix expression, they might first attempt to parse it as a
primary-expression. If that fails, they might attempt to parse it as an array
subscript. To do this, they first need to parse a
leads to an infinite recursion, where the parser continues to nest deeper and
deeper, while never consuming any input.
The traditional solution is to rewrite our rule into a right recursive form.
In this case, we notice that every
postfix-expression begins with a
primary-expression, followed by a (possibly empty) chain of subscripts, member accesses and
We can thus rewrite the grammar as
postfix-expression: primary-expression postfix-expression' postfix-expression': empty | '[' expression ']' postfix-expression' | '(' argument-expression-list ')' postfix-expression' | '.' identifier postfix-expression' ⋮
Unfortunately, this changes the associativity of the operations. Instead of being left associative, they are now right associative, and our example would be parsed as
primary-expression ('.' identifier ('.' identifier))
The solution to this is somewhat dependent on the particulars of your parser. I have been using Parsec for parsing. The left recursive grammar would be written as
This builds an abstract syntax tree and nicely matches the structure of the
grammar. Unfortunately, it doesn't complete (as we have seen). The advice on most
blogs I have seen for dealing with left recursion is to use the
combinator. This would be a nice solution, but it assumes all the subexpressions
have the same type as the result of combining them.
chainl1 can be adapted to
handle the case of subexpressions of type
a and a result of type
b. This was
useful for handling arithmetic expressions.
This works beautifully for defining the arithmetic of a language, which is often many levels deep and each level has similar structure to all the others. Postfix expressions lack this regular structure, so we need a slightly more involved approach. It might be possible to make an even more general chain combinator, which associates each operator parser with an expression parser, but it would only be used in this location, so it is simpler to just handle this case independently. The structure is similar to chainl1', but with several alternatives
This runs nicely and still preserves clarity.
Coming up with this solution made showed just how valuable having the source for a solution that does something similar to what you want is, and how valuable having a package management system (Hackage) that can show source is.
Edit: I messed up the final version of
The post was written before I had completed the loop inherent in the C expression type,
and consequently couldn't test the parser.
This is a case where implicit typing bit me.
rest should have been
but hadn't annotated it. My original version was taking a
PasecT as its argument,
which lead back to the looping problem.