Scheming, part 7. The Reader.
Posted by kernelbob on May 21, 2009
I haven’t blogged about my Scheme interpreter in ages. Development stopped from January through March, but I’ve been working on it again. Here are links to the earlier articles.
- Scheming, part 6
- Scheming, part 5
- Announcing kbscheme
- Scheming, part 3
- Scheming, part 2
- Scheming, part 1
- Not another Scheme interpreter!
The biggest new development is the reader. Most developers call them parsers, but Lisp people call them readers. I have written six readers for Scheme now, and I’ve effectively completed a class in advanced parsing theory. The current reader works very well, and I do not want to write another.
Lisp and Scheme are renowned for their simple syntax. While Scheme is a lot simpler than, say, Perl, it’s still a 30-year-old language, and it’s been accreting features the whole time. So the Revised⁶ Report on the Algorithmic Language Scheme (R6RS) takes six dense pages to describe Scheme’s syntax.
The most interesting feature to parse is the datum comment, which comments out one entire datum. (A datum is roughly the same as an expression.)
Here are some examples. In each one, the datum following #; is commented out.

```
> (+ 3 #; 4 5)
8
> (+ 3 #; (- 4 1) 5)
8
> (+ 3 4 #; (- 5 2 #; (parenthetically, 5 - 2 is 3)
> ,still commenting) 6)
13
> '(a . b #;not-c #;(not-d-either))
(a . b)
```
In the first example, a simple datum, the number 4, is commented out. In the second, the comment is a larger expression. The third example has one comment, from the first #; through commenting). That comment happens to contain another comment. The fourth example has two adjacent comments. It would be syntactically erroneous if either comment were uncommented, since only a single datum is allowed to the right of a dot.
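To make the mechanism concrete, here is a toy sketch in Python (not the interpreter’s actual C reader) of how a recursive-descent reader can honor datum comments: read the next whole datum, then throw it away. It handles only symbols, integers, lists, and #; — just enough to run the first two examples above.

```python
import re

def tokenize(text):
    # #; is its own token; parens and atoms are split out by the regex.
    return re.findall(r'#;|[()]|[^()\s]+', text)

def read(tokens):
    tok = tokens.pop(0)
    if tok == '#;':
        read(tokens)              # read the commented-out datum...
        return read(tokens)       # ...discard it, and read the real one
    if tok == '(':
        items = []
        while tokens[0] != ')':
            if tokens[0] == '#;':
                tokens.pop(0)
                read(tokens)      # discard one whole datum
            else:
                items.append(read(tokens))
        tokens.pop(0)             # drop the ')'
        return items
    return int(tok) if tok.lstrip('-').isdigit() else tok
```

Because the discarded datum is read with the same recursive routine, nesting falls out for free: a comment inside a commented-out datum is just another datum to skip.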
Oh, and just for fun, Scheme has three other kinds of comments. Single-line comments start at a semicolon and continue to the end of the line. (There are seven encodings for end of line.) Nested comments start with #| and end with |# and may contain other nested comments. And the exact string #!r6rs is a comment, too, for reasons that are almost obvious to Unix hackers and irrelevant to everyone else. I present this info as evidence that Scheme is a little overweight.
Anyway, when we last looked at my Scheme interpreter, it had a simple recursive descent reader that could recognize symbols, integers, and lists.
The first rewrite was in Yacc. (Bison, actually.) I wrote a Yacc grammar that (eventually) handled datum comments, vectors, bytevectors, dotted notation, and the eight special quoting symbols: quote, quasiquote, unquote, unquote-splicing, syntax, quasisyntax, unsyntax, and unsyntax-splicing. (Did I mention that Scheme is not so lean anymore?)
Actually, the lexical analyzer was harder than the grammar — I wrote a traditional hardcoded C scanner (but with Unicode instead of ASCII). The scanner hasn’t been reimplemented since. At this point, it does not handle non-decimal or inexact numbers (floats to you and me), strings, or character literals. But it recognizes all the rest of R6RS Scheme.
But the Yacc parser had a fatal flaw. Remember back in part 3 where I complained about how hard it is to find all the root pointers (pointers to live heap objects from outside the heap)? It’s even harder in code that Yacc writes for you.
So the Yacc parser was GC-unsafe. For several months while it was the only working parser, Scheme wouldn’t reliably survive a GC. That was fun.
The next attempt was a table-driven LR(0) parser. I just read the Wikipedia page and turned it into code for a parser generator, then called that generator at startup time. But my grammar wasn’t LR(0), because of the blasted datum comments.
For the fourth parser, I wrote an LALR parser generator. I got a prototype working in Python, then translated it into C. Once that worked, I realized the grammar couldn’t handle the last datum comment example above, and I couldn’t figure out a solution with an LALR grammar. So I started over again.
Lisp feels as though it wants to be parsed LL, anyway. You just know that most Lisp parsers are recursive descent. So I toyed with the idea of writing a hardcoded LL parser, and I made some false starts. But I kept wanting to make it table driven, and there’s this body of literature about table driven parsers…
The next reader (what number are we up to now? six?), which I hope will be the last, starts with an LL(1) grammar, constructs an LL parser at startup time, and interprets that parser. As before, I prototyped it in Python, then translated it to C.
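For flavor, here is a minimal sketch of what “table-driven LL” means, in Python like the prototype. The grammar, predict table, and names are a toy subset of mine, not kbscheme’s code: a predict table maps (nonterminal, lookahead token) to a production body, and the driver is one loop over a stack of grammar symbols.

```python
# Toy grammar:  datum ::= ATOM | '(' seq ')'     seq ::= datum seq | ε
TABLE = {
    ('datum', 'ATOM'): ['ATOM'],
    ('datum', '('):    ['(', 'seq', ')'],
    ('seq',   'ATOM'): ['datum', 'seq'],
    ('seq',   '('):    ['datum', 'seq'],
    ('seq',   ')'):    [],                    # the ε production
}

def accepts(tokens):
    """Return True iff tokens (e.g. ['(', 'ATOM', ')']) form one datum."""
    stack = ['datum']                         # grammar symbols, top at the end
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else '$'
        if top in ('ATOM', '(', ')'):         # terminal: must match the input
            if top != look:
                return False
            pos += 1
        else:                                 # nonterminal: consult the table
            body = TABLE.get((top, look))
            if body is None:
                return False
            stack.extend(reversed(body))      # push production right to left
    return pos == len(tokens)
```

The real reader builds its table from the full grammar at startup, but the driver loop is the same shape: pop a symbol, match or expand, repeat.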
That’s it, then. End of story. The sixth one stayed up. (Spamalot finally came to Eugene two weeks ago, and there was much rejoicing.)
Actually, no. We’re not done yet. At that point, I had a parser and it could distinguish valid programs from invalid ones, but it still didn’t construct the tree structure that the reader should return. I started down several paths, but finally came up with something that works. I’m not sure it’s optimal, though.
Some tokens, e.g., numbers and symbols, have values. Others, such as parentheses, don’t. When the parser consumes a token with a value, it pushes that value onto its output stack. Also, when it begins to match a grammar production, it pushes that production’s action onto the same output stack. After the whole expression is parsed, the output stack holds a sequence of syntax construction elements in RPN order.
For example, given this input expression,
(a (b 3) c)
the parser constructs this stack. (Written as a list because how else would you represent a stack in Lisp?)
(END_SEQUENCE c END_SEQUENCE 3 b BEGIN_LIST a BEGIN_LIST)
The upper case symbols are actions. They aren’t actually symbols — they’re way too clever to go into here, so go look at the source if you care.
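Sketched in Python (the prototype language), the push side might look like this. The real reader drives it from the LL tables, but recursion shows the same effect, and the action names here are plain strings standing in for those cleverer objects:

```python
def emit_stack(tokens):
    """Build the RPN output stack for one datum, listed top-of-stack first."""
    out = []
    def datum():
        tok = tokens.pop(0)
        if tok == '(':
            out.append('BEGIN_LIST')     # action for datum ::= ( interior )
            while tokens[0] != ')':
                datum()
            tokens.pop(0)                # drop the ')'
            out.append('END_SEQUENCE')   # action for interior ::= ε
        else:
            out.append(tok)              # a token with a value
    datum()
    out.reverse()                        # present the stack top first
    return out
```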
A postprocessing step called build() traverses the output stack and constructs the desired list.
This seems like a lot of work, but I had the idea that the reader shouldn’t be mutating list structure, that it’s better to generate some garbage than to mutate structures after creating them. If you accept that, the input has to be parsed left to right, and the result has to be built right to left.
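As a sketch of that postprocessing step, again in Python with names of my own choosing: build() pops the output stack top-first, starts a fresh empty list at each END_SEQUENCE, and conses each finished child onto the front of its parent at each BEGIN_LIST, so no finished list is ever mutated.

```python
END_SEQUENCE, BEGIN_LIST = 'END_SEQUENCE', 'BEGIN_LIST'

def build(stack):
    """Consume the RPN output stack (top of stack first) and build
    the result right to left by consing, never by mutating."""
    pending = []               # suspended parent lists
    current = None             # the sequence being built right now
    for item in stack:
        if item == END_SEQUENCE:
            pending.append(current)         # suspend the parent...
            current = []                    # ...and start an empty sequence
        elif item == BEGIN_LIST:
            finished, current = current, pending.pop()
            if current is None:             # outermost list: we're done
                return finished
            current = [finished] + current  # cons the child onto its parent
        else:
            current = [item] + current      # cons a value onto the front
    return current
```

Run on the stack shown above for (a (b 3) c), this yields the nested list back, built strictly right to left.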
Sometime during all this, I toyed with the idea of rewriting the scanner with regular expressions, sort of a home-made lex. I never got very far on that. A hardcoded, nonrecursive scanner is easier to read and to write than a hardcoded, nonrecursive parser. Also, my original scanner wasn’t fatally broken like the Yacc grammar.
For the record, here is the final grammar. Actions, for those productions that have them, are listed at the right. ε represents the empty sequence.
```
program  ::= comment program
program  ::= datum
program  ::= ε
datum    ::= EXACT_NUMBER
datum    ::= SIMPLE
datum    ::= ( interior )                BEGIN_LIST
datum    ::= [ interior ]                BEGIN_LIST
datum    ::= #( elements )               BEGIN_VECTOR
datum    ::= #vu8( bytes )               BEGIN_BYTEVEC
datum    ::= ABBREV datum                ABBREV
interior ::= datum tail
interior ::= comment interior
interior ::= ε                           END_SEQUENCE
tail     ::= datum tail
tail     ::= comment tail
tail     ::= . comments datum comments   DOT_END
tail     ::= ε                           END_SEQUENCE
elements ::= datum elements
elements ::= comment elements
elements ::= ε                           END_SEQUENCE
bytes    ::= EXACT_NUMBER bytes
bytes    ::= comment bytes
bytes    ::= ε                           END_SEQUENCE
comment  ::= #; datum                    DISCARD
comments ::= comment comments
comments ::= ε
```
I’m glad I did it. If you’re going to bang your head against parser generators, you could pick a worse language than Scheme. And the ability to write parser generators is a good skill to have.
Next installments: libraries and macros.