Bobbing for Kernels

See Bob. See Bob bob. Bob, Bob, bob!

Scheming, part 7. The Reader.

Posted by kernelbob on May 21, 2009

I haven’t ‘blogged about my Scheme interpreter in ages.  Development stopped from January through March, but I’ve been working on it again.  Here are links to the earlier articles.

The biggest new development  is the reader.  Most developers call them parsers, but Lisp people call them readers.  I have written six readers for Scheme now, and I’ve effectively completed a class in advanced parsing theory.  The current reader works very well, and I do not want to write another.

Lisp and Scheme are renowned for their simple syntax.  While Scheme is a lot simpler than, say, Perl, it’s still a 30 year old language, and it’s been accreting features the whole time.  So the Revised6 Report on the Algorithmic Language Scheme (R6RS) takes six dense pages to describe Scheme’s syntax.

The most interesting feature to parse is the datum comment, which comments out one entire datum.  (A datum is roughly the same as an expression.)

Here are some examples.  The colored text is commented out.

> (+ 3 #; 4 5)
8
> (+ 3 #; (- 4 1) 5)
8
> (+ 3 4 #; (- 5 2 #; (parenthetically, 5 - 2 is 3)
> ,still commenting) 6)
13
> '(a . b #;not-c #;(not-d-either))
(a . b)

In the first example, a simple datum, the number 4, is commented out.  In the second, the comment is a larger expression.  The third example has one comment, from the first #; through commenting).  That comment happens to contain another comment.  The fourth example has two adjacent comments.  It would be syntactically erroneous if either comment were uncommented, since only a single datum is allowed to the right of a dot.

Oh, and just for fun, Scheme has three other kinds of comments.  Single-line comments start at a semicolon and continue to the end of the line.  (There are seven encodings for end of line.)  Nested comments start with #| and end with |# and may contain other nested comments.  And the exact string #!r6rs is a comment, too, for reasons that are almost obvious to Unix hackers and irrelevant to everyone else.  I present this info as evidence that Scheme is a little overweight.

Anyway, when we last looked at my Scheme interpreter, it had a simple recursive descent reader that could recognize symbols, integers, and lists.

The first rewrite was in Yacc.  (Bison, actually.)  I wrote a Yacc grammar that (eventually) handled datum comments, vectors, bytevectors, dotted notation, and the eight special quoting symbols: quote, quasiquote, unquote, unquote-splicing, syntax, quasisyntax, unsyntax, and unsyntax-splicing.  (Did I mention that Scheme is not so lean anymore?)

Actually, the lexical analyzer was harder than the grammar — I wrote a traditional hardcoded C scanner (but with Unicode instead of ASCII).  The scanner hasn’t been reimplemented since.  At this point, it does not handle non-decimal or inexact numbers (floats to you and me), strings, or character literals.  But it recognizes all the rest of R6RS Scheme.

But the Yacc parser had a fatal flaw.  Remember back in part 3 where I complained about how hard it is to find all the root pointers (pointers to live heap objects from outside the heap)?  It’s even harder in code that Yacc writes for you.

So the Yacc parser was GC-unsafe.  For several months while it was the only working parser, Scheme wouldn’t reliably survive a GC.  That was fun.

The next attempt was a table-driven LR(0) parser.  I just read the Wikipedia page and turned it into code for a parser generator, then called that generator at startup time.  But my grammar wasn’t LR(0), because of the blasted datum comments.

The fourth parser was an LALR parser generator.  I got a prototype written in Python working, then I translated it into C.  After I got it working, I realized that the grammar wouldn’t handle the last datum comment example above.  I couldn’t figure out a solution with an LALR grammar.  So I started over again.

Lisp feels as though it wants to be parsed LL, anyway.  You just know that most Lisp parsers are recursive descent.  So I toyed with the idea of writing a hardcoded LL parser, and I made some false starts.  But I kept wanting to make it table driven, and there’s this body of literature about table driven parsers…

The next reader — What number are we up to now?  Six? —  The next reader, which ended up being the last, I hope, starts with an LL(0) grammar, constructs an LL parser at startup time, and interprets that parser.  As before, I prototyped it in Python, then translated it to C.

That’s it, then.  End of story.  The sixth one stayed up.  (Spamalot finally came to Eugene two weeks ago, and there was much rejoicing.)

Actually, no.  We’re not done yet.  At that point, I had a parser and it could distinguish valid programs from invalid ones, but it still didn’t construct the tree structure that the reader should return.  I started down several paths, but finally came up with something that works.  I’m not sure it’s optimal, though.

Some tokens, e.g., numbers and symbols, have values.  Other, such as parentheses, don’t.  When the parser consumes a token with a value, it pushes that value onto its output stack.  Also, when it begins to match a grammar production, it pushes that production’s action onto the same output stack.  After the whole expression is parsed, the output stack has a sequence of syntax construction elements in RPN order.

For example, given this input expression,

(a (b 3) c)

the parser constructs this stack.  (Written as a list because how else would you represent a stack in Lisp?)

(END_SEQUENCE c END_SEQUENCE 3 b BEGIN_LIST a BEGIN_LIST)

The upper case symbols are actions.  They aren’t actually symbols  — they’re way too clever to go into here, so go look at the source if you care.

A postprocessing step called build() traverses the output stack and constructs the desired list.

This seems like a lot of work, but I had the idea that the reader shouldn’t be mutating list structure, that it’s better to generate some garbage than to mutate structures after creating them.  If you accept that, the input has to be parsed left to right, and the result has to be built right to left.

Sometime during all this, I toyed with the idea of rewriting the scanner with regular expressions, sort of a home-made lex.  I never got very far on that.  A hardcoded, nonrecursive  scanner is easier to read and to write than a hardcoded, nonrecursive parser.  Also, my original scanner wasn’t fatally broken like the Yacc grammar.

For the record, here is the final grammar.  Actions, for those productions that have them, are listed at the right.  ε represents the empty sequence.

program  ::= comment program
program  ::= datum
program  ::= ε
datum    ::= EXACT_NUMBER
datum    ::= SIMPLE
datum    ::= ( interior )                   BEGIN_LIST
datum    ::= [ interior ]                   BEGIN_LIST
datum    ::= #( elements )                  BEGIN_VECTOR
datum    ::= #vu8( bytes )                  BEGIN_BYTEVEC
datum    ::= ABBREV datum                   ABBREV
interior ::= datum tail
interior ::= comment interior
interior ::= ε                              END_SEQUENCE
tail     ::= datum tail
tail     ::= comment tail
tail     ::= . comments datum comments      DOT_END
tail     ::= ε                              END_SEQUENCE
elements ::= datum elements
elements ::= comment elements
elements ::= ε                              END_SEQUENCE
bytes    ::= EXACT_NUMBER bytes
bytes    ::= comment bytes
bytes    ::= ε                              END_SEQUENCE
comment  ::= #; datum                       DISCARD
comments ::= comment comments
comments ::= (empty)

I’m glad I did it.  If you’re going to bang your head against parser generators, you could pick a worse language than Scheme.  And the ability to write parser generators is a good skill to have.

Next installments: libraries and macros.

Advertisements

3 Responses to “Scheming, part 7. The Reader.”

  1. […] Here is the guide for one man’s path of implementing a Scheme interpreter in C. Why did he do it? […]

  2. […] 2009 I’ve written here before about my efforts to write a Scheme interpreter from scratch.  The last update was in May, but I’m still working on it.  I’ve made 60 git commits since then.  The major thing […]

  3. Jack said

    Like what you did. Wishing you and yours a very happy and prosperous new year !

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: