
Introduction to SILex

OK, so what is lexing? Lexing means reading text and chopping it up into tokens. You're already familiar with read, which reads Scheme code. SILex lets you make readers for other types of code, not just Scheme code.

Here is a super simple example to get you started.

Create the file /tmp/my-test that contains this:

lett [a-yA-Y]
digg [0-9]
zedd [zZ]

%%

{lett} (list 'letter yytext)
{digg} (list 'numberic yytext)
{zedd} 'this-just-aint-right
<<EOF>> #!eof

Then, run a script or use the REPL to execute the following:

(import silex)
(lex "/tmp/my-test" "/tmp/my-lexer.scm")

Now you no longer need to import silex or refer back to your "my-test" file unless they change. Instead, in your real program you are going to use that lexer file that you just generated.

(include "/tmp/my-lexer.scm")
(lexer-init 'string "1abc23zdEF66z88")

(That is lexing on a string. You can also lex on a port or a procedure. Also, instead of just including that lexer scheme file you can wrap it in a module or whatever you like to do to keep your own projects tidy.)

Calling lexer-init creates some helper procedures to help you debug (lexer-get-line, lexer-getc and lexer-ungetc) and the main deal, what you came here for: lexer itself.

lexer is a thunk (uh, I mean, it's a procedure that doesn't take any arguments) that when called gives you the next token. So you can call it a couple of times and it'll give you, in this example, (numberic "1") then (letter "a") then (letter "b").
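Here is a minimal sketch of pulling tokens by hand, assuming /tmp/my-lexer.scm was generated as shown above. It initializes from a string port rather than a string, on the assumption that 'port is the input-type symbol for that. The name remaining-tokens is a helper I made up for illustration.

```scheme
(include "/tmp/my-lexer.scm")

;; Assumption: 'port is the input-type symbol for lexing from a port.
(lexer-init 'port (open-input-string "1ab"))

(lexer) ; (numberic "1")
(lexer) ; (letter "a")

;; Hypothetical helper: keep calling lexer until the value of the
;; <<EOF>> rule (the eof object) comes back, collecting what is left.
(define (remaining-tokens)
  (let loop ((tok (lexer)) (acc '()))
    (if (eof-object? tok)
        (reverse acc)
        (loop (lexer) (cons tok acc)))))

(remaining-tokens) ; ((letter "b"))
```

The eof-object? check only works because the <<EOF>> rule in the grammar returns #!eof.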

Port iterator pro tip

If your lex grammar file includes an <<EOF>> rule that returns #!eof, like my example did above, then lexer behaves just like a port reader, and you can use the port iterators from the module (chicken port) on it. For example (after re-initing the lexer with "1abc23zdEF66z88"),

(port-map identity lexer)

evaluates to

((numberic "1")
 (letter "a")
 (letter "b")
 (letter "c")
 (numberic "2")
 (numberic "3")
 this-just-aint-right
 (letter "d")
 (letter "E")
 (letter "F")
 (numberic "6")
 (numberic "6")
 this-just-aint-right
 (numberic "8")
 (numberic "8"))
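The other iterators in (chicken port) should work the same way. A small sketch, again assuming the lexer generated from /tmp/my-test: port-for-each runs a procedure on every token, and port-fold threads an accumulator through them.

```scheme
(import (chicken port))
(include "/tmp/my-lexer.scm")

;; Write each token on its own line.
(lexer-init 'string "1abc23zdEF66z88")
(port-for-each (lambda (tok) (write tok) (newline)) lexer)

;; Count the tokens. port-fold calls the procedure with the token
;; first and the accumulator second.
(lexer-init 'string "1abc23zdEF66z88")
(port-fold (lambda (tok count) (+ count 1)) 0 lexer) ; 15
```

Note that the lexer has to be re-inited between the two runs, since the first one consumes all the input.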

How the lex grammar file works

OK, so the main documentation doesn't tell you any of the preceding things: how to use lex to create a lexer file, how to include that file to get lexer-init, which in turn sets up the lexer procedure, and how to then use that lexer procedure. This went undocumented for 24 years unless you read the source code, which is in French.

But what the main documentation does do is tell you the format of the lex grammar file, so go back and refer to it for details.

But just so you know, here are some things that tripped me up:

First of all, to refer back to the "macros" in the top half, you need to wrap them in curly braces when you use them. (Sort of like how in shell you don't put the dollar sign when you assign variables, only when you refer to them.)

I know, I know, the text says as much (in several places) but this is what was tripping me up for the longest time since I somehow missed that.

Second of all, the macros are there for when you need them, to cut out repetition. You don't have to use them: the rules in the rules part can use regexes directly.

I used some "unnecessary" macros in my example just to show you how to use and refer back to macros.

But in reality, the following file would work just as well:

%%

[a-yA-Y] (list 'letter yytext)
[0-9] (list 'numberic yytext)
[zZ] 'this-just-aint-right
<<EOF>> #!eof
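Macros start to earn their keep once the same sub-pattern shows up in more than one rule. The sketch below is my own made-up example, not something from the SILex documentation: the file names, the digit macro and the version and int tags are all hypothetical, and it assumes a macro reference can be followed by + like any other sub-pattern. It writes the grammar out from Scheme (using CHICKEN's #<<TAG multi-line string syntax) so the whole thing can be pasted into csi in one go.

```scheme
(import silex (chicken port))

;; Hypothetical grammar: the digit macro is reused by two rules.
(define grammar #<<END
digit [0-9]

%%

\32 (yycontinue)
v{digit}+ (list 'version yytext)
{digit}+ (list 'int yytext)
<<EOF>> #!eof
END
)

;; Write the grammar file, generate the lexer from it, then load the result.
(with-output-to-file "/tmp/macro-demo"
  (lambda () (print grammar)))
(lex "/tmp/macro-demo" "/tmp/macro-demo-lexer.scm")
(load "/tmp/macro-demo-lexer.scm")

(lexer-init 'string "v2 17")
(define macro-demo-tokens (port-map identity lexer))
```

Using load rather than include is deliberate here: the lexer file does not exist until the script has run lex.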

Third of all, what do the actions (each rule is a pattern and, optionally, an action) actually do? An action is an expression that gets evaluated when its pattern matches, and the lexer gives you its value instead of giving you the token it read. The variable yytext contains that token as a string.

So, for example, if you create a lexer from /tmp/your-basic-space-splitter that looks like this:

%%

\32 (yycontinue)
[a-z]+ yytext
<<EOF>> #!eof

then (after generating and including the lexer file),

(lexer-init 'string "some boring words go here")
(port-map identity lexer)

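⇒ ("some" "boring" "words" "go" "here")

In that grammar, \32 is the space character (character code 32), and (yycontinue) skips it by asking the lexer for the next token instead.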
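An action can do more than hand back yytext. As a rough sketch (the file names are hypothetical, and the script is meant to be run from csi), here is a variation on the space splitter whose actions turn runs of digits into numbers and words into symbols. It writes the grammar out from Scheme using CHICKEN's #<<TAG multi-line string syntax, generates the lexer, and loads it.

```scheme
(import silex (chicken port))

;; Hypothetical grammar: each action converts yytext before returning it.
(define grammar #<<END
%%

\32 (yycontinue)
[0-9]+ (string->number yytext)
[a-z]+ (string->symbol yytext)
<<EOF>> #!eof
END
)

;; Write the grammar file, generate the lexer from it, then load the result.
(with-output-to-file "/tmp/number-word-splitter"
  (lambda () (print grammar)))
(lex "/tmp/number-word-splitter" "/tmp/number-word-lexer.scm")
(load "/tmp/number-word-lexer.scm")

(lexer-init 'string "take 2 eggs")
(port-map identity lexer) ; ⇒ (take 2 eggs)
```

The result is a list of two symbols and a number, not three strings, because the actions did the conversion before the tokens ever left the lexer.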

I hope you'll have lots of fun with SILex!