|
![](/i/fill.gif) |
On 02/04/2012 10:07 AM, Invisible wrote:
> Yesterday I was sitting at my PC, listening to Alisha's Attic and coding
> in Java.
>
> So... doing what I was doing 10 years ago, then. :-P
It's interesting, actually. I needed a parser. Since Java doesn't have
any parsing libraries, I started writing one. But I made the mistake of
trying to write it the way you'd write in in Haskell. Pro tip: DON'T DO
THAT!
You see, in Haskell, it's trivial to write a function that takes some
other function as its argument to do some key operation, or to implement
the continuation, or whatever. But in Java, that's wicked-hard. You'd
have to write an entire class for every "function" you want to pass as a
parameter. (The fact that "anonymous inner classes" even exist tells you
how badly some Java programmers want first-class functions...)
In Haskell, you'd write a parser combinator library. You have a couple
of primitive parsers, and then you have functions that construct more
sophisticated parsers from simpler ones. Trouble is, in Java every
"parser" has to be an entire class. And where in Haskell you can just
say "foo 1 2 3" and get a 1-argument function, in Java you must say
public class Foo extends Parser
{
private int arg1, arg2, arg3;
public Foo(int a1, int a2, int a3)
{
this.arg1 = a1;
this.arg2 = a2;
this.arg3 = a3;
}
}
This is before you even /implement/ the actual functionality of the
thing! In short, Java makes this kind of thing excruciatingly verbose.
Also, unlike Haskell, it's not designed to be used this way at all, and
so it doesn't optimise all this overhead away. I ended up writing a huge
mountain of very complex and buggy code that's a nightmare to understand.
So then I took a step back, and tried to write a parser the way a Java
programmer might do it. And you know what? The lexer is pretty simple.
You do something like
public class Lexer
{
private String buffer;
private int index;
public Lexer(String in)
{
this.buffer = in;
this.index = 0;
}
private char Peek() {return this.buffer.chatAt(this.Index);}
private void Next() {this.index++;}
private boolean EOF() {return this.Index < this.buffer.length();}
public Token GetNextToken() {...}
}
There's basically two types of token: Ones that consist of just one
special character, and ones that consist of a run of characters of a
certain type. In Haskell, you'd abstract out these patterns, but in Java
that's too painful. The single-character case is trivial to handle, and
there are only two types of character run to worry about, so the code
repartition isn't too bad.
Next, you need to parse the tokens into a parse tree. Now there's
several ways to do that. The way I tried to do it was Dijkstra's
shunting algorithm. In it's original form, it takes a list of tokens and
builds a Polish reversed list. But it's easy to modify it to build a
parse tree instead.
The original algorithm looks like
- Read token.
- If operand, output.
- If operator, push to the operator stack. (Possibly unstacking and
outputting some existing operators first.)
- At the end, unstack all remaining operators as above.
All you gotta do is keep the operands on a stack as well, and when you
unstack an operator, you pull its operands off the operand stack, make a
little expression tree out of it, and push that back to the operand
stack. At the end, there should be one operand left, which is your
expression tree.
The really tricky part is handling brackets and function calls. In fact,
I discovered that my parser fails for functions that take zero
arguments. It appears the only way to fix this is for the parser to
"know" how many arguments each function is supposed to take - which is a
dependence I really don't want to add. (The alternative is to hack
around the problem with status flags, which is a bit messy.)
Apart from empty function calls, my parser works. Trouble is, every
possible operator token becomes a whole class, because every token
behaves differently. (In particular, most tokens need to be turned into
/expressions/ at some point.) So you end up with a rather large
profusion of classes!
I ended up with 12 concrete token classes, plus about 5 abstract classes
to handle shared functionality. Add on one class for the parser (which
manages the two stacks), one class for the lexer, and one class for the
parser exception (in Java, an exception is a class) and we have a grand
total of 20 classes to implement this simple little parser. And it
doesn't even give decent error messages!
Post a reply to this message
|
![](/i/fill.gif) |