POV-Ray : Newsgroups : povray.off-topic : Very long post
From: Invisible
Subject: Very long post
Date: 23 Sep 2008 07:32:18
Message: <48d8d3c2$1@news.povray.org>
# Preface #

OK, so the muse has taken me. I want to write something. I realise 
nobody here really cares, but if you're bored maybe you could take a 
read and tell me what you think.

(Yes, I *know* Haskell sucks and only a total retard would use such a 
stupid and irrelevant language. I think we've covered that one. I'm not 
asking whether you think Haskell is cool, I'm asking what you think of 
my powers of exposition. Does what I've written *make sense*? That's the 
question I'm interested in.)


# Introduction #

There are many common and well-known idioms in OO languages. For 
example, look at any large project and you'll probably find that 
somewhere, somebody has written a class which is nearly complete except 
for a few abstract methods, in terms of which all other functionality is 
defined. Then they wrote several concrete subclasses that work purely 
by implementing these abstract methods in various ways.

It's such a common technique that it hardly needs describing. Everybody 
who's ever seen OOP will know what this is about.

Haskell is a much rarer language. It has its own idioms, most of which 
are far less well-known. One very widely used technique is the 
construction of "combinator libraries". Such a library consists of a 
small set of primitive "things", and a set of combinators for combining 
simple things into more complex things. Usually there will also be a 
collection of ready-made things which are either very commonly needed or 
non-obvious in construction.

An absolutely classic application of this technique is the construction 
of text parsers. I'd like to show off the technique by walking you 
through a high-level description of Haskell's "Parsec" library. (There 
will not be any actual Haskell syntax; this is just a description of 
concepts.)


# Parsec #

## Parsing stuff ##

Suppose you want to parse a text string to extract some kind of data 
from it. If you're working in something like Java, right now you're 
probably thinking of designing some kind of "parser object" that has an 
"input buffer" and various methods for mutating the parser state. Maybe 
it's based on some sort of finite state machine?...

Parsec takes a completely different approach - and that is the subject 
of this text.

The Parsec library defines a "parser" data type. What this actually is 
or how it works doesn't matter; a Parsec client sees only the type name. 
Parsec also provides a Run() function which takes a parser and some text 
and "runs" the parser over it.

There are a few things to explain about parsers. First of all, when a 
parser is run, it can consume zero or more characters. That is, a parser 
can terminate immediately without consuming any input at all, or a 
single parser can swallow millions of characters.

Second, when you run a parser, it can "succeed" or "fail". When a parser 
succeeds, it returns an item of data. This can be of any data type 
(e.g., it could be an integer, a string, or a list of binary trees). 
When a parser "fails", this is not a fatal program error, it just means 
that the parser returns an internal Parsec error description data 
structure. This will become clearer later.

The Run() function returns a structure indicating whether parsing 
succeeded or failed. If successful, it returns whatever the parser 
itself returned. If failed, it returns an opaque error object. There is 
a function to conveniently transform this into human-readable text, but 
you can also query it programmatically (e.g., to get the line number of 
the error so you can highlight it on screen or something).


## The parsers ##

So, that's what a parser "is", in Parsec's terminology. Now, what 
parsers does Parsec actually implement? Well, it implements the following:

- P_Fail(msg) is a parser that always fails immediately, returning "msg" 
as the error message.
- P_Succeed(x) is a parser that always succeeds immediately (i.e., 
consuming 0 characters of input). It always returns the same value, "x".
- P_Any() is a parser that reads 1 character of input and returns that 
character. (It fails with a descriptive message if EOF.)
- P_Char(x) will read 1 character of input and return that character *if 
and only if* that character happens to be "x". Otherwise it fails with 
an error like "unexpected character 'J'; expected 'F'".

(The names are made-up, but Parsec really does have all of these.)

As you can see, these parsers are not so much "simple" as "trivial". 
Their internal implementation is slightly more complex than it would 
immediately seem, but even so, each one is only a few lines of Haskell.

So what can we *do* using these parsers? Well, not a lot, really. To do 
anything useful, we need some combinators...
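For the flavour of it, here is a hypothetical sketch of these four primitives in Python (the names are invented, mirroring the made-up names above; a "parser" is modelled as a function from the input string to a tuple of (ok, value-or-error, remaining input)):

```python
def p_succeed(x):
    # Consumes nothing, always succeeds with the fixed value x.
    return lambda s: (True, x, s)

def p_fail(msg):
    # Consumes nothing, always fails with the fixed message.
    return lambda s: (False, msg, s)

def p_any():
    # Reads one character; fails only at end of input.
    def run(s):
        if s:
            return (True, s[0], s[1:])
        return (False, "unexpected end of input", s)
    return run

def p_char(c):
    # Reads one character if and only if it equals c.
    def run(s):
        if not s:
            return (False, "unexpected end of input; expected %r" % c, s)
        if s[0] == c:
            return (True, c, s[1:])
        return (False, "unexpected character %r; expected %r" % (s[0], c), s)
    return run
```

Running p_char('F') on "FOO" yields (True, 'F', "OO"); running it on "JOO" fails with exactly the kind of message described above, without consuming anything.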


## The Chain() combinator ##

The first combinator we're going to look at is something that I'm going 
to call Chain(). Consider, for example,

   parserX = Chain(parser1, parser2);

So Chain() takes two parsers and returns a parser (much like "+" takes 
two numbers and returns a number). But what does this new parser *do* 
when you run it?

- First, it runs parser1.
- If parser1 succeeds, its result is thrown away, and parser2 is run.
- If parser2 succeeds, its result is returned as the overall result of 
parserX.

In other words, it runs the first parser and then the second parser. But 
what if one or other parser fails?

- If parser1 fails, parser2 is not run at all, and parserX fails overall 
(with the same error message as parser1).
- If parser1 succeeds but parser2 fails, parserX fails overall (with the 
same error message as parser2).

It should be clear at this point that Chain() is associative - i.e., that

   parserX = Chain(Chain(x, y), z);
   parserY = Chain(x, Chain(y, z));

are two equivalent parsers. It should also be clear that Chain() 
therefore allows us to take an arbitrary number of parsers and "chain 
them together in sequence" to build one big parser that will run the 
component parsers one after another. And, finally, the first time any 
parser fails, all of the following parsers are automatically skipped and 
the overall parser fails. (A bit like throwing an exception - except 
that this is a library feature, not a hard-wired language construct.)
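In the same spirit, here is a hypothetical Python sketch of Chain() (again modelling a parser as a function from the input string to (ok, value-or-error, rest); all names invented):

```python
def p_char(c):
    # Minimal single-character parser, as sketched earlier.
    def run(s):
        if s and s[0] == c:
            return (True, c, s[1:])
        return (False, "unexpected input; expected %r" % c, s)
    return run

def chain(p1, p2):
    # Run p1 then p2; p1's result is thrown away, failures short-circuit.
    def run(s):
        ok, v, rest = p1(s)
        if not ok:
            return (False, v, rest)   # p1 failed: p2 never runs
        return p2(rest)               # p2's result becomes the overall result
    return run

# The "HELP" parser: note it returns 'P', the last parser's result.
help_parser = chain(chain(chain(
    p_char('H'), p_char('E')), p_char('L')), p_char('P'))
```

Nesting the calls the other way, chain(p_char('H'), chain(p_char('E'), ...)), gives an equivalent parser, which is the associativity property.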

Now, writing Chain(Chain(Chain(...))) gets tedious quickly, so from now 
on I'm going to use a simple shorthand. Instead of "Chain(x, y)", I'm 
going to just write "x >> y". Just remember that it means "execute x and 
then execute y" (with a few additional details, as per above).

Using this new Chain() / ">>" combinator, I can write

   P_Char('H') >> P_Char('E') >> P_Char('L') >> P_Char('P');

which represents a parser which succeeds if the input begins "HELP", and 
fails otherwise. (Note that this includes checking for EOF.) As it 
stands, however, when this parser succeeds it returns 'P', which isn't 
fantastically useful. However, consider the following trick:

   P_Char('H') >> P_Char('E') >> P_Char('L') >> P_Char('P') >> 
P_Succeed("HELP");

Recall that P_Succeed() always succeeds and returns the specified value. 
However, >> will only execute this if all the preceding parsers 
succeed. In other words, this parser does exactly the same thing as the 
previous one, except that it returns "HELP" instead of just 'P' as its 
result.

It turns out that parsers like the one above are extremely common, so 
Parsec provides us with a shortcut for writing them:

   function P_String(string)
   {
     out = P_Char(string[0]);

     for (int i=1; i < string.size(); i++)
       out = Chain(out, P_Char(string[i]));

     return Chain(out, P_Succeed(string));
   }

Now we can just say P_String("HELP") to generate the above parser. That 
should save us some typing!

So, we now have a parser that can read an arbitrary string, checking for 
EOF at each stage, and output readable error messages. (E.g., if you do 
P_String("HELP") and give it "HEAD", you get "error on line 1, column 3: 
unexpected character 'A'; expected 'L'".)

But this still isn't terribly exciting; a simple string comparison would 
do the same job!


## The Fork() combinator ##

The next combinator is called Fork(). It also takes a pair of parsers 
and returns a parser. However, just like "+" and "-" return different 
numbers, this combinator returns a slightly different parser:

   parserX = Chain(parser1, parser2);
   parserY = Fork(parser1, parser2);

As you know, parserX will run parser1 and then run parser2 only if 
parser1 succeeds. However, parserY runs parser2 only if parser1 *fails*. 
So Chain() means "do this and then do that", whereas Fork() means "do 
this or else do that". Where Chain() combines parsers into a sequential 
chain, Fork() combines parsers into a list of alternatives. For example:

   bool_parser = Fork(P_String("YES"), P_String("NO"));

This parser accepts either the word "YES" *or* the word "NO", and 
returns the word read. (If the input is anything else, it fails.)

So you can now see that when a parser "fails", this is a normal and 
ordinary part of how complex parsers work internally. "Let's parse this 
as an integer. Oh, that failed. OK, let's parse it as a variable. Oh, 
that failed too. OK, so let's parse it as a string literal..."

Notice that if *both* parsers fail, the overall parser fails, but the 
error messages are *merged*. (This is why Parsec errors are intricate 
data structures rather than simple text messages.) For example, if 
Fork(P_Char('X'), P_Char('Y')) were to fail, it might say

   Unexpected character 'J'; expecting 'X' or 'Y'.

The Parsec client gets this functionality automatically as a 
consequence of merely using Fork(). This makes for easy automatic error 
handling.
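A hypothetical Python sketch of Fork() with error merging (a parser is again a function from input to (ok, value-or-error, rest); errors are kept as lists precisely so they can be merged; this p_string cheats by checking the prefix directly rather than chaining P_Char parsers):

```python
def p_string(target):
    # Shortcut string parser; fails with a one-element error list.
    def run(s):
        if s.startswith(target):
            return (True, target, s[len(target):])
        return (False, ["expected %r" % target], s)
    return run

def fork(p1, p2):
    # Try p1; if it fails, try p2; if both fail, merge their errors.
    def run(s):
        ok1, v1, rest1 = p1(s)
        if ok1:
            return (True, v1, rest1)
        ok2, v2, rest2 = p2(s)
        if ok2:
            return (True, v2, rest2)
        return (False, v1 + v2, s)   # both failed: merged error list
    return run

bool_parser = fork(p_string("YES"), p_string("NO"))
```

On input "MAYBE" the merged error lists both expectations, which is how the "expecting 'X' or 'Y'" messages arise automatically.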

Again, Fork() is associative like Chain() is:

   Fork(Fork(x, y), z) === Fork(x, Fork(y, z))

And again, I'm going to write "x <|> y" instead of "Fork(x, y)" from now on.

Using this notation, I can write

   colour_parser =
     P_String("RED")
     <|>
     P_String("GREEN")
     <|>
     P_String("BLUE");

This accepts any of the given words, or it fails (with a message like 
"expecting 'RED' or 'GREEN' or 'BLUE'"; the P_String() shortcut 
prettifies the error messages slightly).

By using Chain() and Fork() together, I can define parsers such as

   expression_parser =
     P_Char('X')
     <|>
     (P_Char('[') >> expression_parser >> P_Char(']'));

This parser will accept any of the following strings:

   X
   [X]
   [[[X]]]
   [[[[[[[[X]]]]]]]]

So, basically, an "X" surrounded by any number of square brackets - but 
the open and closing brackets *must* match up! (And if they don't, 
you'll get a helpful error message.) This is slightly less trivial than 
the parsers we've seen so far, but still hardly earth-shattering.


## Optional parsing ##

Take a look at the following parser:

   P_Char('X') <|> P_Succeed('?')

If the next character is an "X", the first parser succeeds and returns 
an "X". However, if it fails, Fork() / "<|>" will run the second parser, 
which doesn't even *look* at the input, it just returns "?".

This turns out to be quite common, so Parsec provides a shortcut:

   function P_Optional(parser, default)
   {
     return Fork(parser, P_Succeed(default));
   }

Using this, I can define the following:

   bracket_parser =
     P_Char('[') >> expression_parser >> P_Char(']');

   expression_parser =
     (P_Char('X') <|> bracket_parser) >>
     P_Optional(expression_parser, "");

This is almost the same as the previous expression parser, except for 
the P_Optional() part. But that part means that now all of the following 
strings will be accepted:

   X
   XXX
   X[X]X
   [[[X[X]X]]]
   [[[X]]]X[[[[X]]]]
   [X[X[[X]]X]X]X

In other words, *any expression* composed of X's and balanced square 
brackets.

This, then, is something that a trivial string compare could not do. 
(Although it wouldn't be hard to manually implement such a parser.)


## Multiple parsing ##

Another thing we could very well want to do is parse multiple copies of 
something. (E.g., a "number" consists of multiple "digits".) This is 
also extremely common, so Parsec provides a shortcut:

   function P_Many1(parser1)
   {
     return (parser1 >> P_Optional(P_Many1(parser1), []));
   }

This parses one of something, and then optionally parses some more of 
them by recursively calling itself. This lets you parse >= 1 of 
something; another Parsec shortcut allows >= 0 of something. (Beware of 
infinite loops! This parser must be used cautiously.) There is also a 
variant for parsing exactly N of something.

Another common requirement is to parse a sequence of things with 
separators/terminators, and Parsec provides shortcuts for those too.
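A hypothetical Python sketch of the ">= 1 repetitions" shortcut, in the same (ok, value-or-error, rest) encoding. Since Python evaluates strictly, the recursion is written as a loop:

```python
def p_char(c):
    # Minimal single-character parser.
    def run(s):
        if s and s[0] == c:
            return (True, c, s[1:])
        return (False, "expected %r" % c, s)
    return run

def p_many1(p):
    # Parse >= 1 occurrences of p, collecting the results in a list.
    # Beware: if p can succeed without consuming input, this loops forever.
    def run(s):
        ok, v, rest = p(s)
        if not ok:
            return (False, v, rest)   # must match at least once
        values = [v]
        while True:
            ok, v, rest2 = p(rest)
            if not ok:
                return (True, values, rest)
            values.append(v)
            rest = rest2
    return run

a_parser = p_many1(p_char('a'))
```

So a_parser accepts "aaab" (returning ['a', 'a', 'a'] and leaving "b" unconsumed) but rejects "b".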


## The SuperChain() combinator ##

So far, most of our parsers don't "return" any interesting data. They 
succeed for some inputs and fail for others, but they don't return much 
data to the caller. (Typically just the last character of input or some 
such.) How about we remedy that?

Consider the following parser:

   parser1 >> parser2

Of course, this is really

   Chain(parser1, parser2)

And I already told you that Chain() will execute the first parser, 
*throw the result away* and then execute the second parser. Well suppose 
we'd actually like a parser that returns *both* results. How do we do that?

Well, it's impossible using Chain(). But there is another combinator 
called SuperChain() that works in a slightly different way:

   SuperChain( <parser> , <function> )

It first runs the parser. If that fails, the whole thing fails, as 
before. But assuming the parser succeeds, SuperChain() takes the 
parser's result and passes it to the function. The function is required 
to return another parser, which SuperChain() then runs.

This has two effects. The first effect is that the function can look at 
the token (or whatever) just read, and perform some completely arbitrary 
algorithm to decide what to do next. It then returns the parser it would 
like to be run next. This allows the implementation of 
"context-sensitive grammars".

For example, in the TeX language, usually commands start with "\". 
However, this is configurable. It can be changed half way through a 
file! (The standard "\verb" command actually does this.) But using 
Parsec, you can write a parser which will be able to cope if the 
character is changed; just write a function that returns a different 
parser! [Actually this is a simplification, but you get the flavour.]

The other thing it allows is a solution to our problem above: we want to 
run two parsers and return both of their results. This can be solved in 
the following way:

   parserX = SuperChain(parser1, foo);

   function foo(x)
   {
     function bar(y)
     {
       return P_Succeed([x, y]);
     }

     return SuperChain(parser2, bar);
   }

When parserX is run, it will run parser1 and pass its result to foo(). 
foo() then returns a SuperChain() parser which runs parser2 and passes 
its result to bar(). Notice that bar() is a nested function within 
foo(), and hence can refer to "x", which is foo's function argument. The 
bar() function simply returns a P_Succeed parser which returns the data 
we want.

In other words, parserX runs parser1 and then parser2 and returns both 
of their results. (Not forgetting that either of these parsers might 
fail, producing an error description, and the flow control is invisibly 
handled by SuperChain() in the same way as Chain() does.)
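The same two-parser trick, sketched in hypothetical Python (a parser being a function from input to (ok, value-or-error, rest)); notice how bar closes over foo's argument x, exactly as described:

```python
def p_char(c):
    # Minimal single-character parser.
    def run(s):
        if s and s[0] == c:
            return (True, c, s[1:])
        return (False, "expected %r" % c, s)
    return run

def p_succeed(x):
    # Always succeeds with the fixed value x, consuming nothing.
    return lambda s: (True, x, s)

def super_chain(p, f):
    # Run p; pass its result to f, which returns the parser to run next.
    def run(s):
        ok, v, rest = p(s)
        if not ok:
            return (False, v, rest)
        return f(v)(rest)
    return run

def foo(x):
    def bar(y):
        return p_succeed([x, y])   # both results, as the overall value
    return super_chain(p_char('B'), bar)

parserX = super_chain(p_char('A'), foo)
```

Running parserX on "AB" succeeds with ['A', 'B']; a failure in either inner parser propagates out unchanged.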

That seems like an awful lot of work just to perform such a trivial 
task! Fortunately, since this kind of thing is very common in Haskell, 
it has some special syntax sugar for making it very easy to write. I 
won't use the exact syntax, but suffice it to say the above can be 
written to look something like

   parserX =
     begin
       parser1 --> x;
       parser2 --> y;
       P_Succeed([x, y]);
     end;

which is obviously a lot clearer. Basically

   ...
   parser1;
   parser2;
   parser3;
   ...

becomes calls to Chain(), while

   ...
   parser1 --> x;
   parser2 --> y;
   parser3 --> z;
   ...

becomes calls to SuperChain() with auxiliary functions generated in 
between. (You don't need to comprehend the precise algorithm, but 
trust me, it works.)

Using this technology, we can write something like

   bracket_parser =
     begin
       P_Char('[');
       expression_parser --> expr;
       P_Char(']');
       P_Succeed([expr]);
     end;

   X_parser =
     begin
       P_Char('X');
       P_Succeed(['X']);
     end;

   term_parser = X_parser <|> bracket_parser;

   expression_parser =
     begin
       term_parser                       --> term1;
       P_Optional(expression_parser, []) --> term2;
       P_Succeed(term1.join(term2));
     end;

This will parse a string such as "X[X]X" and return a list object like 
['X',['X'],'X']. And so, finally, we have a parser that generates a 
nontrivial data structure as its result.
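As a sanity check, here is that grammar written out as mutually recursive Python functions, in the same hypothetical (ok, value, rest) encoding (the inline "try X, else try a bracket" fallback plays the role of Fork()):

```python
def x_parser(s):
    # An 'X' parses to the one-element list ['X'].
    if s and s[0] == 'X':
        return (True, ['X'], s[1:])
    return (False, "expected 'X'", s)

def bracket_parser(s):
    # '[' expression ']' parses to the expression wrapped in a list.
    if not (s and s[0] == '['):
        return (False, "expected '['", s)
    ok, expr, rest = expression_parser(s[1:])
    if not ok:
        return (False, expr, rest)
    if rest and rest[0] == ']':
        return (True, [expr], rest[1:])
    return (False, "expected ']'", rest)

def expression_parser(s):
    # One term (X or bracketed expression), then an optional tail.
    ok, term1, rest = x_parser(s)
    if not ok:
        ok, term1, rest = bracket_parser(s)
    if not ok:
        return (False, term1, rest)
    ok, term2, rest2 = expression_parser(rest)   # P_Optional, default []
    if ok:
        return (True, term1 + term2, rest2)      # term1.join(term2)
    return (True, term1, rest)
```

Parsing "X[X]X" indeed produces the nested list ['X', ['X'], 'X'], while unbalanced input such as "[X" fails.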


## Caution ##

Suppose you wanted to write a Lambda expression parser. The grammar for 
that looks like this:

   <expression> =
     <var name> |
     "(" <expression> ")" |
     <expression> <expression> |
     "\" <var name> "->" <expression>

You might attempt to implement this as

   expression_parser =
     ...
     <|>
     (
       begin
         expression_parser --> expr1;
         expression_parser --> expr2;
         P_Succeed([expr1, expr2]);
       end;
     )
     <|>
     ...

Don't do that! You have just implemented a "left-recursive grammar", 
which Parsec cannot handle. You must first refactor it to avoid 
left-recursion. To understand why, consider this:

   OK, so how do I parse an expression? OK, well one of the parse rules 
says to start by parsing an expression. OK, so how do I parse an expression?

The parser will sit in an infinite loop, frantically chasing its tail 
and making no progress. The correct solution here is to parse one 
expression and then use P_Optional() to attempt to parse a second one, 
if there is one.

   term_parser = ...

   expression_parser =
     begin
       term_parser --> term1;
       P_Optional(expression_parser, empty_expression) --> term2;
       ...combine terms...
     end;

Notice how the *first* line parses a single term, while the *second* 
line recursively parses an entire expression. That way expressions of 
arbitrary size are parsable, but the infinite recursion is avoided.


## Error messages ##

Suppose you wrote an expression parser and tried to parse "123xyz". You 
might get an error like

   "Error on line 1, column 4: Unexpected character 'x'; expecting '0', 
'1', '2', '3', '4', '5', '6', '7', '8' or '9'."

While this error is technically correct, it's a tad verbose. Once you 
start writing parsers for sophisticated grammars, you get error 
messages indicating that all sorts of punctuation marks would be legal 
here, and it can become a little overwhelming trying to figure out what 
the parser is actually unhappy about.

To tame this, Parsec provides the P_Name() function, which attaches a 
human-readable name to a parser. So for example:

   digit_parser =
     P_Name(
       P_Char('0') <|> P_Char('1') <|> ...,
       "digit"
     );

   number_parser = P_Many1(digit_parser);

Now you get the following error message:

   "Error on line 1, column 4: Unexpected character 'x'; expecting digit."

This is much better. In fact, if your "number" parser accepts fractional 
values, you might get

   "Error on line 1, column 4: Unexpected character 'x'; expecting 
digit, '.' or 'e'."

If you like that, fine. If you'd prefer something shorter, you could put 
P_Name() around number_parser instead (or as well!) to give "expecting 
number literal" or something.

The net result is that when some complex parser fails, you can end up 
with something like

   "Unexpected character '#'; expecting variable name, number literal or 
';'."

(Haskell has to go one better by making P_Name() an *operator* with 
really low precedence, so you're not forced to put your whole parser in 
brackets.)


## Backtracking ##

In the interests of efficiency, Parsec doesn't actually work in 
*exactly* the way I described. Specifically, "backtracking" is 
*disabled* by default, and you must explicitly enable it wherever you 
want it. This enables the implementation of "infinite look-ahead 
grammars", but with the associated overhead only where necessary.

What does all that lot actually mean? Well it means that Fork() doesn't 
work exactly like I described. How it actually works is this:

- The first parser is executed.
- If it succeeds, return its result.
- If it fails *without consuming any input*, run the second parser.
- If the first parser consumes some input and *then* fails, the second 
parser is not tried, and the whole thing fails.

What this means is that Fork() doesn't have to "remember" where the 
first parser started from, and "rewind" the input stream before starting 
the second parser. In particular, this "remembering" process would 
prevent the GC from freeing the memory being used by the input string, 
since Fork() would still hold a pointer to it.

Basically Fork() assumes that a parser can tell from the very first 
character of input whether it's going to succeed or fail. And if it 
fails part-way through, that's a real parsing error. (E.g., a number 
parser encountering "123xyz" would be an error.) It will not "backtrack" 
and run the second parser if the first one fails after consuming input. 
The first parser must fail immediately for that to happen.

But what if you *want* backtracking? Well, then you use the P_Try() 
primitive. This takes a parser and returns a parser. It runs that parser 
in such a way that if it fails, P_Try() makes it "look like" it consumed 
no input. In other words, P_Try() does the "remembering" part (with the 
associated cost in RAM usage).

This way, Fork() uses no extra RAM by default, and if you want 
backtracking (and the inevitable increase in RAM usage), you can 
explicitly ask for P_Try().

You only pay for what you use.
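A hypothetical Python sketch of the consumed-input rule, detecting consumption by comparing input lengths (same invented (ok, value, rest) encoding as the earlier pseudocode). The shared prefix of "HELLO" and "HELP" shows why P_Try() is needed:

```python
def p_string(target):
    # Matches target character by character, so it can fail part-way in.
    def run(s):
        for i, c in enumerate(target):
            if i >= len(s) or s[i] != c:
                return (False, "expected %r" % target, s[i:])  # consumed i chars
        return (True, target, s[len(target):])
    return run

def fork(p1, p2):
    # Parsec-style Fork(): only try p2 if p1 failed WITHOUT consuming input.
    def run(s):
        ok, v, rest = p1(s)
        if ok or len(rest) != len(s):   # success, or failed after consuming
            return (ok, v, rest)
        return p2(s)
    return run

def p_try(p):
    # On failure, pretend no input was consumed, re-enabling the alternative.
    def run(s):
        ok, v, rest = p(s)
        return (ok, v, rest if ok else s)
    return run

bad  = fork(p_string("HELLO"), p_string("HELP"))
good = fork(p_try(p_string("HELLO")), p_string("HELP"))
```

On input "HELP!", bad fails outright: the "HELLO" branch consumes "HEL" before failing, so the "HELP" branch is never tried. Wrapping it in p_try makes good succeed.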


## Assorted notes ##

Note that SuperChain() takes an arbitrary *function* as its second 
argument. That function can perform an arbitrarily complex algorithm 
before returning a parser for SuperChain() to run. This means that you 
can write Turing-complete parsers.

In fact, instead of writing an expression parser that *returns* an 
expression, you could in fact write an expression "parser" that actually 
*executes* the expression as it parses it. (Whether you want to do that 
or not depends on what you're attempting to achieve.)

It also means things like being able to select parsers based on input 
already seen. Want a parser that can detect when an undeclared variable 
name appears? Parsec makes it quite easy. I won't go into exactly how, 
but suffice it to say that Parsec can track user-defined data along with 
its own internal state data (e.g., current line number, etc).


# Lies! #

OK, yeah, I admit it. I lied! I said there would be no Haskell syntax, 
but ">>" actually is the real name of the real Haskell function that 
works exactly the way I described. And SuperChain() is actually called 
">>=". And P_Succeed() is called "return". (Notice that this is what 
most parsers use it for: to change the parser's return value.) Finally, 
P_Fail() is actually just "fail".

Together, (>>), (>>=) and (return) comprise Haskell's much-feared 
"monad" construct. If you understood the explanation of SuperChain(), 
you just understood how monads work!

Parsec parsers form a monad, and Parsec is therefore a *monadic* 
combinator library. (A very common form.)

The reason that Haskell has the magical syntax for using SuperChain() is 
that in Haskell, I/O is *also* a monad, and I/O is a Rather Common Task. 
(!) Hence the syntax sugar.

A surprising number of other things turn out to be monads as well - even 
things that don't seem remotely similar. But that's a topic for another 
day...


# Financial contracts #

Lest anybody think that combinator libraries are only good for building 
parsers, let me *briefly* present a second example.

Some while ago, apparently a group of Haskell hackers examined the 
problem of describing and valuing financial contracts. Their results 
evidently caused a bit of a stir in the financial community. (I don't 
know to what extent the authors are just blowing their own trumpet.)

What they wrote is a Haskell combinator library for combining contracts. 
They started with the idea that a contract has a financial value 
(negative or positive). Then they added a combinator that makes the 
value of a contract change at a certain point in time.

Next they added a combinator that makes a new contract whose value at 
any time is equal to the sum of the values of its two subcontracts at 
that time. (So, this is similar to Chain().) And then they added a 
combinator that allows you to select one contract *or* the other, but 
not both. (So this is like Fork().)

It should be easy to see that with these few simple operations, it's 
quite easy to build highly complex contracts. It should also be utterly 
clear that algorithmically determining a contract's value at any point 
in time, given any particular combination of selections, is a pretty 
simple task.
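To make that concrete, here is a toy Python sketch of the idea. The names and the exact combinators are invented for illustration; they are not the authors' actual library. A contract is modelled as a function from time to value:

```python
def constant(amount):
    # A contract worth a fixed amount at all times.
    return lambda t: amount

def shift(date, before, after):
    # The combinator that changes a contract's value at a point in time.
    return lambda t: before(t) if t < date else after(t)

def both(c1, c2):
    # Sum of two subcontracts at every time (the Chain()-like combinator).
    return lambda t: c1(t) + c2(t)

def choose(pick_first, c1, c2):
    # One contract OR the other, given a selection (the Fork()-like one).
    return lambda t: c1(t) if pick_first else c2(t)

# A pay-100 contract plus a penalty of 50 that kicks in at time 10:
deal = both(constant(100.0), shift(10, constant(0.0), constant(-50.0)))
```

Valuing deal at any time is just a function call, which is the point: once contracts are built from a few combinators, querying their value becomes mechanical.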

The authors claim that their work was hailed as "visionary". Apparently 
no system had hitherto existed for clearly and precisely stating the 
properties of contracts in an abstract and mathematically rigorous way.

I don't know if any of that is true, but I *do* know that when you're 
trying to solve highly complex problems, being able to precisely define 
what the problem actually *is* in the first place would be a very 
significant step forward.



From: scott
Subject: Re: Very long post
Date: 23 Sep 2008 08:20:00
Message: <48d8def0$1@news.povray.org>
> OK, so the muse has taken me. I want to write something. I realise nobody 
> here really cares, but if you're bored maybe you could take a read and 
> tell me what you think.

I read it, very interesting seeing as I had no idea how a parser works (I 
always wondered how you would go about writing one).

So essentially, you could "just" make one parser object, that consisted of 
many many complex parser objects, to parse something like a POV SDL file? 
Then you would just chuck the POV file at the top level parser object, and 
it would return a nice tree structure of the SDL code?  Hmmmm.



From: Invisible
Subject: Re: Very long post
Date: 23 Sep 2008 08:51:18
Message: <48d8e646$1@news.povray.org>
scott wrote:

> I read it, very interesting seeing as I had no idea how a parser works 
> (I always wondered how you would go about writing one).

The more "usual" approach is something based around a finite state 
machine - but as I just explained, Parsec works differently. (And 
indeed it will parse things a FSM couldn't parse - not that you'll 
often need that.)

> So essentially, you could "just" make one parser object, that consisted 
> of many many complex parser objects, to parse something like a POV SDL 
> file? Then you would just chuck the POV file at the top level parser 
> object, and it would return a nice tree structure of the SDL code?  Hmmmm.

Pretty much, yeah. SDL is complicated by #declare and #macro though. ;-)

For a "normal" programming language, the traditional approach is this:

1. A "tokeniser" takes the input string and splits it into "tokens", 
possibly decorating them slightly. So now you have a flat stream of 
tokens instead of just characters.

2. A "parser" builds a tree-like structure out of the token list, using 
the syntax rules of the language.

3. You pass this tree (an "abstract syntax tree") to a compiler or 
interpreter or whatever the hell it is your program wants to do with 
the thing.

Parsec allows you to tokenise and then parse, or do the whole thing in 
one pass. It's up to you. (I didn't mention it, but Parsec supports 
running parsers over lists of *anything*, not just characters. So you 
could in theory parse raw binary if you wanted...)

POV-Ray SDL is complicated because it doesn't have *functions*, it has 
*macros*, which expand at the token level, not the syntax tree level. 
(So, e.g., the contents of a macro can be an incomplete expression.)

Theoretically Parsec can cope with that - but in practice I'm not sure 
how ugly the code would be. IIRC there's a formal grammar available 
online somewhere, right? Some appendix to the user guide or something? 
Maybe I could have a go at implementing it with Parsec. (It'll be 
*large* though! POV-Ray has a lot of keywords...)



From: Invisible
Subject: Re: Very long post
Date: 23 Sep 2008 11:37:24
Message: <48d90d34@news.povray.org>
Just for giggles, I started trying to implement this in Java. But I 
quickly ended up with 13 classes, almost all of which are about 5 lines 
long.

Parsers are an abstract class. Each primitive parser is a concrete 
subclass. Each parser combinator is a subclass. The parser result is an 
abstract class. Each possible parser result structure is a concrete 
subclass. Parser functions are an interface... My, my, what a lot of 
classes we have there! o_O

It would almost be easier in C++, because then all these tiny classes 
could all be in the same source file. (Plus C++ has the "friend" 
feature, which is eminently useful here.) OTOH, I have no idea how to do 
memory management yet, so that's basically the end of that...



From: Darren New
Subject: Re: Very long post
Date: 23 Sep 2008 11:48:13
Message: <48d90fbd$1@news.povray.org>
Invisible wrote:

I read it all the way through too. Nicely written, good explanation. I'd 
add a bit at the end that points to the online resources you'd need to 
learn it more completely.  (A few spelling mistakes and a dropped word 
or two also, but that's trivia. :-)

> scott wrote:
> 
>> I read it, very interesting seeing as I had no idea how a parser works 
>> (I always wondered how you would go about writing one).
> 
> The more "usual" approach is something based around a finite state 
> machine - but as I just explained, Parsec works differently. (And 
> indeed it will parse things a FSM couldn't parse - not that you'll 
> often need that.)

Actually, that's not *quite* right.

An FSM will parse what a regular expression parses, which doesn't 
include (for example) matching nested brackets, because there's no 
arbitrarily-large storage area to hold the count of nested brackets.

A "push-down automaton" (or PDA) is one common way of implementing a 
parser. It's a FSM *with* a stack. This is the sort of thing YACC / 
Bison implements. People use it because you can algorithmically generate 
the FSM from the BNF grammar of the language you want to parse.

What Haskell is using is called a recursive-descent parser: the parser 
basically uses the run-time stack in place of an explicit stack. It's 
best suited to LL(1) grammars, i.e., languages you can parse by running 
strictly from left to right, looking only one token ahead, so you don't 
need the "P_Try()" function. Some languages (like Pascal) are explicitly 
designed this way to make it easy to build a recursive descent parser.

> 1. A "tokeniser" takes the input string and splits it into "tokens", 
> possibly decorating them slightly. So now you have a flat stream of 
> tokens instead of just characters.

The advantage of doing it this way is most tokens can be parsed with a 
regular expression, which is much faster (and leads to *much* smaller 
push-down state machines) than doing it with the same PDA as used by the 
rest of the grammar.

> Parsec allows you to parse and then tokenise, or do the whole thing in 
> one pass. It's up to you. (I didn't mention it, but Parsec supports 
> running parsers over lists of *anything*, not just characters. So you 
> could in theory parse raw binary if you wanted...)

Or you could parse a list of tokens coming from the tokenizer, which is 
what I imagine the authors were thinking. :-)
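
For what it's worth, a tokeniser of the kind being discussed fits in a 
few lines of plain Haskell. (The Token type and all names here are 
invented for illustration.)

```haskell
import Data.Char (isDigit, isSpace)

-- A flat token stream for a toy arithmetic language.
data Token = TNum Int | TPlus | TLParen | TRParen
  deriving (Show, Eq)

-- Each token is recognisable by a regular-expression-strength rule;
-- no stack is needed at this stage.
tokenise :: String -> [Token]
tokenise [] = []
tokenise (c:cs)
  | isSpace c = tokenise cs                     -- skip whitespace
  | c == '+'  = TPlus   : tokenise cs
  | c == '('  = TLParen : tokenise cs
  | c == ')'  = TRParen : tokenise cs
  | isDigit c = let (ds, rest) = span isDigit (c:cs)
                in TNum (read ds) : tokenise rest
  | otherwise = error ("unexpected character: " ++ [c])
```

E.g. tokenise "(1 + 23)" yields [TLParen, TNum 1, TPlus, TNum 23, 
TRParen], and a parser can then consume that list instead of raw 
characters.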

-- 
Darren New / San Diego, CA, USA (PST)


Post a reply to this message

From: Invisible
Subject: Re: Very long post
Date: 23 Sep 2008 12:00:21
Message: <48d91295$1@news.povray.org>
Darren New wrote:

> I read it all the way through too. Nicely written, good explanation.

Thanks. :-)

> I'd 
> add a bit at the end that points to the online resources you'd need to 
> learn it more completely.  (A few spelling mistakes and a dropped word 
> or two also, but that's trivia. :-)

Given that I'm the only person here who knows how to read Haskell 
syntax, a link to the Parsec user manual seemed... a little pointless.

But in case anybody cares:

http://legacy.cs.uu.nl/daan/download/parsec/parsec.html

That's the user's guide. The (very terse!) API reference is here:

http://haskell.org/ghc/docs/latest/html/libraries/

These are all the semi-standard Haskell libraries. The one you're 
looking for is Text.ParserCombinators.Parsec. (It's split into several 
modules.)

>> The more "usual" approach is something based around a finite state 
>> machine - but as I just explained, Parsec works differently. 
>> (And indeed will parse things the FSM couldn't parse - not that 
>> you'll often need that.)
> 
> Actually, that's not *quite* right.
> 
> An FSM will parse what a regular expression parses, which doesn't 
> include (for example) matching nested brackets, because there's no 
> arbitrarily-large storage area to hold the count of nested brackets.
> 
> A "push-down automaton" (or PDA) is one common way of implementing a 
> parser. It's a FSM *with* a stack. This is the sort of thing YACC / 
> Bison implements. People use it because you can algorithmically generate 
> the FSM from the BNF grammar of the language you want to parse.
> 
> What Haskell is using is called a recursive-descent parser: the parser 
> basically uses the run-time stack in place of an explicit stack. It's 
> best for LL(1) grammars - "Left-to-right, Leftmost derivation, 1 token 
> of lookahead". I.e., a language you can parse by running strictly from 
> left to right, looking only one token ahead, so you don't need the 
> "P_Try()" function. Some languages (like Pascal) are explicitly 
> designed this way to make it easy to build a recursive-descent parser.

Heh. What do *I* know about parser theory? ;-)

The point is, Parsec will parse just about anything, although as you say 
it "works best" for LL(1) grammars. It's quite surprising how simple 
and easy to explain the library is - and yet it handles almost anything. 
(I don't know if I made it clear, but Parsec parsers are a state monad.)
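
To make the "state monad" remark concrete, here's a stripped-down 
parser type in that style: the state is just the unconsumed input, and 
failure is Maybe. (A toy sketch - Parsec's real type also tracks source 
positions and error messages.)

```haskell
-- The parser threads the unconsumed input as state; Nothing = failure.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s ->
    fmap (\(a, s') -> (f a, s')) (p s)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, s')  <- pf s
    (a, s'') <- pa s'
    Just (f a, s'')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> do
    (a, s') <- p s
    runParser (f a) s'

-- One primitive: consume a character satisfying a predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser $ \s -> case s of
  c:cs | ok c -> Just (c, cs)
  _           -> Nothing
```

So runParser (satisfy (== 'a')) "abc" gives Just ('a', "bc"), and 
do-notation sequences parsers just like any other monad.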

>> 1. A "tokeniser" takes the input string and splits it into "tokens", 
>> possibly decorating them slightly. So now you have a flat stream of 
>> tokens instead of just characters.
> 
> The advantage of doing it this way is most tokens can be parsed with a 
> regular expression, which is much faster (and leads to *much* smaller 
> push-down state machines) than doing it with the same PDA as used by the 
> rest of the grammar.

Indeed, the Parsec manual says something similar.

>> Parsec allows you to parse and then tokenise, or do the whole thing in 
>> one pass. It's up to you. (I didn't mention it, but Parsec supports 
>> running parsers over lists of *anything*, not just characters. So you 
>> could in theory parse raw binary if you wanted...)
> 
> Or you could parse a list of tokens coming from the tokenizer, which is 
> what I imagine the authors were thinking. :-)

Precisely that, yes.



From: Darren New
Subject: Re: Very long post
Date: 23 Sep 2008 12:22:05
Message: <48d917ad@news.povray.org>
Invisible wrote:
> Given that I'm the only person here who knows how to read Haskell 
> syntax, a link to the Parsec user manual seemed... a little pointless.

Oh. I *highly* suggest you put this up on your blog or your web site or 
something like that, for public consumption. Get comments from a lot of 
people, find jobs in Haskell and/or teaching because you can write like 
this, etc.

> Heh. What do *I* know about parser theory? ;-)

I don't know. It seemed like a fair amount. :-)  This is all stuff you 
learn in the first semester of compiler construction class.

> The point is, Parsec will parse just about anything, although as you say 
> it "works best" for LL(1) grammars. It's quite surprising how simple 
> and easy to explain the library is - and yet it handles almost anything. 
> (I don't know if I made it clear, but Parsec parsers are a state monad.)

Yes. I'm just saying it "works best" because an LL(1) grammar parses 
without ever needing to use P_Try().  I.e., you don't need anything like 
exceptions or input buffers to parse an LL(1) grammar using the 
programming language's stack: you need enough memory to hold one token, 
you can output results as soon as any production is recognised, you only 
need stack space proportional to the most deeply nested expression in 
your input, and you don't need any memory to outlast the parsing of the 
productions you're currently parsing.

Of course, once you add arbitrary function calls and arbitrary buffering 
(via P_Try and SuperCombiner) then you can parse lots more; it just 
takes more code and more memory and potentially global data structures.
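
The buffering point can be made concrete with a toy backtracking 
combinator: because the failed branch never destroyed its input, the 
alternative simply reruns on the saved string. (The names P, orElse and 
string are invented here; they're not Parsec's actual API.)

```haskell
-- A toy parser type: Nothing = failure, input is passed by value.
newtype P a = P { runP :: String -> Maybe (a, String) }

-- Try p; if it fails, fall back to q on the *original* input.
-- The "buffer" is just the input string we kept hold of.
orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \s -> case p s of
  Nothing -> q s
  ok      -> ok

-- Match an exact string.
string :: String -> P String
string w = P $ \s ->
  if take (length w) s == w
    then Just (w, drop (length w) s)
    else Nothing
```

So runP (string "for" `orElse` string "foo") "foo!" succeeds with 
Just ("foo", "!") even though the first branch matched two characters 
before failing.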

-- 
Darren New / San Diego, CA, USA (PST)



From: Darren New
Subject: Re: Very long post
Date: 23 Sep 2008 12:26:29
Message: <48d918b5@news.povray.org>
Invisible wrote:
> Just for giggles, I started trying to implement this in Java. But I 
> quickly ended up with 13 classes, almost all of which are about 5 lines 
> long.

Java is the kingdom of nouns. Functions are second-class citizens. Even 
downward funargs need to be associated with a class, even if it's 
anonymous and only has one function and no instance variables in it.

Of course implementing a functional application in Java is going to 
suck. :-)

http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom-of-nouns.html

-- 
Darren New / San Diego, CA, USA (PST)



From: Orchid XP v8
Subject: Re: Very long post
Date: 23 Sep 2008 13:21:47
Message: <48d925ab$1@news.povray.org>
>> Given that I'm the only person here who knows how to read Haskell 
>> syntax, a link to the Parsec user manual seemed... a little pointless.
> 
> Oh. I *highly* suggest you put this up on your blog or your web site or 
> something like that, for public consumption. Get comments from a lot of 
> people, find jobs in Haskell and/or teaching because you can write like 
> this, etc.

Heh. I could add it to my blog. Nobody on earth will ever actually read 
it, but sure, I can add it. ;-)

>> Heh. What do *I* know about parser theory? ;-)
> 
> I don't know. It seemed like a fair amount. :-)  This is all stuff you 
> learn in the first semester of compiler construction class.

Now there's a class that sounds interesting... Pity my "Computer 
Science" degree didn't involve anything remotely like that. ;-)

> Yes. I'm just saying it "works best" because an LL(1) grammar parses 
> without ever needing to use P_Try().  I.e., you don't need anything like 
> exceptions or input buffers to parse an LL(1) grammar using the 
> programming language's stack: you need enough memory to hold one token, 
> you can output results as soon as any production is recognised, you only 
> need stack space proportional to the most deeply nested expression in 
> your input, and you don't need any memory to outlast the parsing of the 
> productions you're currently parsing.
> 
> Of course, once you add arbitrary function calls and arbitrary buffering 
> (via P_Try and SuperCombiner) then you can parse lots more; it just 
> takes more code and more memory and potentially global data structures.

Fair enough. ;-)

Figuring out when to use P_Try() is actually quite subtle, by the way - 
especially where you have a language where variable names can't be 
reserved words. Parsec actually provides a small module to automatically 
build parsers for common types of expression and language grammars based 
on the options you supply (automates things like reserved word 
recognition, operator precedence, operator associativity, etc.)
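
To show what "precedence and associativity from an options table" 
means, here's a hand-rolled precedence-climbing sketch over single-digit 
operands. (All names here are invented for illustration; Parsec's own 
expression builder works from a similar table.)

```haskell
import Data.Char (digitToInt, isDigit)

data Assoc = L | R

-- The options table: symbol, precedence, associativity, semantics.
ops :: [(Char, Int, Assoc, Int -> Int -> Int)]
ops = [ ('+', 1, L, (+)), ('-', 1, L, (-))
      , ('*', 2, L, (*)), ('^', 3, R, (^)) ]

-- Precedence climbing: one token of lookahead, no backtracking.
expr :: Int -> String -> (Int, String)
expr minPrec s = loop lhs rest
  where
    (lhs, rest) = atom s
    loop acc (c:cs)
      | Just (prec, assoc, f) <- lookupOp c
      , prec >= minPrec =
          let nextMin = case assoc of L -> prec + 1; R -> prec
              (rhs, cs') = expr nextMin cs
          in loop (f acc rhs) cs'
    loop acc leftover = (acc, leftover)

atom :: String -> (Int, String)
atom (c:cs) | isDigit c = (digitToInt c, cs)
atom s = error ("expected a digit at: " ++ s)

lookupOp :: Char -> Maybe (Int, Assoc, Int -> Int -> Int)
lookupOp c = case [ (p, a, f) | (sym, p, a, f) <- ops, sym == c ] of
  x:_ -> Just x
  []  -> Nothing

eval :: String -> Int
eval s = case expr 0 s of
  (v, "")   -> v
  (_, junk) -> error ("leftover input: " ++ junk)
```

So eval "1+2*3" gives 7, and eval "2^3^2" gives 512 because '^' was 
declared right-associative in the table - change one entry and the 
parser's behaviour changes with it.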

-- 
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*



From: Darren New
Subject: Re: Very long post
Date: 23 Sep 2008 13:42:58
Message: <48d92aa2@news.povray.org>
Orchid XP v8 wrote:
> Heh. I could add it to my blog. Nobody on earth will ever actually read 
> it, but sure, I can add it. ;-)

You would be surprised. At least you can point people to it who 
interview you or something. You *are* looking for a job, yes?

> Now there's a class that sounds interesting... Pity my "Computer 
> Science" degree didn't involve anything remotely like that. ;-)

Don't forget I have an advanced degree. I don't think my undergraduate 
courses covered the theoretical stuff like that either.

> on the options you supply (automates things like reserved word 
> recognition, operator precedence, operator associativity, etc.)

I'm not surprised. :-)

-- 
Darren New / San Diego, CA, USA (PST)




Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.