POV-Ray : Newsgroups : povray.off-topic : Very long post
From: Invisible
Date: 23 Sep 2008 07:32:18
Message: <48d8d3c2$1@news.povray.org>
# Preface #

OK, so the muse has taken me. I want to write something. I realise 
nobody here really cares, but if you're bored maybe you could take a 
read and tell me what you think.

(Yes, I *know* Haskell sucks and only a total retard would use such a 
stupid and irrelevant language. I think we've covered that one. I'm not 
asking whether you think Haskell is cool, I'm asking what you think of 
my powers of exposition. Does what I've written *make sense*? That's the 
question I'm interested in.)


# Introduction #

There are many common and well-known idioms in OO languages. For 
example, look at any large project and you'll probably find that 
somewhere, somebody has written a class which is nearly complete except 
for a few abstract methods, in terms of which all the other 
functionality is defined. Then they wrote several concrete subclasses 
that work purely
by implementing these abstract methods in various ways.

It's such a common technique that it hardly needs describing. Everybody 
who's ever seen OOP will know what this is about.

Haskell is a much rarer language. It has its own idioms, most of which 
are far less well-known. One very widely used technique is the 
construction of "combinator libraries". Such a library consists of a 
small set of primitive "things", and a set of combinators for combining 
simple things into more complex things. Usually there will also be a 
collection of ready-made things which are either very commonly needed or 
non-obvious in construction.

An absolutely classic application of this technique is the construction 
of text parsers. I'd like to show off the technique by walking you 
through a high-level description of Haskell's "Parsec" library. (There 
will not be any actual Haskell syntax; this is just a description of 
concepts.)


# Parsec #

## Parsing stuff ##

Suppose you want to parse a text string to extract some kind of data 
from it. If you're working in something like Java, right now you're 
probably thinking something along the lines of designing some kind of 
"parser object" that has an "input buffer" and various methods for 
mutating the parser state. Maybe it's based on some sort of finite state 
machine?...

Parsec takes a completely different approach - and that is the subject 
of this text.

The Parsec library defines a "parser" data type. What this actually is 
or how it works doesn't matter; a Parsec client sees only the type name. 
Parsec also provides a Run() function which takes a parser and some text 
and "runs" the parser over it.

There are a few things to explain about parsers. First of all, when a 
parser is run, it can consume zero or more characters. That is, a parser 
can terminate immediately without consuming any input at all, or a 
single parser can swallow millions of characters.

Second, when you run a parser, it can "succeed" or "fail". When a parser 
succeeds, it returns an item of data. This can be of any data type 
(e.g., it could be an integer, a string, or a list of binary trees). 
When a parser "fails", this is not a fatal program error, it just means 
that the parser returns an internal Parsec error description data 
structure. This will become clearer later.

The Run() function returns a structure indicating whether parsing 
succeeded or failed. If successful, it returns whatever the parser 
itself returned. If failed, it returns an opaque error object. There is 
a function to conveniently transform this into human-readable text, but 
you can also programmatically query it (e.g., to get the line number of 
the error so you can highlight it on screen or something).


## The parsers ##

So, that's what a parser "is", in Parsec's terminology. Now, what 
parsers does Parsec actually implement? Well, it implements the following:

- P_Fail(msg) is a parser that always fails immediately, returning "msg" 
as the error message.
- P_Succeed(x) is a parser that always succeeds immediately (i.e., 
consuming 0 characters of input). It always returns the same value, "x".
- P_Any() is a parser that reads 1 character of input and returns that 
character. (It fails with a descriptive message if EOF.)
- P_Char(x) will read 1 character of input and return that character *if 
and only if* that character happens to be "x". Otherwise it fails with 
an error like "unexpected character 'J'; expected 'F'".

(The names are made-up, but Parsec really does have all of these.)

As you can see, these parsers are not so much "simple" as "trivial". 
Their internal implementation is slightly more complex than it would 
immediately seem, but even so, each one is only a few lines of Haskell.
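To make the primitives concrete, here is a toy sketch in Python. The names and the representation are mine, not Parsec's actual implementation: a parser is modelled as a function from the input text and a position to either a success triple or a failure triple.

```python
# Toy model: a parser is a function (text, pos) -> result, where result is
# (True, value, new_pos) on success or (False, message, pos) on failure.
# This is an illustrative sketch, NOT how Parsec is really implemented.

def p_fail(msg):
    """A parser that always fails immediately with the given message."""
    return lambda text, pos: (False, msg, pos)

def p_succeed(x):
    """A parser that always succeeds, consuming nothing and returning x."""
    return lambda text, pos: (True, x, pos)

def p_any():
    """Read one character of input and return it (fail on EOF)."""
    def run(text, pos):
        if pos < len(text):
            return (True, text[pos], pos + 1)
        return (False, "unexpected end of input", pos)
    return run

def p_char(c):
    """Read one character if and only if it equals c."""
    def run(text, pos):
        if pos >= len(text):
            return (False, "unexpected end of input; expected %r" % c, pos)
        if text[pos] == c:
            return (True, c, pos + 1)
        return (False, "unexpected %r; expected %r" % (text[pos], c), pos)
    return run

def run_parser(parser, text):
    """The Run() function: apply a parser to a string from position 0."""
    return parser(text, 0)
```

For instance, `run_parser(p_char('F'), "Java")` fails with a message like "unexpected 'J'; expected 'F'".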

So what can we *do* using these parsers? Well, not a lot, really. To do 
anything useful, we need some combinators...


## The Chain() combinator ##

The first combinator we're going to look at is something that I'm going 
to call Chain(). Consider, for example,

   parserX = Chain(parser1, parser2);

So Chain() takes two parsers and returns a parser (much like "+" takes 
two numbers and returns a number). But what does this new parser *do* 
when you run it?

- First, it runs parser1.
- If parser1 succeeds, its result is thrown away, and parser2 is run.
- If parser2 succeeds, its result is returned as the overall result of 
parserX.

In other words, it runs the first parser and then the second parser. But 
what if one or other parser fails?

- If parser1 fails, parser2 is not run at all, and parserX fails overall 
(with the same error message as parser1).
- If parser1 succeeds but parser2 fails, parserX fails overall (with the 
same error message as parser2).

It should be clear at this point that Chain() is associative - i.e., that

   parserX = Chain(Chain(x, y), z);
   parserY = Chain(x, Chain(y, z));

are two equivalent parsers. It should also be clear that Chain() 
therefore allows us to take an arbitrary number of parsers and "chain 
them together in sequence" to build one big parser that will run the 
component parsers one after another. And, finally, the first time any 
parser fails, all of the following parsers are automatically skipped and 
the overall parser fails. (A bit like throwing an exception - except 
that this is a library feature, not a hard-wired language construct.)
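In the same toy Python model (a parser is a function `(text, pos) -> (True, value, new_pos)` or `(False, message, pos)`; names are made up), Chain() is only a few lines:

```python
# Sketch of Chain() over the toy parser model (not Parsec's real code).

def p_char(c):
    def run(text, pos):
        if pos < len(text) and text[pos] == c:
            return (True, c, pos + 1)
        return (False, "expected %r" % c, pos)
    return run

def chain(parser1, parser2):
    """Run parser1; if it succeeds, discard its result and run parser2.
    The first failure aborts the whole chain, like an exception."""
    def run(text, pos):
        ok, val, pos2 = parser1(text, pos)
        if not ok:
            return (ok, val, pos2)     # propagate parser1's error
        return parser2(text, pos2)     # parser2's result wins
    return run

# Associativity: both groupings parse 'A' then 'B' then 'C' identically.
abc1 = chain(chain(p_char('A'), p_char('B')), p_char('C'))
abc2 = chain(p_char('A'), chain(p_char('B'), p_char('C')))
```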

Now, writing Chain(Chain(Chain(...))) gets tedious quickly, so from now 
on I'm going to use a simple shorthand. Instead of "Chain(x, y)", I'm 
going to just write "x >> y". Just remember that it means "execute x and 
then execute y" (with a few additional details, as per above).

Using this new Chain() / ">>" combinator, I can write

   P_Char('H') >> P_Char('E') >> P_Char('L') >> P_Char('P');

which represents a parser which succeeds if the input begins "HELP", and 
fails otherwise. (Note that this includes checking for EOF.) As it 
stands, however, when this parser succeeds it returns 'P', which isn't 
fantastically useful. However, consider the following trick:

   P_Char('H') >> P_Char('E') >> P_Char('L') >> P_Char('P') >> 
P_Succeed("HELP");

Recall that P_Succeed() always succeeds and returns the specified value. 
However, >> will only execute this if all the preceding parsers 
succeed. In other words, this parser does exactly the same thing as the 
previous one, except that it returns "HELP" instead of just 'P' as its 
result.

It turns out that parsers like the one above are extremely common, so 
Parsec provides us with a shortcut for writing them:

   function P_String(string)
   {
     out = P_Char(string[0]);

     for (int i=1; i < string.size(); i++)
       out = Chain(out, P_Char(string[i]));

     return Chain(out, P_Succeed(string));
   }

Now we can just say P_String("HELP") to generate the above parser. That 
should save us some typing!

So, we now have a parser that can read an arbitrary string, checking for 
EOF at each stage, and output readable error messages. (E.g., if you do 
P_String("HELP") and give it "HEAD", you get "error on line 1, column 3: 
unexpected character 'A'; expected 'L'".)

But this still isn't terribly exciting; a simple string comparison would 
do the same job!


## The Fork() combinator ##

The next combinator is called Fork(). It also takes a pair of parsers 
and returns a parser. However, just like "+" and "-" return different 
numbers, this combinator returns a slightly different parser:

   parserX = Chain(parser1, parser2);
   parserY = Fork(parser1, parser2);

As you know, parserX will run parser1 and then run parser2 only if 
parser1 succeeds. However, parserY runs parser2 only if parser1 *fails*. 
So Chain() means "do this and then do that", whereas Fork() means "do 
this or else do that". Where Chain() combines parsers into a sequential 
chain, Fork() combines parsers into a list of alternatives. For example:

   bool_parser = Fork(P_String("YES"), P_String("NO"));

This parser accepts either the word "YES" *or* the word "NO", and 
returns the word read. (If the input is anything else, it fails.)

So you can now see that when a parser "fails", this is a normal and 
ordinary part of how complex parsers work internally. "Let's parse this 
as an integer. Oh, that failed. OK, let's parse it as a variable. Oh, 
that failed too. OK, so let's parse it as a string literal..."

Notice that if *both* parsers fail, the overall parser fails, but the 
error messages are *merged*. (This is why Parsec errors are intricate 
data structures rather than simple text messages.) For example, if 
Fork(P_Char('X'), P_Char('Y')) were to fail, it might say

   Unexpected character 'J'; expecting 'X' or 'Y'.

The Parsec client gets this functionality automatically as a 
consequence of merely using Fork(). This makes for easy automatic error 
handling.
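Fork() with merged error messages can be sketched in the same toy Python model (made-up names; the real errors are structured data, not concatenated strings, and the backtracking subtleties discussed later are ignored here):

```python
# Sketch of Fork() over the toy model (text, pos) -> (True, value,
# new_pos) or (False, message, pos). Errors are merged as plain strings
# for illustration only.

def p_char(c):
    def run(text, pos):
        if pos < len(text) and text[pos] == c:
            return (True, c, pos + 1)
        return (False, "expected %r" % c, pos)
    return run

def fork(parser1, parser2):
    """Try parser1; if it fails, try parser2 instead.
    If both fail, merge the two error messages."""
    def run(text, pos):
        ok1, val1, pos1 = parser1(text, pos)
        if ok1:
            return (ok1, val1, pos1)
        ok2, val2, pos2 = parser2(text, pos)
        if ok2:
            return (ok2, val2, pos2)
        return (False, val1 + " or " + val2, pos)   # merged error
    return run

xy = fork(p_char('X'), p_char('Y'))
```

Running `xy` on "J..." fails with a merged message mentioning both 'X' and 'Y'.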

Again, Fork() is associative like Chain() is:

   Fork(Fork(x, y), z) === Fork(x, Fork(y, z))

And again, I'm going to write "x <|> y" instead of "Fork(x, y)" from now on.

Using this notation, I can write

   colour_parser =
     P_String("RED")
     <|>
     P_String("GREEN")
     <|>
     P_String("BLUE");

This accepts any of the given words, or it fails (with a message like 
"expecting 'RED' or 'GREEN' or 'BLUE'"; the P_String() shortcut 
prettifies the error messages slightly).

By using Chain() and Fork() together, I can define parsers such as

   expression_parser =
     P_Char('X')
     <|>
     (P_Char('[') >> expression_parser >> P_Char(']'));

This parser will accept any of the following strings:

   X
   [X]
   [[[X]]]
   [[[[[[[[X]]]]]]]]

So, basically, an "X" surrounded by any number of square brackets - but 
the open and closing brackets *must* match up! (And if they don't, 
you'll get a helpful error message.) This is slightly less trivial than 
the parsers we've seen so far, but still hardly earth-shattering.
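In the toy Python model, this recursive grammar comes out as a recursive function (Python resolves the self-reference lazily, at parse time rather than definition time; names are made up):

```python
# The bracket grammar:  expression = 'X'  <|>  '[' >> expression >> ']'

def p_char(c):
    def run(text, pos):
        if pos < len(text) and text[pos] == c:
            return (True, c, pos + 1)
        return (False, "expected %r" % c, pos)
    return run

def chain(p1, p2):
    def run(text, pos):
        ok, val, pos2 = p1(text, pos)
        return p2(text, pos2) if ok else (ok, val, pos2)
    return run

def fork(p1, p2):
    def run(text, pos):
        res = p1(text, pos)
        return res if res[0] else p2(text, pos)
    return run

def expression(text, pos):
    # The self-reference to `expression` is safe: it is only looked up
    # when the parser actually runs.
    p = fork(p_char('X'),
             chain(p_char('['), chain(expression, p_char(']'))))
    return p(text, pos)

def accepts(text):
    ok, _, end = expression(text, 0)
    return ok and end == len(text)   # require the whole input to match
```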


## Optional parsing ##

Take a look at the following parser:

   P_Char('X') <|> P_Succeed('?')

If the next character is an "X", the first parser succeeds and returns 
an "X". However, if it fails, Fork() / "<|>" will run the second parser, 
which doesn't even *look* at the input, it just returns "?".

This turns out to be quite common, so Parsec provides a shortcut:

   function P_Optional(parser, default)
   {
     return Fork(parser, P_Succeed(default));
   }

Using this, I can define the following:

   bracket_parser =
     P_Char('[') >> expression_parser >> P_Char(']');

   expression_parser =
     (P_Char('X') <|> bracket_parser) >>
     P_Optional(expression_parser, "");

This is almost the same as the previous expression parser, except for 
the P_Optional() part. But that part means that now all of the following 
strings will be accepted:

   X
   XXX
   X[X]X
   [[[X[X]X]]]
   [[[X]]]X[[[[X]]]]
   [X[X[[X]]X]X]X

In other words, *any expression* composed of X's and balanced square 
brackets.

This, then, is something that a trivial string compare could not do. 
(Although it wouldn't be hard to manually implement such a parser.)


## Multiple parsing ##

Another thing we could very well want to do is parse multiple copies of 
something. (E.g., a "number" consists of multiple "digits".) This is 
also extremely common, so Parsec provides a shortcut:

   function P_Many1(parser1)
   {
     return (parser1 >> P_Optional(P_Many1(parser1), []));
   }

This parses one of something, and then optionally parses some more of 
them by recursively calling itself. This lets you parse >= 1 of 
something; another Parsec shortcut allows >= 0 of something. (Beware of 
infinite loops! This parser must be used cautiously.) There is also a 
variant for parsing exactly N of something.

Another common requirement is to parse a sequence of things with 
separators/terminators, and Parsec provides shortcuts for those too.
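A toy Python sketch of the ">= 1" combinator, with the recursion unrolled into a loop and the results collected into a list rather than discarded (made-up names, not Parsec's implementation):

```python
# Sketch of P_Many1 over the toy model (text, pos) -> (True, value,
# new_pos) or (False, message, pos).

def p_digit():
    def run(text, pos):
        if pos < len(text) and text[pos].isdigit():
            return (True, text[pos], pos + 1)
        return (False, "expected digit", pos)
    return run

def p_many1(parser):
    """Parse one or more occurrences, returning a list of results.
    Beware: `parser` must consume input, or this loops forever."""
    def run(text, pos):
        ok, val, pos2 = parser(text, pos)
        if not ok:
            return (ok, val, pos2)        # at least one is required
        results = [val]
        while True:
            ok, val, pos3 = parser(text, pos2)
            if not ok:
                return (True, results, pos2)
            results.append(val)
            pos2 = pos3
    return run

number = p_many1(p_digit())
```

Running `number` on "123xyz" succeeds with `['1', '2', '3']`, stopping at the 'x'.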


## The SuperChain() combinator ##

So far, most of our parsers don't "return" any interesting data. They 
succeed for some inputs and fail for others, but they don't return much 
data to the caller. (Typically just the last character of input or some 
such.) How about we remedy that?

Consider the following parser:

   parser1 >> parser2

Of course, this is really

   Chain(parser1, parser2)

And I already told you that Chain() will execute the first parser, 
*throw the result away* and then execute the second parser. Well suppose 
we'd actually like a parser that returns *both* results. How do we do that?

Well, it's impossible using Chain(). But there is another combinator 
called SuperChain() that works in a slightly different way:

   SuperChain( <parser> , <function> )

It first runs the parser. If that fails, the whole thing fails, as 
before. But assuming the parser succeeds, SuperChain() takes the 
parser's result and passes it to the function. The function is required 
to return another parser, which SuperChain() then runs.

This has two effects. The first effect is that the function can look at 
the token (or whatever) just read, and perform some completely arbitrary 
algorithm to decide what to do next. It then returns the parser it would 
like to be run next. This allows the implementation of 
"context-sensitive grammars".

For example, in the TeX language, usually commands start with "\". 
However, this is configurable. It can be changed half way through a 
file! (The standard "\verb" command actually does this.) But using 
Parsec, you can write a parser which will be able to cope if the 
character is changed; just write a function that returns a different 
parser! [Actually this is a simplification, but you get the flavour.]

The other thing it allows is a solution to our problem above: we want to 
run two parsers and return both of their results. This can be solved in 
the following way:

   parserX = SuperChain(parser1, foo);

   function foo(x)
   {
     function bar(y)
     {
       return P_Succeed([x, y]);
     }

     return SuperChain(parser2, bar);
   }

When parserX is run, it will run parser1 and pass its result to foo(). 
foo() then returns a SuperChain() parser which runs parser2 and passes 
its result to bar(). Notice that bar() is a nested function within 
foo(), and hence can refer to "x", which is foo's function argument. The 
bar() function simply returns a P_Succeed parser which returns the data 
we want.

In other words, parserX runs parser1 and then parser2 and returns both 
of their results. (Not forgetting that either of these parsers might 
fail, producing an error description, and the flow control is invisibly 
handled by SuperChain() in the same way as Chain() does.)
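The foo()/bar() plumbing looks like this in the toy Python model, with the nested functions written as lambdas (made-up names; the model is `(text, pos) -> (True, value, new_pos)` or `(False, message, pos)`):

```python
# Sketch of SuperChain(): run a parser, then hand its result to a
# function that computes the NEXT parser to run.

def p_char(c):
    def run(text, pos):
        if pos < len(text) and text[pos] == c:
            return (True, c, pos + 1)
        return (False, "expected %r" % c, pos)
    return run

def p_succeed(x):
    return lambda text, pos: (True, x, pos)

def super_chain(parser, func):
    def run(text, pos):
        ok, val, pos2 = parser(text, pos)
        if not ok:
            return (ok, val, pos2)     # failure short-circuits, as in Chain()
        return func(val)(text, pos2)   # func inspects val, picks next parser
    return run

# "Run two parsers and return both results", as in the text:
def pair(parser1, parser2):
    return super_chain(parser1, lambda x:
           super_chain(parser2, lambda y:
           p_succeed([x, y])))

both = pair(p_char('A'), p_char('B'))
```

The inner lambda closes over `x` exactly as bar() closes over foo's argument.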

That seems like an awful lot of work just to perform such a trivial 
task! Fortunately, since this kind of thing is very common in Haskell, 
it has some special syntax sugar for making it very easy to write. I 
won't use the exact syntax, but suffice it to say the above can be 
written to look something like

   parserX =
     begin
       parser1 --> x;
       parser2 --> y;
       P_Succeed([x, y]);
     end;

which is obviously a lot clearer. Basically

   ...
   parser1;
   parser2;
   parser3;
   ...

becomes calls to Chain(), while

   ...
   parser1 --> x;
   parser2 --> y;
   parser3 --> z;
   ...

becomes calls to SuperChain() with auxiliary functions generated in 
between. (You don't need to comprehend the precise algorithm, but 
trust me, it works.)

Using this notation, we can write something like

   bracket_parser =
     begin
       P_Char('[');
       expression_parser --> expr;
       P_Char(']');
       P_Succeed([expr]);
     end;

   X_parser =
     begin
       P_Char('X');
       P_Succeed(['x']);
     end;

   term_parser = X_parser <|> bracket_parser;

   expression_parser =
     begin
       term_parser                       --> term1;
       P_Optional(expression_parser, []) --> term2;
       P_Succeed(term1.join(term2));
     end;

This will parse a string such as "[X[X]X]" and return a list object like 
['X',['X'],'X']. And so, finally, we have a parser that generates a 
nontrivial data structure as its result.
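The same parser can be sketched in the toy Python model, using bind (SuperChain) directly in place of the begin/end sugar. All names are made up, and the representation is `(text, pos) -> (True, value, new_pos)` or `(False, message, pos)`:

```python
def p_char(c):
    def run(text, pos):
        if pos < len(text) and text[pos] == c:
            return (True, c, pos + 1)
        return (False, "expected %r" % c, pos)
    return run

def p_succeed(x):
    return lambda text, pos: (True, x, pos)

def fork(p1, p2):
    def run(text, pos):
        res = p1(text, pos)
        return res if res[0] else p2(text, pos)
    return run

def bind(parser, func):            # the SuperChain() combinator
    def run(text, pos):
        ok, val, pos2 = parser(text, pos)
        return func(val)(text, pos2) if ok else (ok, val, pos2)
    return run

def p_optional(parser, default):
    return fork(parser, p_succeed(default))

def bracket(text, pos):
    # '[' expression ']'  -->  the inner list wrapped in one more list
    return bind(p_char('['), lambda _:
           bind(expression, lambda expr:
           bind(p_char(']'), lambda _:
           p_succeed([expr]))))(text, pos)

def x_term(text, pos):
    return bind(p_char('X'), lambda _: p_succeed(['X']))(text, pos)

def expression(text, pos):
    # one term, then optionally the rest of the expression; join results
    return bind(fork(x_term, bracket), lambda t1:
           bind(p_optional(expression, []), lambda t2:
           p_succeed(t1 + t2)))(text, pos)
```

Running `expression` on "X[X]X" yields the nested list `['X', ['X'], 'X']`.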


## Caution ##

Suppose you wanted to write a Lambda expression parser. The grammar for 
that looks like this:

   <expression> =
     <var name> |
     "(" <expression> ")" |
     <expression> <expression> |
     "\" <var name> "->" <expression>

You might attempt to implement this as

   expression_parser =
     ...
     <|>
     (
       begin
         expression_parser --> expr1;
         expression_parser --> expr2;
         P_Succeed([expr1, expr2]);
       end;
     )
     <|>
     ...

Don't do that! You have just implemented a "left-recursive grammar", 
which Parsec cannot handle. You must first refactor it to avoid 
left-recursion. To understand why, consider this:

   OK, so how do I parse an expression? OK, well one of the parse rules 
says to start by parsing an expression. OK, so how do I parse an expression?

The parser will sit in an infinite loop, frantically chasing its tail 
and making no progress. The correct solution here is to parse one 
expression and then use P_Optional() to attempt to parse a second one, 
if there is one.

   term_parser = ...

   expression_parser =
     begin
       term_parser --> term1;
       P_Optional(expression_parser, empty_expression) --> term2;
       ...combine terms...
     end;

Notice how the *first* line parses a single term, while the *second* 
line recursively parses an entire expression. That way expressions of 
arbitrary size are parsable, but the infinite recursion is avoided.


## Error messages ##

Suppose you write an expression parser and tried to parse "123xyz". You 
might get an error like

   "Error on line 1, column 4: Unexpected character 'x'; expecting '0', 
'1', '2', '3', '4', '5', '6', '7', '8' or '9'."

While this error is completely correct technically, it's a tad verbose. 
Once you start writing parsers for sophisticated grammars, you get error 
messages indicating that all sorts of punctuation marks would be legal 
here, but it can become a little overwhelming trying to figure out what 
the parser is actually unhappy about.

To tame this, Parsec provides the P_Name() function, which attaches a 
human-readable name to a parser. For example:

   digit_parser =
     P_Name(
       P_Char('0') <|> P_Char('1') <|> ...,
       "digit"
     );

   number_parser = P_Many1(digit_parser);

Now you get the following error message:

   "Error on line 1, column 4: Unexpected character 'x'; expecting digit."

This is much better. In fact, if your "number" parser accepts fractional 
values, you might get

   "Error on line 1, column 4: Unexpected character 'x'; expecting 
digit, '.' or 'e'."

If you like that, fine. If you'd prefer something shorter, you could put 
P_Name() around number_parser instead (or as well!) to give "expecting 
number literal" or something.

The net result is that when some complex parser fails, you can end up 
with something like

   "Unexpected character '#'; expecting variable name, number literal or 
';'."

(Haskell has to go one better by making P_Name() an *operator* with 
really low precedence, so you're not forced to put your whole parser in 
brackets.)
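In the toy Python model, P_Name() is just a wrapper that swallows the inner parser's failure message and substitutes the given name (made-up names; real Parsec errors are structured, not strings):

```python
# Sketch of P_Name over the toy model (text, pos) -> (True, value,
# new_pos) or (False, message, pos).

def p_char(c):
    def run(text, pos):
        if pos < len(text) and text[pos] == c:
            return (True, c, pos + 1)
        return (False, "expected %r" % c, pos)
    return run

def fork(p1, p2):
    def run(text, pos):
        ok1, v1, q1 = p1(text, pos)
        if ok1:
            return (ok1, v1, q1)
        ok2, v2, q2 = p2(text, pos)
        if ok2:
            return (ok2, v2, q2)
        return (False, v1 + " or " + v2, pos)   # merged errors
    return run

def p_name(parser, name):
    """On failure, replace the sprawling merged message with `name`."""
    def run(text, pos):
        ok, val, pos2 = parser(text, pos)
        if ok:
            return (ok, val, pos2)
        return (False, "expecting " + name, pos2)
    return run

verbose = fork(p_char('0'), fork(p_char('1'), p_char('2')))
digit = p_name(verbose, "digit")
```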


## Backtracking ##

In the interests of efficiency, Parsec doesn't actually work in 
*exactly* the way I described. Specifically, "backtracking" is 
*disabled* by default, and you must explicitly enable it wherever you 
want it. This enables the implementation of "infinite look-ahead 
grammars", but with the associated overhead only where necessary.

What does all that lot actually mean? Well it means that Fork() doesn't 
work exactly like I described. How it actually works is this:

- The first parser is executed.
- If it succeeds, return its result.
- If it fails *without consuming any input*, run the second parser.
- If the first parser consumes some input and *then* fails, the second 
parser is not tried, and the whole thing fails.

What this means is that Fork() doesn't have to "remember" where the 
first parser started from, and "rewind" the input stream before starting 
the second parser. In particular, this "remembering" process would 
prevent the GC from freeing the memory being used by the input string, 
since Fork() would still hold a pointer to it.

Basically Fork() assumes that a parser can tell from the very first 
character of input whether it's going to succeed or fail. And if it 
fails part-way through, that's a real parsing error. (E.g., a number 
parser encountering "123xyz" would be an error.) It will not "backtrack" 
and run the second parser if the first one fails after consuming input. 
The first parser must fail immediately for that to happen.

But what if you *want* backtracking? Well, then you use the P_Try() 
primitive. This takes a parser and returns a parser. It runs that parser 
in such a way that if it fails, P_Try() makes it "look like" it consumed 
no input. In other words, P_Try() does the "remembering" part (with the 
associated cost in RAM usage).

This way, Fork() uses no extra RAM by default, and if you want 
backtracking (and the inevitable increase in RAM usage), you can 
explicitly ask for P_Try().

You only pay for what you use.
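The consumed-input rule and P_Try() can be sketched in the same toy Python model. Here failure reports the position actually reached, so Fork() can tell whether input was consumed (made-up names, not Parsec's implementation):

```python
# Fork() with the real Parsec rule: try the alternative only if the
# first parser failed WITHOUT consuming input. p_try() makes failure
# look consumption-free, enabling backtracking.

def p_string(s):
    def run(text, pos):
        i = pos
        for c in s:
            if i < len(text) and text[i] == c:
                i += 1
            else:
                return (False, "expected %r" % s, i)  # i marks consumption
        return (True, s, i)
    return run

def fork(p1, p2):
    def run(text, pos):
        ok, val, pos2 = p1(text, pos)
        if ok:
            return (ok, val, pos2)
        if pos2 != pos:                # consumed input, then failed:
            return (ok, val, pos2)     # real error, don't try p2
        return p2(text, pos)
    return run

def p_try(parser):
    """On failure, rewind to the start position so Fork() will try
    the alternative (the 'remembering' costs memory in real life)."""
    def run(text, pos):
        ok, val, pos2 = parser(text, pos)
        if ok:
            return (ok, val, pos2)
        return (False, val, pos)       # pretend nothing was consumed
    return run

no_backtrack = fork(p_string("HELP"), p_string("HEAD"))
backtrack    = fork(p_try(p_string("HELP")), p_string("HEAD"))
```

On input "HEAD", `no_backtrack` fails (the "HELP" branch consumed "HE" before failing), while `backtrack` succeeds.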


## Assorted notes ##

Note that SuperChain() takes an arbitrary *function* as its second 
argument. That function can perform an arbitrarily complex algorithm 
before returning a parser for SuperChain() to run. This means that you 
can write Turing-complete parsers.

In fact, instead of writing an expression parser that *returns* an 
expression, you could in fact write an expression "parser" that actually 
*executes* the expression as it parses it. (Whether you want to do that 
or not depends on what you're attempting to achieve.)

It also means things like being able to select parsers based on input 
already seen. Want a parser that can detect when an undeclared variable 
name appears? Parsec makes it quite easy. I won't go into exactly how, 
but suffice it to say that Parsec can track user-defined data along with 
its own internal state data (e.g., current line number, etc).


# Lies! #

OK, yeah, I admit it. I lied! I said there would be no Haskell syntax, 
but ">>" actually is the real name of the real Haskell function that 
works exactly the way I described. And SuperChain() is actually called 
">>=". And P_Succeed() is called "return". (Notice that that's what most 
parsers use it for - changing the parser's return value.) Finally, 
P_Fail() is actually just "fail".

Together, (>>), (>>=) and (return) comprise Haskell's much-feared 
"monad" construct. If you understood the explanation of SuperChain(), 
you just understood how monads work!

Parsec parsers form a monad, and Parsec is therefore a *monadic* 
combinator library. (A very common form.)

The reason that Haskell has the magical syntax for using SuperChain() is 
that in Haskell, I/O is *also* a monad, and I/O is a Rather Common Task. 
(!) Hence the syntax sugar.

A surprising number of other things turn out to be monads as well - even 
things that don't seem remotely similar. But that's a topic for another 
day...


# Financial contracts #

Lest anybody think that combinator libraries are only good for building 
parsers, let me *briefly* present a second example.

Some while ago, apparently a group of Haskell hackers examined the 
problem of describing and valuing financial contracts. Their results 
evidently caused a bit of a stir in the financial community. (I don't 
know to what extent the authors are just blowing their own trumpet.)

What they wrote is a Haskell combinator library for combining contracts. 
They started with the idea that a contract has a financial value 
(negative or positive). Then they added a combinator that makes the 
value of a contract change at a certain point in time.

Next they added a combinator that makes a new contract whose value at 
any time is equal to the sum of the values of its two subcontracts at 
that time. (So, this is similar to Chain().) And then they added a 
combinator that allows you to select one contract *or* the other, but 
not both. (So this is like Fork().)

It should be easy to see that with these few simple operations, it's 
quite easy to build highly complex contracts. It should also be utterly 
clear that algorithmically determining a contract's value at any point 
in time, given any particular combination of selections, is a pretty 
simple task.
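To give the flavour in code, here is a toy Python echo of the idea. The combinator names (pay, at, both, choose) and the modelling of a contract as a function from time to value are my own simplifications, not the combinators from the actual work:

```python
# Toy sketch: a "contract" is a function time -> value.

def pay(amount):
    """A contract worth a fixed amount at every time."""
    return lambda t: amount

def at(time, contract):
    """Worth the underlying contract's value from `time` onward, 0 before.
    (Roughly, a value that changes at a certain point in time.)"""
    return lambda t: contract(t) if t >= time else 0.0

def both(c1, c2):
    """Hold both contracts: values add (the Chain()-like combinator)."""
    return lambda t: c1(t) + c2(t)

def choose(c1, c2):
    """Hold one or the other: a rational holder takes the larger value
    (the Fork()-like combinator)."""
    return lambda t: max(c1(t), c2(t))

# 100 now, plus the better of 50 or -20 once time 5 is reached:
deal = both(pay(100.0), at(5, choose(pay(50.0), pay(-20.0))))
```

Valuing `deal` at any time is a single function call, which is the point: once a contract is a composition of a few well-defined combinators, its value is mechanically computable.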

The authors claim that their work was hailed as "visionary". Apparently 
no system had hitherto existed for clearly and precisely stating the 
properties of contracts in an abstract and mathematically rigorous way.

I don't know if any of that is true, but I *do* know that when you're 
trying to solve highly complex problems, being able to precisely define 
what the problem actually *is* in the first place would be a very 
significant step forward.

