# Preface #
OK, so the muse has taken me. I want to write something. I realise
nobody here really cares, but if you're bored maybe you could take a
read and tell me what you think.
(Yes, I *know* Haskell sucks and only an idiot would use such a
stupid and irrelevant language. I think we've covered that one. I'm not
asking whether you think Haskell is cool, I'm asking what you think of
my powers of exposition. Does what I've written *make sense*? That's the
question I'm interested in.)
# Introduction #
There are many common and well-known idioms in OO languages. For
example, look at any large project and you'll probably find that
somewhere, somebody has written a class which is nearly complete except
for a few abstract methods, in terms of which all the other
functionality is defined. Then they wrote several concrete subclasses
that work purely
by implementing these abstract methods in various ways.
It's such a common technique that it hardly needs describing. Everybody
who's ever seen OOP will know what this is about.
Haskell is a much rarer language. It has its own idioms, most of which
are far less well-known. One very widely used technique is the
construction of "combinator libraries". Such a library consists of a
small set of primitive "things", and a set of combinators for combining
simple things into more complex things. Usually there will also be a
collection of ready-made things which are either very commonly needed or
non-obvious in construction.
An absolutely classic application of this technique is the construction
of text parsers. I'd like to show off the technique by walking you
through a high-level description of Haskell's "Parsec" library. (There
will not be any actual Haskell syntax; this is just a description of
concepts.)
# Parsec #
## Parsing stuff ##
Suppose you want to parse a text string to extract some kind of data
from it. If you're working in something like Java, right now you're
probably thinking something along the lines of designing some kind of
"paser object" that has an "input buffer" and various methods for
mutating the paser state. Maybe it's based on some sort of finite state
machine?...
Parsec takes a completely different approach - and that is the subject
of this text.
The Parsec library defines a "parser" data type. What this actually is
or how it works doesn't matter; a Parsec client sees only the type name.
Parsec also provides a Run() function which takes a parser and some text
and "runs" the parser over it.
There are a few things to explain about parsers. First of all, when a
parser is run, it can consume zero or more characters. That is, a parser
can terminate immediately without consuming any input at all, or a
single parser can swallow millions of characters.
Second, when you run a parser, it can "succeed" or "fail". When a parser
succeeds, it returns an item of data. This can be of any data type
(e.g., it could be an integer, a string, or a list of binary trees).
When a parser "fails", this is not a fatal program error, it just means
that the parser returns an internal Parsec error description data
structure. This will become clearer later.
The Run() function returns a structure indicating whether parsing
succeeded or failed. If successful, it returns whatever the parser
itself returned. If failed, it returns an opaque error object. There is
a function to conveniently transform this into human-readable text, but
you can also programmatically query it (e.g., to get the line number of
the error so you can highlight it on screen or something).
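Just to fix ideas, client code using Parsec ends up looking roughly
like this (pseudocode, with the field and function names invented
purely for illustration):
result = Run(my_parser, "some input text");
if (result.success)
process(result.value);
else
print(FormatError(result.error));
That is essentially all the client-side machinery there is; the
interesting question is how my_parser gets built in the first place.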
## The parsers ##
So, that's what a parser "is", in Parsec's terminology. Now, what
parsers does Parsec actually implement? Well, it implements the following:
- P_Fail(msg) is a parser that always fails immediately, returning "msg"
as the error message.
- P_Succeed(x) is a parser that always succeeds immediately (i.e.,
consuming 0 characters of input). It always returns the same value, "x".
- P_Any() is a parser that reads 1 character of input and returns that
character. (It fails with a descriptive message if EOF.)
- P_Char(x) will read 1 character of input and return that character *if
and only if* that character happens to be "x". Otherwise it fails with
an error like "unexpected character 'J'; expected 'F'".
(The names are made-up, but Parsec really does have all of these.)
As you can see, these parsers are not so much "simple" as "trivial".
Their internal implementation is slightly more complex than it would
immediately seem, but even so, each one is only a few lines of Haskell.
So what can we *do* using these parsers? Well, not a lot, really. To do
anything useful, we need some combinators...
## The Chain() combinator ##
The first combinator we're going to look at is something that I'm going
to call Chain(). Consider, for example,
parserX = Chain(parser1, parser2);
So Chain() takes two parsers and returns a parser (much like "+" takes
two numbers and returns a number). But what does this new parser *do*
when you run it?
- First, it runs parser1.
- If parser1 succeeds, its result is thrown away, and parser2 is run.
- If parser2 succeeds, its result is returned as the overall result of
parserX.
In other words, it runs the first parser and then the second parser. But
what if one or other parser fails?
- If parser1 fails, parser2 is not run at all, and parserX fails overall
(with the same error message as parser1).
- If parser1 succeeds but parser2 fails, parserX fails overall (with the
same error message as parser2).
It should be clear at this point that Chain() is associative - i.e., that
parserX = Chain(Chain(x, y), z);
parserY = Chain(x, Chain(y, z));
are two equivalent parsers. It should also be clear that Chain()
therefore allows us to take an arbitrary number of parsers and "chain
them together in sequence" to build one big parser that will run the
component parsers one after another. And, finally, the first time any
parser fails, all of the following parsers are automatically skipped and
the overall parser fails. (A bit like throwing an exception - except
that this is a library feature, not a hard-wired language construct.)
Now, writing Chain(Chain(Chain(...))) gets tedious quickly, so from now
on I'm going to use a simple shorthand. Instead of "Chain(x, y)", I'm
going to just write "x >> y". Just remember that it means "execute x and
then execute y" (with a few additional details, as per above).
Using this new Chain() / ">>" combinator, I can write
P_Char('H') >> P_Char('E') >> P_Char('L') >> P_Char('P');
which represents a parser which succeeds if the input begins "HELP", and
fails otherwise. (Note that this includes checking for EOF.) As it
stands, however, when this parser succeeds it returns 'P', which isn't
fantastically useful. However, consider the following trick:
P_Char('H') >> P_Char('E') >> P_Char('L') >> P_Char('P') >>
P_Succeed("HELP");
Recall that P_Succeed() always succeeds and returns the specified value.
However, >> will only execute this if all the preceding parsers
succeed. In other words, this parser does exactly the same thing as the
previous one, except that it returns "HELP" instead of just 'P' as its
result.
It turns out that parsers like the one above are extremely common, so
Parsec provides us with a shortcut for writing them:
function P_String(string)
{
out = P_Char(string[0]);
for (int i=1; i < string.size(); i++)
out = Chain(out, P_Char(string[i]));
return Chain(out, P_Succeed(string));
}
Now we can just say P_String("HELP") to generate the above parser. That
should save us some typing!
So, we can now build a parser for any given literal string, one that
checks for EOF at each stage and produces readable error messages.
(E.g., if you do
P_String("HELP") and give it "HEAD", you get "error on line 1, column 3:
unexpected character 'A'; expected 'L'".)
But this still isn't terribly exciting; a simple string comparison would
do the same job!
## The Fork() combinator ##
The next combinator is called Fork(). It also takes a pair of parsers
and returns a parser. However, just like "+" and "-" return different
numbers, this combinator returns a slightly different parser:
parserX = Chain(parser1, parser2);
parserY = Fork(parser1, parser2);
As you know, parserX will run parser1 and then run parser2 only if
parser1 succeeds. However, parserY runs parser2 only if parser1 *fails*.
So Chain() means "do this and then do that", whereas Fork() means "do
this or else do that". Where Chain() combines parsers into a sequential
chain, Fork() combines parsers into a list of alternatives. For example:
bool_parser = Fork(P_String("YES"), P_String("NO"));
This parser accepts either the word "YES" *or* the word "NO", and
returns the word read. (If the input is anything else, it fails.)
So you can now see that when a parser "fails", this is a normal and
ordinary part of how complex parsers work internally. "Let's parse this
as an integer. Oh, that failed. OK, let's parse it as a variable. Oh,
that failed too. OK, so let's parse it as a string literal..."
Notice that if *both* parsers fail, the overall parser fails, but the
error messages are *merged*. (This is why Parsec errors are intricate
data structures rather than simple text messages.) For example, if
Fork(P_Char('X'), P_Char('Y')) were to fail, it might say
Unexpected character 'J'; expecting 'X' or 'Y'.
The Parsec client gets this functionality automatically as a
consequence of merely using Fork(). This makes for easy automatic error
handling.
Again, Fork() is associative like Chain() is:
Fork(Fork(x, y), z) === Fork(x, Fork(y, z))
And again, I'm going to write "x <|> y" instead of "Fork(x, y)" from now on.
Using this notation, I can write
colour_parser =
P_String("RED")
<|>
P_String("GREEN")
<|>
P_String("Blue");
This accepts any of the given words, or it fails (with a message like
"expecting 'RED' or 'GREEN' or 'BLUE'"; the P_String() shortcut
prettifies the error messages slightly).
By using Chain() and Fork() together, I can define parsers such as
expression_parser =
P_Char('X')
<|>
(P_Char('[') >> expression_parser >> P_Char(']'));
This parser will accept any of the following strings:
X
[X]
[[[X]]]
[[[[[[[[X]]]]]]]]
So, basically, an "X" surrounded by any number of square brackets - but
the open and closing brackets *must* match up! (And if they don't,
you'll get a helpful error message.) This is slightly less trivial than
the parsers we've seen so far, but still hardly earth-shattering.
## Optional parsing ##
Take a look at the following parser:
P_Char('X') <|> P_Succeed('?')
If the next character is an "X", the first parser succeeds and returns
an "X". However, if it fails, Fork() / "<|>" will run the second parser,
which doesn't even *look* at the input, it just returns "?".
This turns out to be quite common, so Parsec provides a shortcut:
function P_Optional(parser, default)
{
return Fork(parser, P_Succeed(default));
}
Using this, I can define the following:
bracket_parser =
P_Char('[') >> expression_parser >> P_Char(']');
expression_parser =
(P_Char('X') <|> bracket_parser) >>
P_Optional(expression_parser, "");
This is almost the same as the previous expression parser, except for
the P_Optional() part. But that part means that now all of the following
strings will be accepted:
X
XXX
X[X]X
[[[X[X]X]]]
[[[X]]]X[[[[X]]]]
[X[X[[X]]X]X]X
In other words, *any expression* composed of X's and balanced square
brackets.
This, then, is something that a trivial string compare could not do.
(Although it wouldn't be hard to manually implement such a parser.)
## Multiple parsing ##
Another thing we could very well want to do is parse multiple copies of
something. (E.g., a "number" consists of multiple "digits".) This is
also extremely common, so Parsec provides a shortcut:
function P_Many1(parser1)
{
return (parser1 >> P_Optional(P_Many1(parser1), []));
}
This parses one of something, and then optionally parses some more of
them by recursively calling itself. This lets you parse >= 1 of
something; another Parsec shortcut allows >= 0 of something. (Beware of
infinite loops: if the inner parser can succeed without consuming any
input, this will recurse forever, so it must be used cautiously.) There
is also a variant for parsing exactly N of something.
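Just to make that concrete, here is a crude number parser built from
the pieces we already have (we'll meet a tidier version of it later on):
digit_parser =
P_Char('0') <|> P_Char('1') <|> P_Char('2') <|> P_Char('3') <|> P_Char('4') <|>
P_Char('5') <|> P_Char('6') <|> P_Char('7') <|> P_Char('8') <|> P_Char('9');
number_parser = P_Many1(digit_parser);
Run number_parser over "123" and it consumes all three digits and succeeds.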
Another common requirement is to parse a sequence of things with
separators/terminators, and Parsec provides shortcuts for those too.
## The SuperChain() combinator ##
So far, most of our parsers don't "return" any interesting data. They
succeed for some inputs and fail for others, but they don't return much
data to the caller. (Typically just the last character of input or some
such.) How about we remedy that?
Consider the following parser:
parser1 >> parser2
Of course, this is really
Chain(parser1, parser2)
And I already told you that Chain() will execute the first parser,
*throw the result away* and then execute the second parser. Well, suppose
we'd actually like a parser that returns *both* results. How do we do that?
Well, it's impossible using Chain(). But there is another combinator
called SuperChain() that works in a slightly different way:
SuperChain( <parser> , <function> )
It first runs the parser. If that fails, the whole thing fails, as
before. But assuming the parser succeeds, SuperChain() takes the
parser's result and passes it to the function. The function is required
to return another parser, which SuperChain() then runs.
This has two effects. The first effect is that the function can look at
the token (or whatever) just read, and perform some completely arbitrary
algorithm to decide what to do next. It then returns the parser it would
like to be run next. This allows the implementation of
"context-sensitive grammers".
For example, in the TeX language, usually commands start with "\".
However, this is configurable. It can be changed halfway through a
file! (The standard "\verb" command actually does this.) But using
Parsec, you can write a parser which will be able to cope if the
character is changed; just write a function that returns a different
parser! [Actually this is a simplification, but you get the flavour.]
The other thing it allows is a solution to our problem above: we want to
run two parsers and return both of their results. This can be solved in
the following way:
parserX = SuperChain(parser1, foo);
function foo(x)
{
function bar(y)
{
return P_Succeed([x, y]);
}
return SuperChain(parser2, bar);
}
When parserX is run, it will run parser1 and pass its result to foo().
foo() then returns a SuperChain() parser which runs parser2 and passes
its result to bar(). Notice that bar() is a nested function within
foo(), and hence can refer to "x", which is foo's function argument. The
bar() function simply returns a P_Succeed parser which returns the data
we want.
In other words, parserX runs parser1 and then parser2 and returns both
of their results. (Not forgetting that either of these parsers might
fail, producing an error description, and the flow control is invisibly
handled by SuperChain() in the same way as Chain() does.)
That seems like an awful lot of work just to perform such a trivial
task! Fortunately, since this kind of thing is very common in Haskell,
it has some special syntax sugar for making it very easy to write. I
won't use the exact syntax, but suffice it to say the above can be
written to look something like
parserX =
begin
parser1 --> x;
parser2 --> y;
P_Succeed([x, y]);
end;
which is obviously a lot clearer. Basically
...
parser1;
parser2;
parser3;
...
becomes calls to Chain(), while
...
parser1 --> x;
parser2 --> y;
parser3 --> z;
...
becomes calls to SuperChain(), with auxiliary functions generated in
between. (You don't need to comprehend the precise algorithm, but
trust me, it works.)
Using this technology, we can write something like
bracket_parser =
begin
P_Char('[');
expression_parser --> expr;
P_Char(']');
P_Succeed([expr]);
end;
X_parser =
begin
P_Char('X');
P_Succeed(['X']);
end;
term_parser = X_parser <|> bracket_parser;
expression_parser =
begin
term_parser --> term1;
P_Optional(expression_parser, []) --> term2;
P_Succeed(term1.join(term2));
end;
This will parse a string such as "[X[X]X]" and return a list object like
['X',['X'],'X']. And so, finally, we have a parser that generates a
nontrivial data structure as its result.
## Caution ##
Suppose you wanted to write a Lambda expression parser. The grammar for
that looks like this:
<expression> =
<var name> |
"(" <expression> ")" |
<expression> <expression> |
"\" <var name > "->" <expression>
You might attempt to implement this as
expression_parser =
...
<|>
(
begin
expression_parser --> expr1;
expression_parser --> expr2;
P_Succeed([expr1, expr2]);
end;
)
<|>
...
Don't do that! You have just implemented a "left-recursive grammar",
which Parsec cannot handle. You must first refactor it to avoid
left-recursion. To understand why, consider this:
OK, so how do I parse an expression? OK, well one of the parse rules
says to start by parsing an expression. OK, so how do I parse an expression?
The parser will sit in an infinite loop, frantically chasing its tail
and making no progress. The correct solution here is to parse one
expression and then use P_Optional() to attempt to parse a second one,
if there is one.
term_parser = ...
expression_parser =
begin
term_parser --> term1;
P_Optional(expression_parser, empty_expression) --> term2;
...combine terms...
end;
Notice how the *first* line parses a single term, while the *second*
line recursively parses an entire expression. That way expressions of
arbitrary size are parsable, but the infinite recursion is avoided.
## Error messages ##
Suppose you wrote an expression parser and tried to parse "123xyz". You
might get an error like
"Error on line 1, column 4: Unexpected character 'x'; expecting '0',
'1', '2', '3', '4', '5', '6', '7', '8' or '9'."
While this error is completely correct technically, it's a tad verbose.
Once you start writing parsers for sophisticated grammars, you get error
messages indicating that all sorts of punctuation marks would be legal
here, but it can become a little overwhelming trying to figure out what
the parser is actually unhappy about.
To tame this, Parsec provides the P_Name() function, which attaches a
human-readable name to a parser. So, for example:
digit_parser =
P_Name(
P_Char('0') <|> P_Char('1') <|> ...,
"digit"
);
number_parser = P_Many1(digit_parser);
Now you get the following error message:
"Error on line 1, column 4: Unexpected character 'x'; expecting digit."
This is much better. In fact, if your "number" parser accepts fractional
values, you might get
"Error on line 1, column 4: Unexpected character 'x'; expecting
digit, '.' or 'e'."
If you like that, fine. If you'd prefer something shorter, you could put
P_Name() around number_parser instead (or as well!) to give "expecting
number literal" or something.
The net result is that when some complex parser fails, you can end up
with something like
"Unexpected character '#'; expecting variable name, number iteral or
';'."
(Haskell has to go one better by making P_Name() an *operator* with
really low precedence, so you're not forced to put your whole parser in
brackets.)
## Backtracking ##
In the interests of efficiency, Parsec doesn't actually work in
*exactly* the way I described. Specifically, "backtracking" is
*disabled* by default, and you must explicitly enable it wherever you
want it. This enables the implementation of "infinite look-ahead
grammars", but with the associated overhead only where necessary.
What does all that lot actually mean? Well, it means that Fork() doesn't
work exactly like I described. How it actually works is this:
- The first parser is executed.
- If it succeeds, return its result.
- If it fails *without consuming any input*, run the second parser.
- If the first parser consumes some input and *then* fails, the second
parser is not tried, and the whole thing fails.
What this means is that Fork() doesn't have to "remember" where the
first parser started from, and "rewind" the input stream before starting
the second parser. In particular, this "remembering" process would
prevent the GC from freeing the memory being used by the input string,
since Fork() would still hold a pointer to it.
Basically Fork() assumes that a parser can tell from the very first
character of input whether it's going to succeed or fail. And if it
fails part-way through, that's a real parsing error. (E.g., a number
parser encountering "123xyz" would be an error.) It will not "backtrack"
and run the second parser if the first one fails after consuming input.
The first parser must fail immediately for that to happen.
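For example, consider
greeting_parser = P_String("HELP") <|> P_String("HELLO");
Run this over the input "HELLO" and you might expect it to succeed. It
does not: P_String("HELP") happily consumes "HEL", then fails on the
fourth character (it wanted a 'P'), and since input has now been
consumed, Fork() refuses to try P_String("HELLO") at all.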
But what if you *want* backtracking? Well, then you use the P_Try()
primitive. This takes a parser and returns a parser. It runs that parser
in such a way that if it fails, P_Try() makes it "look like" it consumed
no input. In other words, P_Try() does the "remembering" part (with the
associated cost in RAM usage).
This way, Fork() uses no extra RAM by default, and if you want
backtracking (and the inevitable increase in RAM usage), you can
explicitly ask for P_Try().
You only pay for what you use.
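Returning to the example above:
greeting_parser = P_Try(P_String("HELP")) <|> P_String("HELLO");
Now when P_String("HELP") fails partway through, P_Try() makes it look
as though nothing was consumed, so Fork() falls back to
P_String("HELLO") and the input "HELLO" parses successfully.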
## Assorted notes ##
Note that SuperChain() takes an arbitrary *function* as its second
argument. That function can perform an arbitrarily complex algorithm
before returning a parser for SuperChain() to run. This means that you
can write Turing-complete parsers.
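For instance, here is a parser for a length-prefixed field: one digit
saying how many characters follow, then exactly that many characters.
(P_Exactly() is my made-up name for the "parse exactly N of something"
variant mentioned earlier, and to_int() is ordinary digit-to-number
conversion.)
counted_parser =
begin
digit_parser --> d;
P_Exactly(to_int(d), P_Any()) --> chars;
P_Succeed(chars);
end;
The parser run in the second step depends on data read at run time,
which is exactly the power that handing a function to SuperChain()
gives you.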
In fact, instead of writing an expression parser that *returns* an
expression, you could in fact write an expression "parser" that actually
*executes* the expression as it parses it. (Whether you want to do that
or not depends on what you're attempting to achieve.)
It also means you can do things like selecting parsers based on input
already seen. Want a parser that can detect when an undeclared variable
name appears? Parsec makes it quite easy. I won't go into exactly how,
but suffice it to say that Parsec can track user-defined data along with
its own internal state data (e.g., current line number, etc).
# Lies! #
OK, yeah, I admit it. I lied! I said there would be no Haskell syntax,
but ">>" actually is the real name of the real Haskell function that
works exactly the way I described. And SuperChain() is actually called
">>=". And P_Succeed() is called "return". (Notice how that what most
parsers use it for - to change the parser's return value.) Finally,
P_Fail() is actually just "fail".
Together, (>>), (>>=) and (return) comprise Haskell's much-feared
"monad" construct. If you understood the explanation of SuperChain(),
you just understood how monads work!
Parsec parsers form a monad, and Parsec is therefore a *monadic*
combinator library. (A very common form.)
The reason that Haskell has the magical syntax for using SuperChain() is
that in Haskell, I/O is *also* a monad, and I/O is a Rather Common Task.
(!) Hence the syntax sugar.
A surprising number of other things turn out to be monads as well - even
things that don't seem remotely similar. But that's a topic for another
day...
# Financial contracts #
Lest anybody think that combinator libraries are only good for building
parsers, let me *briefly* present a second example.
Some while ago, apparently a group of Haskell hackers examined the
problem of describing and valuing financial contracts. Their results
evidently caused a bit of a stir in the financial community. (I don't
know to what extent the authors are just blowing their own trumpet.)
What they wrote is a Haskell combinator library for combining contracts.
They started with the idea that a contract has a financial value
(negative or positive). Then they added a combinator that makes the
value of a contract change at a certain point in time.
Next they added a combinator that makes a new contract whose value at
any time is equal to the sum of the values of its two subcontracts at
that time. (So, this is similar to Chain().) And then they added a
combinator that allows you to select one contract *or* the other, but
not both. (So this is like Fork().)
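To give a flavour (these combinator names are pure invention on my
part; the real library calls its combinators something else):
payment1 = C_Payment(date1, +1000);
payment2 = C_Payment(date2, -750);
swap = C_Both(payment1, payment2);
option = C_Choice(swap, C_Nothing());
A "swap" is simply both payments held together; an "option" is the
right to choose between the swap and a contract worth nothing - i.e.,
the right to walk away.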
It should be easy to see that with these few simple operations, it's
quite easy to build highly complex contracts. It should also be utterly
clear that algorithmically determining a contract's value at any point
in time, given any particular combination of selections, is a pretty
simple task.
The authors claim that their work was hailed as "visionary". Apparently
no system had hitherto existed for clearly and precisely stating the
properties of contracts in an abstract and mathematically rigorous way.
I don't know if any of that is true, but I *do* know that when you're
trying to solve highly complex problems, being able to precisely define
what the problem actually *is* in the first place would be a very
significant step forward.