On 09/04/2013 02:34 PM, Warp wrote:
> The fastest possible way of parsing a text file with human-readable syntax
> is to load it into an array in memory (if the file is exceedingly large
> you'll have to do it in sections, but that shouldn't slow it down in any
> significant way) and then parse it in-place (ie. without pulling out any
> parts of it or creating new objects just to parse the thing) in a linear
> fashion (ie. you traverse the array from beginning to end, without ever
> going back.)
>
> It is possible to do this kind of parsing without any kind of recursion,
> but it becomes a bit complicated. Basically you need a state machine (which
> has been generated at compile time into your program), and usually this
> means generated code. (I suppose you could write a state machine by hand,
> but it quickly becomes complicated, especially if the syntax of the thing
> you are parsing is complex.)
Interestingly, if your parser is a monad, then you can never transform
it into a finite state machine. Monads are just "too flexible" to permit
this.
To be concrete: I can write a single Parsec parser which parses a file
consisting of a BNF grammar specification followed by text written in
that grammar. In other words, a parser which dynamically constructs a
new parser and then runs it. Obviously there is no hope of statically
optimising such a thing.
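To make the "dynamic" part concrete, here is a minimal sketch of a monadic parser whose later behaviour depends on an earlier result. This is a hand-rolled toy parser type, not Parsec itself; the names Parser, item, digit and counted are invented for the example.

```haskell
import Control.Monad (replicateM)
import Data.Char (digitToInt, isDigit)

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, rest) -> (f a, rest)) (p s)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, s')  <- pf s
    (a, s'') <- pa s'
    Just (f a, s'')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> do
    (a, s') <- p s
    runParser (f a) s'

-- Consume any single character.
item :: Parser Char
item = Parser $ \s -> case s of
  (c:cs) -> Just (c, cs)
  []     -> Nothing

-- Consume a single decimal digit, returning it as an Int.
digit :: Parser Int
digit = Parser $ \s -> case s of
  (c:cs) | isDigit c -> Just (digitToInt c, cs)
  _                  -> Nothing

-- The monadic step: the number we just parsed decides what gets
-- parsed next. No static analysis can recover the shape of this
-- parser, because the function after >>= is arbitrary code.
counted :: Parser String
counted = digit >>= \n -> replicateM n item
```

Running it: `runParser counted "3abcde"` reads the digit 3 and then exactly three characters, giving `Just ("abc", "de")`. The grammar of the tail is only known once the head has been parsed, which is the whole problem.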
In Haskell we use monads to perform I/O, and for that task this level of
flexibility is exactly what is wanted. We *want* programs to change
their behaviour depending on the input they receive. But for the problem of
parsing a fixed grammar, that's actually a little bit *too* flexible.
Sometimes we would like something a bit less powerful, because often
less is more with these things. (In this case, less expressive power
equals more optimisation capabilities.)
The problem with monads is that the bind operation feeds the result of
one step into some arbitrary Turing-complete code block which then
generates the next parser to run. This Turing-completeness is the source
of the trouble; it means you can't statically analyse what the parser is
going to try to do (c.f. Rice's theorem).
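The difference is visible directly in the standard type signatures:

```haskell
-- Monadic bind: the second argument is an arbitrary function, so the
-- rest of the parser can only be discovered by running that function.
(>>=) :: Monad m => m a -> (a -> m b) -> m b

-- Applicative apply: both sides are already-constructed parsers, so
-- the entire structure exists, and can be inspected, before any
-- input is seen.
(<*>) :: Applicative f => f (a -> b) -> f a -> f b
```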
There are alternatives, however. In particular, instead of using a monad
to do parsing, you can use something called an "applicative functor".
(No, I'm not kidding. That's what they call it.)
Every monad is also an applicative functor. However, not all applicative
functors are also monads. For the purpose of parsing, it's the ones that
aren't monads which are interesting.
A few definitions. In Haskell terms, a "functor" is basically anything
that has a Map() function. You'll recall that Map() takes a list and a
function, and applies that function to every element of the list,
resulting in a new list. Well, in a slightly mind-bending kind of way,
an I/O operation is *also* a functor, because you can take (say) an I/O
operation that returns a byte, and Map() a function from bytes to
characters over it, resulting in a new I/O operation that returns a
character. And, indeed, *every* monad is also a functor in this way.
So much for functors. If you have something that's a functor, and all
you know about it is "it's a functor", then that lets you Map() it, and
that's all you can do. This isn't terribly interesting on its own.
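A quick illustration, using fmap, which is Haskell's name for Map():

```haskell
import Data.Char (toUpper)

-- fmap over a list: the familiar case.
upped :: String
upped = fmap toUpper "hello"        -- "HELLO"

-- fmap over an I/O action: getLine is an action returning a String;
-- mapping over it yields a *new* I/O action that returns the
-- upper-cased String. Nothing runs until the action is executed.
shout :: IO String
shout = fmap (map toUpper) getLine
```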
Now, an *applicative* functor has two additional abilities. One is that
it has a function like Inject(), which puts a value "into" the functor.
The other is a function that I'm going to call Apply(). It's a bit like
Map(). But whereas Map() takes a function and a functor containing a
value [or values], Apply() takes a functor of functions and a functor of
values.
For example, if Parser<X> is a parser that parses an X, then
public Parser<T2> Apply(Parser<Func<T1, T2>> p1, Parser<T1> p2)
First p1 is run, then p2 is run, then the function returned by p1 is
executed with the result from p2 as its argument. Whatever it returns,
the composite parser returns.
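The same shape is easier to see in a simpler applicative functor. Maybe (Haskell's nullable type) is one, and it shows the mechanics without any parsing machinery:

```haskell
-- <*> takes a Maybe-of-function and a Maybe-of-value, and succeeds
-- only if both are present.
ex1 :: Maybe Int
ex1 = Just (+ 1) <*> Just 41    -- Just 42

ex2 :: Maybe Int
ex2 = Nothing <*> Just 41       -- Nothing: no function to apply
```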
My word, that seems mighty complicated. And why would you want a parser
that returns a function anyway?
Well, Map() lets you apply a function to *one* argument. But what if you
have (say) a 4-argument function, and you want to run 4 parsers to
generate each of the inputs? Well, you can do
Map(f, p1)
That partially applies our 4-argument function "f" to the first
argument... But now we've got a parser that returns a 3-argument
function. We're stuck; Map() wants a function, not a parser that returns
a function.
Now if our parser forms a monad, this is where we invoke Bind() to chain
in the next step. But that's actually overkill; all we want is to be
able to take a function from a parser. And that's all Apply() does. So
we can do this:
temp1 = Apply(Inject(f), p1);  // or equivalently Map(f, p1)
temp2 = Apply(temp1, p2);
temp3 = Apply(temp2, p3);
temp4 = Apply(temp3, p4);
Now temp4 is a parser that parses 4 things and then applies f to the
results. In Haskell, it's not called Apply(), it's called "<*>", which
is an infix operator, so we can just write
temp4 = pure f <*> p1 <*> p2 <*> p3 <*> p4
which looks almost the same as
result = f v1 v2 v3 v4
but with a bunch of <*> operators inserted between arguments. There's
also a "<*" operator, which runs a parser but discards its result:
parser = pure f <*> arg1_parser <*> arg2_parser <* separator_parser
<*> arg3_parser
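Here is that chain as a runnable sketch, using ReadP from Haskell's base library as a stand-in parser type (the names number, comma and triple are mine):

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, char, eof, munch1, readP_to_S)

-- One decimal number.
number :: ReadP Int
number = read <$> munch1 isDigit

-- A separator whose result we don't care about.
comma :: ReadP Char
comma = char ','

-- pure f <*> p1 <* sep <*> p2 <* sep <*> p3: the three numbers feed
-- the 3-argument tuple constructor; the commas are parsed but their
-- results are discarded by <*.
triple :: ReadP (Int, Int, Int)
triple = pure (,,) <*> number <* comma <*> number <* comma <*> number
```

Running `readP_to_S (triple <* eof) "1,2,3"` yields `[((1,2,3),"")]`: three numbers parsed, separators consumed and thrown away.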
This, by itself, doesn't allow us to parse alternatives. (But then,
neither does a monad by itself.) We need extra combinators to handle
that, and a few other things which are specific to parsers.
Using this framework, the parser can be represented not as an opaque
block of Turing-complete code, but as a transparent data structure which
can be operated on by other functions. This means you can do run-time or
compile-time optimisations on it. In particular, you could write a
function that takes a parser and generates a finite state machine from it.
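A toy sketch of how that can work, loosely in the style of Swierstra-style deterministic parser combinators (all names here are invented for the example): pair every parser with static facts that <*> combines without running anything.

```haskell
import Data.Char (isDigit)

-- Static facts about a parser, computable without running it.
data Info = Info
  { acceptsEmpty :: Bool    -- can it match the empty string?
  , firstChars   :: [Char]  -- characters it could start with
  } deriving Show

-- A parser is transparent static data plus an opaque run function.
data P a = P { info :: Info, run :: String -> Maybe (a, String) }

instance Functor P where
  fmap f (P i r) = P i (\s -> fmap (\(a, rest) -> (f a, rest)) (r s))

instance Applicative P where
  pure a = P (Info True []) (\s -> Just (a, s))
  P i1 r1 <*> P i2 r2 = P i' r'
    where
      -- The static halves combine *statically*: no input needed.
      i' = Info (acceptsEmpty i1 && acceptsEmpty i2)
                (firstChars i1
                  ++ (if acceptsEmpty i1 then firstChars i2 else []))
      r' s = do (f, s')  <- r1 s
                (a, s'') <- r2 s'
                Just (f a, s'')

-- A single character matching a predicate, with its first-set.
satisfy :: (Char -> Bool) -> [Char] -> P Char
satisfy p first = P (Info False first) $ \s -> case s of
  (c:cs) | p c -> Just (c, cs)
  _            -> Nothing

digitP :: P Char
digitP = satisfy isDigit ['0'..'9']

-- Two digits. firstChars (info pair) is "0123456789", and we learned
-- that without parsing a single byte: this is exactly the hook a
-- state-machine generator would exploit.
pair :: P (Char, Char)
pair = pure (,) <*> digitP <*> digitP
```

Because the parser built with <*> is data plus a run function, rather than a single opaque closure, tools can inspect the Info half and compile it, while a monadic parser would hide everything inside the function after bind.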
The *other* possibility is to use something called an "arrow", but
that's for another post...