> You'll be unsurprised to learn that I am currently about 40% of the way
> through implementing Parsec in C#. It turns out to be reasonably easy,
> although there are a few gotchas. (E.g., there is no C# function which
> both inserts an element into a list *and* returns the list. Which is
> annoying, since it seems you can't call void-functions from LINQ...)
It will be interesting to compare the speed with your original...
On 08/04/2013 09:02 AM, scott wrote:
>> You'll be unsurprised to learn that I am currently about 40% of the way
>> through implementing Parsec in C#.
>
> It will be interesting to compare the speed with your original...
For most parsing tasks, speed isn't really an issue. Usually you use
parsers to parse human-written text, which is typically small enough
that the time spent parsing is negligible compared to the time spent on
subsequent processing.
(If you wanted to parse a 3GB XML file, you wouldn't use Parsec. You'd
use one of the various XML-specific libraries which are tuned to do that
specific job very efficiently.)
That said, I'm not tuning my C# code for speed at all, and I don't
expect it to be particularly fast. Then again, Parsec defaults to using
Haskell's String type (a lazy linked list of characters), which is very
inefficient... It looks like being a tortoise race.
> For most parsing tasks, speed isn't really an issue. Usually you use
> parsers to parse human-written text, which is typically small enough
> that the time spent parsing is negligible compared to the time spent on
> subsequent processing.
I don't know a great deal about parsers (the most I've done is made one
to calculate very simple sums, like (5+4)/2) but AIUI things like POV
first convert the human readable text to binary symbols (removing any
unnecessary formatting) and then parse the binary symbols rather than
the raw text. IDK which part typically takes longer (I guess the first
part as that involves a lot of string searching), but the time POV
spends "parsing" even a medium-sized scene isn't trivial.
Saying that if you're writing something to do single-line equations then
you probably don't need to worry :-) I thought it would just be an
interesting real-world example if you've coded the same algorithm in
both Haskell and C# to see the difference.
On 09/04/2013 08:14 AM, scott wrote:
>> For most parsing tasks, speed isn't really an issue. Usually you use
>> parsers to parse human-written text, which is typically small enough
>> that the time spent parsing is negligible compared to the time spent on
>> subsequent processing.
>
> I don't know a great deal about parsers (the most I've done is made one
> to calculate very simple sums, like (5+4)/2) but AIUI things like POV
> first convert the human readable text to binary symbols (removing any
> unnecessary formatting) and then parse the binary symbols rather than
> the raw text.
In other words, "lexing" (AKA "tokenising") followed by "parsing".
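As a sketch of that two-stage pipeline (a toy tokeniser, not POV-Ray's actual one; the token names are invented):

```haskell
import Data.Char (isDigit, isAlpha, isSpace)

-- The lexer turns raw text into a stream of symbols, discarding
-- whitespace and formatting; the parser then consumes tokens
-- instead of raw characters.
data Token = TNum Int | TIdent String | TSym Char
  deriving (Show, Eq)

lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
  | isSpace c = lexer cs                              -- drop formatting
  | isDigit c = let (ds, rest) = span isDigit (c:cs)
                in TNum (read ds) : lexer rest
  | isAlpha c = let (ws, rest) = span isAlpha (c:cs)
                in TIdent ws : lexer rest
  | otherwise = TSym c : lexer cs
```

For example, `lexer "sphere { 1 }"` yields `[TIdent "sphere", TSym '{', TNum 1, TSym '}']`.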
> IDK which part typically takes longer (I guess the first
> part as that involves a lot of string searching), but the time POV
> spends "parsing" even a medium-sized scene isn't trivial.
That's because
1) The "parsing" time includes the time taken to expand macros, because
these work at the token level rather than some higher-level abstraction.
2) The "big" scenes are ones which include huge meshes. These
machine-generated constructs tend to be quite large.
> Saying that if you're writing something to do single-line equations then
> you probably don't need to worry :-) I thought it would just be an
> interesting real-world example if you've coded the same algorithm in
> both Haskell and C# to see the difference.
Yeah, why not...
scott <sco### [at] scott com> wrote:
> I don't know a great deal about parsers (the most I've done is made one
> to calculate very simple sums, like (5+4)/2) but AIUI things like POV
> first convert the human readable text to binary symbols (removing any
> unnecessary formatting) and then parse the binary symbols rather than
> the raw text.
POV-Ray doesn't convert the SDL file into any internal binary format.
It just parses the text and creates objects as needed.
Fast parsing (regardless of whether you are interpreting as you parse or
whether you are byte-compiling it) is a difficult problem, one I have some
experience with.
The fastest possible way of parsing a text file with human-readable syntax
is to load it into an array in memory (if the file is exceedingly large
you'll have to do it in sections, but that shouldn't slow it down in any
significant way) and then parse it in-place (ie. without pulling out any
parts of it or creating new objects just to parse the thing) in a linear
fashion (ie. you traverse the array from beginning to end, without ever
going back.)
It is possible to do this kind of parsing without any kind of recursion,
but it becomes a bit complicated. Basically you need a state machine (which
has been generated at compile time into your program), and usually this
means generated code. (I suppose you could write a state machine by hand,
but it quickly becomes complicated, especially if the syntax of the thing
you are parsing is complex.)
The easiest way of writing an efficient parser by hand is to implement
a recursive descent parser (which, as said, parses the input in-place,
from the array into which it has been read). An RDP allows parsing any
number of nested blocks in the input syntax (eg. parentheses or other
types of nested blocks) at any nesting depth, without ever having to go
back in the input (ie. just traversing the input from beginning to end once.)
What you do with the parsed information depends on what you want, but eg.
byte-compiling it into a stack-based bytecode (that can be later interpreted
in a very efficient manner) is very easy with a RDP. You could, of course,
just interpret the input directly.
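The recursive descent idea can be sketched in Haskell (a toy grammar for integer arithmetic; this version threads the remaining String instead of indexing into a byte array, uses integer division, and has minimal error handling):

```haskell
import Data.Char (isDigit, isSpace)

-- Grammar:  expr   ::= term (('+'|'-') term)*
--           term   ::= factor (('*'|'/') factor)*
--           factor ::= number | '(' expr ')'
-- Each non-terminal becomes one function. The input is traversed
-- left to right exactly once, never backing up; each function
-- returns the value parsed so far plus the unconsumed input.

type Input = String

expr :: Input -> (Int, Input)
expr s = go (term s)
  where go (v, '+':rest) = let (v2, r) = term rest in go (v + v2, r)
        go (v, '-':rest) = let (v2, r) = term rest in go (v - v2, r)
        go res           = res

term :: Input -> (Int, Input)
term s = go (factor s)
  where go (v, '*':rest) = let (v2, r) = factor rest in go (v * v2, r)
        go (v, '/':rest) = let (v2, r) = factor rest in go (v `div` v2, r)
        go res           = res

factor :: Input -> (Int, Input)
factor ('(':rest) = let (v, r) = expr rest
                    in case r of (')':r') -> (v, r')
                                 _        -> error "expected ')'"
factor s = let (ds, rest) = span isDigit s in (read ds, rest)

parse :: String -> Int
parse s = fst (expr (filter (not . isSpace) s))
```

With this, `parse "(5+4)/2"` (scott's example from earlier in the thread) evaluates to 4 under integer division.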
--
- Warp
>> I don't know a great deal about parsers (the most I've done is made one
>> to calculate very simple sums, like (5+4)/2) but AIUI things like POV
>> first convert the human readable text to binary symbols (removing any
>> unnecessary formatting) and then parse the binary symbols rather than
>> the raw text.
>
> POV-Ray doesn't convert the SDL file into any internal binary format.
> It just parses the text and creates objects as needed.
The existence of tokenize.cpp in the source made me think otherwise.
What does this file do then?
On 09/04/2013 02:34 PM, Warp wrote:
> The fastest possible way of parsing a text file with human-readable syntax
> is to load it into an array in memory (if the file is exceedingly large
> you'll have to do it in sections, but that shouldn't slow it down in any
> significant way) and then parse it in-place (ie. without pulling out any
> parts of it or creating new objects just to parse the thing) in a linear
> fashion (ie. you traverse the array from beginning to end, without ever
> going back.)
>
> It is possible to do this kind of parsing without any kind of recursion,
> but it becomes a bit complicated. Basically you need a state machine (which
> has been generated at compile time into your program), and usually this
> means generated code. (I suppose you could write a state machine by hand,
> but it quickly becomes complicated, especially if the syntax of the thing
> you are parsing is complex.)
Interestingly, if your parser is a monad, then you can never transform
it into a finite state machine. Monads are just "too flexible" to permit
this.
To be concrete: I can write a single Parsec parser which parses a file
consisting of a BNF grammar specification followed by text written in
that grammar. In other words, a parser which dynamically constructs a
new parser and then runs it. Obviously there is no hope of statically
optimising such a thing.
In Haskell we use monads to perform I/O, and for that task this level of
flexibility is exactly what is wanted. We *want* programs to change
their behaviour depending on the input they receive. But for the problem of
parsing a fixed grammar, that's actually a little bit *too* flexible.
Sometimes we would like something a bit less powerful, because often
less is more with these things. (In this case, less expressive power
equals more optimisation capabilities.)
The problem with monads is that the bind operation feeds the result of
one step into some arbitrary Turing-complete code block which then
generates the next parser to run. This Turing-completeness is the source
of the trouble; it means you can't statically analyse what the parser is
going to try to do (cf. Rice's theorem).
There are alternatives, however. In particular, instead of using a monad
to do parsing, you can use something called an "applicative functor".
(No, I'm not kidding. That's what they call it.)
Every monad is also an applicative functor. However, not all applicative
functors are also monads. For the purpose of parsing, it's the ones that
aren't monads which are interesting.
A few definitions. In Haskell terms, a "functor" is basically anything
that has a Map() function. You'll recall that Map() takes a list and a
function, and applies that function to every element of the list,
resulting in a new list. Well, in a slightly mind-bending kind of way,
an I/O operation is *also* a functor, because you can take (say) an I/O
operation that returns a byte, and Map() a function from bytes to
characters over it, resulting in a new I/O operation that returns a
character. And, indeed, *every* monad is also a functor in this way.
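In Haskell, Map() is spelled fmap, and the I/O example looks like this (getByte is a made-up action, stubbed out here; in a real program it might read from a file handle):

```haskell
import Data.Char (chr)

-- A hypothetical I/O action that yields a byte.
getByte :: IO Int
getByte = return 104

-- fmap turns an (IO Int) into an (IO Char) without running anything;
-- the conversion happens when the action is eventually executed.
getCharacter :: IO Char
getCharacter = fmap chr getByte
```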
So much for functors. If you have something that's a functor, and all
you know about it is "it's a functor", then that lets you Map() it, and
that's all you can do. This isn't terribly interesting on its own.
Now, an *applicative* functor has two additional abilities. One is that
it has a function like Inject(), which puts a value "into" the functor.
The other is a function that I'm going to call Apply(). It's a bit like
Map(). But whereas Map() takes a function and a functor containing a
value [or values], Apply() takes a functor of functions and a functor of
values.
For example, if Parser<X> is a parser that parses an X, then

    public Parser<T2> Apply<T1, T2>(Parser<Func<T1, T2>> p1, Parser<T1> p2)

First p1 is run, then p2 is run, and then the function returned by p1 is
executed with the result from p2 as its argument. Whatever it returns,
the composite parser returns.
My word, that seems mighty complicated. And why would you want a parser
that returns a function anyway?
Well, Map() lets you apply a function to *one* argument. But what if you
have (say) a 4-argument function, and you want to run 4 parsers to
generate each of its inputs? Well, you can do

    Map(f, p1)

That partially applies our 4-argument function "f" to its first
argument... But now we've got a parser that returns a 3-argument
function. We're stuck; Map() wants a function, not a parser that returns
a function.
Now if our parser forms a monad, this is where we invoke Bind() to chain
in the next step. But that's actually overkill; all we want is to be
able to apply a function that is itself inside a parser. And that's
exactly what Apply() does. So we can do this (using Inject() to lift the
bare function f into a trivial parser first):

    temp1 = Apply(Inject(f), p1);
    temp2 = Apply(temp1, p2);
    temp3 = Apply(temp2, p3);
    temp4 = Apply(temp3, p4);
Now temp4 is a parser that parses 4 things and then applies f to the
results. In Haskell, it's not called Apply(), it's called "<*>", which
is an infix operator, so we can just write

    temp4 = pure f <*> p1 <*> p2 <*> p3 <*> p4

which looks almost the same as

    result = f v1 v2 v3 v4
but with a bunch of <*> operators inserted between the arguments. There's
also a "<*" operator, which runs a parser but discards its result:

    parser = pure f <*> arg1_parser <*> arg2_parser <* separator_parser <*> arg3_parser
This, by itself, doesn't allow us to parse alternatives. (But then,
neither does a monad by itself.) We need extra combinators to handle
that, and a few other things which are specific to parsers.
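To make the machinery concrete, here is a hand-rolled sketch of such an applicative parser in Haskell (a toy Maybe-based type, not Parsec itself; the dash-separated date grammar is invented for illustration):

```haskell
import Data.Char (isDigit)

-- A minimal parser type: a function from input to (result, rest),
-- or Nothing on failure. Parsec's applicative interface works the
-- same way, just with better errors and performance.
newtype P a = P { runP :: String -> Maybe (a, String) }

instance Functor P where
  fmap f (P p) = P $ \s -> fmap (\(a, r) -> (f a, r)) (p s)

instance Applicative P where
  pure x = P $ \s -> Just (x, s)           -- this is Inject()
  P pf <*> P pa = P $ \s -> case pf s of   -- this is Apply()
    Nothing     -> Nothing
    Just (f, r) -> case pa r of
      Nothing      -> Nothing
      Just (a, r') -> Just (f a, r')

charP :: Char -> P Char
charP c = P $ \s -> case s of
  (x:xs) | x == c -> Just (c, xs)
  _               -> Nothing

number :: P Int
number = P $ \s -> case span isDigit s of
  ("", _)    -> Nothing
  (ds, rest) -> Just (read ds, rest)

data Date = Date Int Int Int deriving (Show, Eq)

-- pure injects the 3-argument constructor; each <*> feeds it one
-- parsed argument; <* runs the dash parser and throws its result away.
date :: P Date
date = pure Date <*> number <* charP '-' <*> number <* charP '-' <*> number
```

Running `runP date "2013-04-09"` yields `Just (Date 2013 4 9, "")`; note that the whole parser is built from pure, <*> and <*, with no Bind() anywhere.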
Using this framework, the parser can be represented not as an opaque
block of Turing-complete code, but as a transparent data structure which
can be operated on by other functions. This means you can do run-time or
compile-time optimisations on it. In particular, you could write a
function that takes a parser and generates a finite state machine from it.
The *other* possibility is to use something called an "arrow", but
that's for another post...
I'm assuming that implementing a recursive descent parser ought to be
relatively simple in Haskell, given how recursive in nature functional
programming is. (Of course this also assumes that it's easy and efficient
in Haskell to have the input as a byte array, and it's efficient to read
said array byte by byte, without much copying things around.)
An interesting thing about this type of parsing, though, is how to parse
things like numeric literals (especially if they are not integers) or
UTF-8 encoded characters.
Most languages have some kind of standard functionality for parsing the
former. The only thing one has to do is to detect when such a literal is
appearing in the input, and then call said function. (The C/C++ standard
libraries in particular offer functions to parse numeric values that are
in ASCII format in-place, without having to first copy them anywhere,
which is a major help when writing such efficient parsers.)
Parsing UTF-8 encoded characters is a bit trickier, especially if the
language does not offer standard utility functions for that purpose.
The most efficient way of doing that is to, basically, write a small
state machine that reads the input byte by byte until a byte that cannot
continue a valid sequence appears, and returns how many bytes the matched
sequence consists of. (During this parsing, or afterwards, you can then
do with those characters whatever you need to.)
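A sketch of that byte-level state machine in Haskell (determining the length of one UTF-8 sequence from its lead byte and checking the continuation bytes; overlong encodings and surrogates are ignored for brevity):

```haskell
import Data.Word (Word8)
import Data.Bits ((.&.))

-- How many bytes the UTF-8 sequence at the head of the input
-- occupies, or Nothing if it is malformed. Only the bit patterns
-- are checked: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx lead bytes,
-- each followed by the right number of 10xxxxxx continuation bytes.
utf8Len :: [Word8] -> Maybe Int
utf8Len (b:bs)
  | b < 0x80           = Just 1
  | b .&. 0xE0 == 0xC0 = conts 1 bs
  | b .&. 0xF0 == 0xE0 = conts 2 bs
  | b .&. 0xF8 == 0xF0 = conts 3 bs
utf8Len _ = Nothing

-- Require n continuation bytes of the form 10xxxxxx.
conts :: Int -> [Word8] -> Maybe Int
conts n bs
  | length valid == n = Just (n + 1)
  | otherwise         = Nothing
  where valid = filter (\b -> b .&. 0xC0 == 0x80) (take n bs)
```

For instance, `utf8Len [0xE2, 0x82, 0xAC]` (the euro sign) gives `Just 3`, while a continuation byte in the wrong place gives `Nothing`.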
(Remember that we are writing an extremely *efficient* parser here. That
means no copying characters around and parsing them separately. They have
to be parsed in-place, where they appear in the byte array.)
--
- Warp
On 20/04/2013 10:31 PM, Warp wrote:
> I'm assuming that implementing a recursive descent parser ought to be
> relatively simple in Haskell, given how recursive in nature functional
> programming is.
Indeed.
> (Of course this also assumes that it's easy and efficient
> in Haskell to have the input as a byte array, and it's efficient to read
> said array byte by byte, without much copying things around.)
*Reading* from an array is easy. You only need to copy things if they
change, and reading from something doesn't change it.
> Most languages have some kind of standard functionality for parsing the
> former.
The trouble starts when the literals you're trying to parse don't match
the format that the standard function expects. (E.g., some languages
allow "+7" as a valid integer, and others don't.)
> Parsing UTF-8 encoded characters is a bit trickier, especially if the
> language does not offer standard utility functions for that purpose.
Haskell provides a way to read from a file, decoding UTF-8 on the fly.
I'm not sure how it's implemented internally though. You end up with a
packed array of characters which you can iterate over.
> (Remember that we are writing an extremely *efficient* parser here. That
> means no copying characters around and parsing them separately. They have
> to be parsed in-place, where they appear in the byte array.)
You're probably going to struggle to get this level of efficiency from a
programming language that uses garbage collection. Even if all the data
is in a flat array, the array might get copied from place to place
occasionally by the garbage collector.
(Having said that, I'm not sure whether or not Haskell's GC does any
special handling for "large" objects. Some GCs handle these specially,
which usually means storing them in a separate area that doesn't get
scanned as often, so they don't need to be copied as much.)
On 24/04/2013 08:46 AM, Orchid Win7 v1 wrote:
> Having said that, I'm not sure whether or not Haskell's GC does any
> special handling for "large" objects.
It does. It takes various steps to try to avoid copying objects larger
than about 4KB.