Invisible wrote:
I read it all the way through too. Nicely written, good explanation. I'd
add a bit at the end that points to the online resources you'd need to
learn it more completely. (A few spelling mistakes and a dropped word
or two also, but that's trivia. :-)
> scott wrote:
>
>> I read it, very interesting seeing as I had no idea how a parser works
>> (I always wondered how you would go about writing one).
>
> The more "usual" approach is something based around a finite
> state machine - but as I just explained, Parsec works differently. (And
> indeed will parse things that an FSM couldn't parse - not that you'll
> often need that.)
Actually, that's not *quite* right.
An FSM will parse what a regular expression parses, which doesn't
include (for example) matching nested brackets, because there's no
arbitrarily-large storage area to hold the count of nested brackets.
A "push-down automaton" (or PDA) is one common way of implementing a
parser. It's an FSM *with* a stack. This is the sort of thing YACC /
Bison implements. People use it because you can algorithmically generate
the automaton from the BNF grammar of the language you want to parse.
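The "FSM with a stack" idea can be sketched in a few lines of Haskell: a bracket matcher that keeps an explicit stack and accepts only when every open bracket has been popped. (A hand-rolled illustration of the concept, not anything YACC actually generates; the names are mine.)

```haskell
-- Recognise balanced brackets with an explicit stack: push on '(',
-- pop on ')', accept only if the stack is empty at the end.
balancedPDA :: String -> Bool
balancedPDA = go []
  where
    go stack      ('(':cs) = go ('(':stack) cs  -- push on open bracket
    go ('(':stack) (')':cs) = go stack cs       -- pop on matching close
    go _          (')':_)  = False              -- close with empty stack: reject
    go stack      (_:cs)   = go stack cs        -- ignore other characters
    go stack      []       = null stack         -- accept iff nothing left to close
```

Because the stack can grow without bound, this recognises a language no plain FSM can.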
What Parsec implements is called a recursive-descent parser: the parser
basically uses the run-time call stack in place of an explicit stack.
It's best for LL(1) grammars - "Left-to-right scan, Leftmost derivation,
1 token of lookahead". I.e., a language you can parse by running
strictly from left to right, looking only one token ahead, so you don't
need the "P_Try()" function. Some languages (like Pascal) are explicitly
designed this way to make it easy to build a recursive-descent parser.
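A recursive-descent recogniser for balanced brackets, sketched in plain Haskell, shows what "the run-time stack in place of an explicit stack" means: each open bracket costs one recursive call, so the nesting depth lives in the call stack. (An illustrative sketch only; the names are mine, not Parsec's.)

```haskell
-- Accept zero or more balanced bracket groups, returning the leftover
-- input, or Nothing on a mismatch. The recursion depth tracks nesting.
groups :: String -> Maybe String
groups ('(':rest) =
  case groups rest of                 -- recurse: one stack frame per '('
    Just (')':rest') -> groups rest'  -- found the matching close, keep going
    _                -> Nothing       -- unmatched open bracket
groups rest = Just rest               -- not an open bracket: stop here

balanced :: String -> Bool
balanced s = groups s == Just ""      -- accept iff all input consumed
```

Note the parser never looks more than one character ahead, which is exactly the LL(1) property described above.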
> 1. A "tokeniser" takes the input string and splits it into "tokens",
> possibly decorating them slightly. So now you have a flat stream of
> tokens instead of just characters.
The advantage of doing it this way is that most tokens can be recognised
with a regular expression, which is much faster (and leads to a *much*
smaller push-down automaton) than doing it with the same PDA as used by
the rest of the grammar.
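A toy tokeniser makes the point: each token kind here is recognisable by a regular expression (a run of letters, a run of digits), so this stage needs no stack at all. (A sketch of the idea, assuming a tiny made-up token type; this is not Parsec's actual lexer.)

```haskell
import Data.Char (isAlpha, isDigit, isSpace)

-- A minimal token type: identifiers, integer literals, and
-- single-character symbols.
data Token = TIdent String | TNumber Integer | TSymbol Char
  deriving (Show, Eq)

-- Turn a character stream into a flat stream of tokens. Every branch
-- is a simple character-class test - regular-expression territory.
tokenise :: String -> [Token]
tokenise [] = []
tokenise (c:cs)
  | isSpace c = tokenise cs                             -- skip whitespace
  | isAlpha c = let (name, rest) = span isAlpha (c:cs)
                in TIdent name : tokenise rest          -- letter run -> identifier
  | isDigit c = let (ds, rest) = span isDigit (c:cs)
                in TNumber (read ds) : tokenise rest    -- digit run -> number
  | otherwise = TSymbol c : tokenise cs                 -- anything else: one symbol
```

The grammar-level parser then works over `[Token]` instead of `String`, and its job gets much smaller.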
> Parsec allows you to parse and then tokenise, or do the whole thing in
> one pass. It's up to you. (I didn't mention it, but Parsec supports
> running parsers over lists of *anything*, not just characters. So you
> could in theory parse raw binary if you wanted...)
Or you could parse a list of tokens coming from the tokeniser, which is
what I imagine the authors were thinking. :-)
--
Darren New / San Diego, CA, USA (PST)