> 1. A "tokeniser" takes the input string and splits it into "tokens",
> possibly decorating them slightly. So now you have a flat stream of tokens
> instead of just characters.
Do you need to have some sort of grammar rules though before you can
tokenise? I mean what if you have the string "-3*(4-3e-2)+5*-3", the
tokeniser (if I understand right) needs to correctly decide how to interpret
those minus signs, needing some rules about what is surrounding it etc. What
would be the best way to convert an ASCII string like that into a list of
tokens?
scott wrote:
> Do you need to have some sort of grammar rules though before you can
> tokenise?
Yes, definitely.
> I mean what if you have the string "-3*(4-3e-2)+5*-3", the
> tokeniser (if I understand right) needs to correctly decide how to
> interpret those minus signs, needing some rules about what is surrounding
> it etc. What would be the best way to convert an ASCII string like that
> into a list of tokens?
Ah yes, the old "is it unary minus or binary minus?" question.
In this case, you'd probably let minus be a token by itself, and let the
parser decide whether it's unary or binary based on the context when it
builds the parse tree.
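
A minimal sketch of that idea, assuming the tokeniser has already produced a flat list of strings. The function name `classify_minus` and the labels "neg" and "sub" are made up for illustration; a real parser would usually make this decision implicitly from its grammar rather than in a separate pass.

```python
def classify_minus(tokens):
    """Label each '-' as unary ("neg") or binary ("sub") from the token before it."""
    labelled = []
    prev = None
    for tok in tokens:
        if tok == "-":
            # A minus is unary when it starts the input or follows an
            # operator or an opening parenthesis, i.e. when no operand
            # has just ended.
            unary = prev is None or prev in ("+", "-", "*", "/", "(")
            labelled.append("neg" if unary else "sub")
        else:
            labelled.append(tok)
        prev = tok
    return labelled


# The example string from the question, already split into tokens by hand.
tokens = ["-", "3", "*", "(", "4", "-", "3e-2", ")", "+", "5", "*", "-", "3"]
print(classify_minus(tokens))
```

The tokeniser only ever emits one kind of "-"; the context (what came before) is enough to tell the two uses apart later.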
And lo on Tue, 23 Sep 2008 12:32:15 +0100, Invisible <voi### [at] devnull> did
spake, saying:
> # Preface #
>
> OK, so the muse has taken me. I want to write something.
[smack] Blog!
--
Phil Cook
--
I once tried to be apathetic, but I just couldn't be bothered
http://flipc.blogspot.com
Phil Cook wrote:
> [smack] Blog!
Meh. Like anybody will read it! :-P
> scott wrote:
>
>> Do you need to have some sort of grammar rules though before you can
>> tokenise?
>
> Yes, definitely.
To see this, consider that in many programming languages, "foo_bar" is a
single identifier. However, in TeX source code, "_" is [usually] a
command name and hence should be parsed as a separate token.
So what constitutes a "token" completely depends on exactly what you're
trying to tokenise/parse.
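
To make that concrete, here is a small sketch with two different token rules applied to the same input. Both patterns are assumptions for illustration: the first is the usual identifier rule, the second is only loosely TeX-like (real TeX uses category codes, not a regular expression like this).

```python
import re

# Identifier rule typical of many programming languages:
# a letter or underscore, then letters, digits or underscores.
IDENT_STYLE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Simplified TeX-like rule (an assumption): a run of letters is one token,
# and any other non-space character is a token by itself.
TEX_STYLE = re.compile(r"[A-Za-z]+|[^A-Za-z\s]")

print(IDENT_STYLE.findall("foo_bar"))  # ['foo_bar']
print(TEX_STYLE.findall("foo_bar"))    # ['foo', '_', 'bar']
```

Same characters, different token streams, purely because the rules differ.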
scott wrote:
> What would be the best way to convert an ASCII string like that
> into a list of tokens?
Typically, it's done with regular expressions. And typically, "-37" is
two tokens in programming language compilers, at least.
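
A rough sketch of a regular-expression tokeniser along those lines, in Python. The token names (NUMBER, OP, SKIP, ERROR) and the function `tokenise` are hypothetical, and the number pattern is just one reasonable choice for a small arithmetic language.

```python
import re

# One alternative per token class; the name of the group that matched
# tells us what kind of token we found.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)   # 3, 4.5, 3e-2 (no leading sign)
  | (?P<OP>[-+*/()])                             # single-character operators and parentheses
  | (?P<SKIP>\s+)                                # whitespace, discarded
  | (?P<ERROR>.)                                 # anything else is an error
""", re.VERBOSE)


def tokenise(text):
    """Split text into (kind, value) pairs."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {m.group()!r} at position {m.start()}")
        tokens.append((kind, m.group()))
    return tokens


print(tokenise("-37"))  # [('OP', '-'), ('NUMBER', '37')]
print(tokenise("-3*(4-3e-2)+5*-3"))
```

Because the number pattern has no leading sign, "-37" comes out as two tokens, while the "-" inside "3e-2" stays part of the number since it can only appear directly after the exponent marker.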
--
Darren New / San Diego, CA, USA (PST)
> Typically, it's done with regular expressions. And typically, "-37" is two
> tokens in programming language compilers, at least.
OK, and then is there only one token for "minus", or are there two for unary
and binary minus? ie does the parser decide or the tokeniser?
scott wrote:
> OK, and then is there only one token for "minus", or are there two for
> unary and binary minus? ie does the parser decide or the tokeniser?
Varies depending on the rules of whatever you're trying to process, but
typically it's the parser.
>> OK, and then is there only one token for "minus", or are there two for
>> unary and binary minus? ie does the parser decide or the tokeniser?
>
> Varies depending on the rules of whatever you're trying to process, but
> typically it's the parser.
I'm just curious, because I made a parser like this in C++ once (it was very
hacky and basically just stepped along the string trying to identify what
each byte was). Anyway, it worked ok for things like "-5*(4+2)" etc, but
crashed with "-(4+2)". I guess the minus operator should be encoded as its
own token and then let the parser sort out what it should do. Maybe I'll
try a rewrite one day.
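
For what such a rewrite might look like, here is a minimal recursive-descent sketch. It is in Python rather than C++ to keep it short, it is not the original program, and all names (`tokenise`, `Parser`, `evaluate`) are hypothetical. The tokeniser emits a bare "-" and the grammar decides what it means: a minus seen where an operand is expected is unary, a minus seen after an operand is binary.

```python
import re

# Optional leading whitespace, then a number (no sign) or a single operator.
TOKEN = re.compile(r"\s*(\d+(?:\.\d*)?(?:[eE][+-]?\d+)?|[-+*/()])")


def tokenise(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character near position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):      # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()   # minus after an operand: binary
        return value

    def term(self):      # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):    # factor := '-' factor | '(' expr ')' | NUMBER
        tok = self.take()
        if tok == "-":
            return -self.factor()      # minus where an operand is expected: unary
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or tok in ("+", "*", "/", ")"):
            raise SyntaxError(f"expected a number, got {tok!r}")
        return float(tok)


def evaluate(text):
    parser = Parser(tokenise(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("-5*(4+2)"))  # -30.0
print(evaluate("-(4+2)"))    # -6.0
```

Since `factor` handles "-" by recursing into another `factor`, "-(4+2)" and "5*-3" need no special casing. Operator precedence falls out of the three grammar levels.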
scott wrote:
> I'm just curious, because I made a parser like this in C++ once (it was
> very hacky and basically just stepped along the string trying to
> identify what each byte was). Anyway, it worked ok for things like
> "-5*(4+2)" etc, but crashed with "-(4+2)". I guess the minus operator
> should be encoded as its own token and then let the parser sort out what
> it should do. Maybe I'll try a rewrite one day.
Yeah, this is one of the tricky edge-cases of expression parsing.
Operator precedence and unary/binary operators can get pretty ugly.
(Gets even harder if you want to report meaningful error messages if
there's an actual syntax error...)
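
One common ingredient for better error messages is to have the tokeniser remember where each token came from, so later stages can point at the source text. The sketch below is only an illustration under that assumption; the function names are invented, and it checks parentheses only rather than doing a full parse.

```python
import re

TOKEN = re.compile(r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?|[-+*/()]")


def describe(text, pos, message):
    """Format an error with a caret under the offending column."""
    return f"{message} at column {pos + 1}\n{text}\n{' ' * pos}^"


def tokenise_with_positions(text):
    """Return (token, offset) pairs so later stages can point at the source."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(describe(text, pos, f"unexpected character {text[pos]!r}"))
        tokens.append((m.group(), pos))
        pos = m.end()
    return tokens


def check_parentheses(text):
    """Describe the first unbalanced parenthesis, or return None if all match."""
    open_positions = []
    for tok, pos in tokenise_with_positions(text):
        if tok == "(":
            open_positions.append(pos)
        elif tok == ")":
            if not open_positions:
                return describe(text, pos, "unmatched ')'")
            open_positions.pop()
    if open_positions:
        return describe(text, open_positions[-1], "unclosed '('")
    return None


print(check_parentheses("-(4+2"))
```

The same (token, offset) pairs could be fed to a full parser so that messages like "expected a number" can also name a column.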