Re: lex/yacc efficiency tradeoffs

Hi Joe,

Joe Chisolm wrote:
On Fri, 23 Jul 2010 10:16:31 -0700, D Yuniskis wrote:

Hi Peter,

Peter Keller wrote:
D Yuniskis <> wrote:
Does anyone have any rule-of-thumb guidelines for the relative
efficiencies of doing promotions in the lexer vs the formal grammar? Both in terms of time and space?

E.g., imagine:

[1-9][0-9]{2,4} return VALUE;

in the lexer vs. the same explicit definition in the *grammar*? (i.e., using ". return (int) yytext[0];" in the lexer)

Are there any easily identifiable criteria that affect the tradeoff
between the two approaches?

Is there any (significant) difference in lex/yacc implementations wrt
this issue (e.g., vs flex/bison)?

I can run some empirical tests but I am not expert enough on either to
know how to push the limits with test cases...
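For concreteness, the two placements look something like this in flex terms (a sketch only; VALUE, the yylval plumbing, and the grammar-side rule names are assumptions, not taken from any particular implementation):

```lex
/* Promotion in the lexer: the generated DFA recognizes the
   whole number and the grammar sees a single VALUE token. */
[1-9][0-9]{2,4}   { yylval = atoi(yytext); return VALUE; }

/* Promotion in the grammar: pass raw characters upward and
   let yacc rules assemble the number from character tokens. */
[0-9]             { return (int) yytext[0]; }
```

In the second case the grammar carries rules along the lines of `value: nonzero | value digit`, so the per-character work moves out of the lexer's tables and into the parser's.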
Unless your compiler pass is compiling 24/7 and has to be real-time
like, you absolutely shouldn't care and should write your compiler to
be the most maintainable thing ever. Speed doesn't matter for 99% of
compilation passes.
(sigh) Sorry, I should have clarified. I'm not using them to write a
compiler but, rather, a parser for "messages". While speed isn't a real
issue (though I wouldn't want it to be a *pig*!), I *am* concerned with
size, stack penetration, etc. -- the overall "footprint".

As such, I am concerned with how often the machine "backtracks", how
much state it carries, etc.

This depends on your REs in the lexer and your grammar. From your other
posts it seems you are at the stage where you can control the error
message syntax and thus your grammar. This is a good thing. If you

Well, in some cases, I have complete control over the grammar.
In other cases, I am interfacing to existing protocols so I'm
stuck with whatever was thrown together (usually with little
concern for this sort of thing!) previously.

can put a lot of terminal symbols in your grammar you can control how
much backtracking the grammar has to do. Backtracking in your lexer
depends largely on the longest-match-first rules and your REs.

For simple grammars it can sometimes be better to let the lexer be
just a tokenizer and push some of the rules back into the grammar. Example:

Currently, I'm experimenting with the lexer doing damn little
at all! E.g., things like:

\r return CR
\n return LF
[folks seem to more readily relate to seeing "CR LF" in messages]

\040 return SP
[eliminate the ambiguity of ' ' being a space or a coincidental tab]

[\200-\277] return NONASCII
[8b clear channel can see bogus "characters" if UART misconfigured]

. return (int) yytext[0];
[push everything up to the formal grammar]

But, doing this (last rule) leaves me worrying: "lex exists for a
reason; I've just made it redundant! :< "
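Pulled together, that do-little lexer is just a handful of rules (the %{ %} framing is the usual flex boilerplate; the y.tab.h header name is an assumption about the yacc side):

```lex
%{
#include "y.tab.h"   /* CR, LF, SP, NONASCII token values from yacc */
%}
%%
\r            return CR;
\n            return LF;
\040          return SP;
[\200-\277]   return NONASCII;
.             return (int) yytext[0];
%%
```

Since all the rules match exactly one character, the longest-match rule never kicks in and the catch-all `.` (listed last) only fires when none of the named rules do.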

your digit RE above. But the bottom line is the number of messages you will
have to process per unit of time. Unless you have a complex grammar I
don't think your memory footprint is going to be much of an issue vs. a
hand-coded parser.

I'm more concerned with rules that have long chains of terminals
that can force lots of (consecutive) states into the FSM. I've
not looked at the actual code generated, yet to see how well
yacc "folds" those tables (implication analysis). Nor if it is
clever enough to shrink the "other" tables to their minimum sizes
(i.e., omit the "can't happen" state transitions).

For example, parsing a fixed width data field:

value: [1-9][0-9][0-9][0-9][0-9]'.'[0-9][0-9]
| SP [1-9][0-9][0-9][0-9]'.'[0-9][0-9]
| SP SP [1-9][0-9][0-9]'.'[0-9][0-9]
| SP SP SP [1-9][0-9]'.'[0-9][0-9]
| SP SP SP SP [1-9]'.'[0-9][0-9]
| undefined

Note these sorts of rules are more expensive than
a more typical:

value: [0-9]+'.'[0-9]+

As I say, lex exists for a reason. Which god am I angering by
failing to pay it tribute? :-/