Re: embarrassing spaghetti code needs stylistic advice
- From: luser-ex-troll <mijoryx@xxxxxxxxx>
- Date: Sat, 21 Mar 2009 06:59:08 -0700 (PDT)
On Mar 21, 4:57 am, gw7...@xxxxxxx wrote:
On 20 Mar, 03:12, luser-ex-troll <mijo...@xxxxxxxxx> wrote:
I have a problem of a somewhat different kind than the usual post. My
code works! It's just appallingly ugly. With my attention focused
sharply on clear and consistent data structures, the most important
function in my nascent postscript interpreter, the lexical scanner,
has degenerated into spaghetti.
The book "BCPL: the language and its compiler", by Martin Richards and
Coklin Whitby-Strevens, includes the code for a lexical scanner, with
plenty of comments about it. If you can get hold of this book, it
might be worth reading. BCPL is a fore-runner of C.
I've only skimmed through your code, but a few points struck me:
int Snext(Object s) {
return sgetc(s.u.s);
}
void Sback(int c, Object s) {
s.u.s->length++;
*(--(s.u.s->s)) = c; //back it up, follow the pointer, store
}
You're worryuing here about reading values and then needing to back up
a bit. The code in the book avoids the need to do this. There is a
function called RCH which will read a character in and put it in a
global variable CH. (Yes, I know globals are disapproved of these
days, but bear with me...) The function NEXTSYMB, which reads the next
tioken from the source, assumes that there is a character in CH which
has not yet been processed. It processes it, reaing more characters if
necessary using RCH, and it calls RCH at least once so that, when it
finishes, there is an unprocessed charcter left in CH. Thus you just
need to call RCH once at the beginning, and then you call NEXTSYMB
continually to get the tokens.
For example, suppose that a '<' character can either be the start of
<=, the start of <<, or just a less-than sign. You do the following
(I've converted this bit into C, and also fixed a couple of stylistic
points):
case '>':
RCH();
if (CH == '=') { RCH(); return S_LE; }
if (CH == '>') { RCH(); return S_LSHIFT; }
return S_LS;
Either way, the first unprocessed character is in CH afterwards.
#define NEXT if ((i=next(src)) == EOF) goto fail
#define NEXTor if ((i=next(src)) == EOF)
You seem very worried about reading in an EOF character. This seems a
bit unnecessary. At any point in the processing, it seems that one of
three things can be the case:
a) the characters you have processed so far need to be followed by
something of a specific type, and it is an error if they're not;
b) the characters you have processed so far may or may not be followed
by something of a specific type, if they are then process that, if
not, leave what follows to be processed next time round;
c) the characters you have processed so far are complete in themselves
and what follows is something separate.
In none of these cases does there seem to be any need to check
specifically whether what follows includes an EOF. Simply treat it as
any character which is different from what is allowed to follow.
This may in fact improve any error messages that you show the user -
there will be more of "You should have provided a ***, and didn't" and
less of "Unexpected end of file".
You should only need to worry about an EOF if i is EOF when you start
the loop - in which case you simply return null.
Specifically the problem is the toke function
which scans a string or file to create an object
(tag-union, variant-record). It's constructed
as a series of tests and loops within a big loop,
but uses goto to change its mind about what
type of object it has found (eg. '+' followed
by a digit is a noise character introducing the
number, but followed by anything else, it's an
executable name).
if(i == '+') { //optional +
NEXTor goto single;
if(!isdigit(i)) { BACK; i = '+'; goto aname; }
i -= '0';
goto digit; }
This doesn't really seem necessary. Either '+' is followed by a digit,
or it isn't. If it is, the digit (and any subsequent digits) are
processed exactly the same way as if the '+' wasn't there. So I think
you may just need a "continue" here intead of the "goto digit" - start
the processing off again, this time looking at the first digit rather
than the '+'. It may mean testing whether the first digit is a digit
twice, but that's hardly the biggest waste on the planet, is it?
[If you do read the book, note that it is itself not perfect. For one
thing, NEXTSYMB returns its result by a global, which seems an
unnecessary piece of horribleness. Also, instead of the neat code
above, it actually uses RETRUN to leave the function without doing a
RCH, and BREAK to leave the SWITCHON (equivalent to a switch) where it
hits a RCH at the end of the function.]
Anyhow, hope that helps.
Paul.
Yes. I'll add that to my bookfetch list at alibris. BCPL was
interpreted, wasn't it?
As far as improving the error messages, I'm somewhat restricted by the
behavior dictated by the Adobe spec, but I think I can add a field of
extra detail into the report. The error function I posted is just a
stub.
The big stumbling block, as I see it now, is my use of 3 kinds of test
on the character in question: if (i == 'x'), strchr("string", i), and
isalpha(i). It seems if I just pick one, I can organize the tests into
a grammar structure and drastically simplify the code.
Maybe it's time to draw flowcharts...
lxt
.
- Follow-Ups:
- Re: embarrassing spaghetti code needs stylistic advice
- From: luser-ex-troll
- Re: embarrassing spaghetti code needs stylistic advice
- References:
- embarrassing spaghetti code needs stylistic advice
- From: luser-ex-troll
- Re: embarrassing spaghetti code needs stylistic advice
- From: gw7rib
- embarrassing spaghetti code needs stylistic advice
- Prev by Date: Re: "Portable" C compilers?
- Next by Date: Re: receiving pointer through function parameter similar to argv
- Previous by thread: Re: embarrassing spaghetti code needs stylistic advice
- Next by thread: Re: embarrassing spaghetti code needs stylistic advice
- Index(es):
Loading