Re: Tokenizing data based on an EBNF grammar?
- From: mrjohn@xxxxxxxxxxxxx
- Date: 15 Jun 2006 10:09:50 -0700
This sounds better. I think I'm on the right track now, and this is
also how I thought about the problem after you made the distinction for
me between lexing and parsing in your first post.
I agree that SVG falls a little flat with the path sublanguage -- but I
think it's rationalized by precedent: the sublanguage is similar
(perhaps identical?) to PostScript path commands, so if you're going
to/from PS it's easy to ingest/dump the path instructions.
John
Rob Thorpe wrote:
Rob Thorpe wrote:
mrjohn@xxxxxxxxxxxxx wrote:
Chris Uppal is correct, the path data is a sublanguage within the SVG
spec; the EBNF (referred to as BNF) appears within the page I linked
to.
Rob, thank you for clarifying the difference between parsing and
tokenizing/lexing. I suppose what I need to do is lex the data (the
path data sublanguage, not the SVG itself) into tokens... then parse
the tokens.
I have a few different data sources (not just SVG path data) that will
benefit from being run through a lexer/parser and output as XML. Once
the data is XML, it's pretty flexible to deal with via XPath and
the DOM parser I normally use for XML. That is why I was hoping for a
general tool to do the job -- I don't want to hand write (and test and
debug) a different parser for each of my data sources.
I will have a look at "Lex/Flex (and maybe Yacc/Bison)" today as Chris
suggested. From what I see so far, Lex & Bison look promising, though
I'm not sure whether I need a parser generator (Yacc) or a scanner
generator (Flex).
A lexer is what you need. Once you've got the tokens in a form you
like you can turn them into XML. Doing that shouldn't require any
significant parsing.
You could try flex, or writing it manually. I'd expect both would be
fairly simple for this. It's not difficult to learn simple uses of
flex. There are also other tools similar to flex you could try.
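As a rough illustration of the "tokens to XML" step, here is a minimal Python sketch. It assumes the tokens arrive as (kind, text) pairs; that shape, and the element and attribute names "tokens", "token" and "kind", are made up for the example and do not come from any spec or tool.

```python
import xml.etree.ElementTree as ET

def tokens_to_xml(tokens):
    """Serialize (kind, text) token pairs as a flat XML document."""
    root = ET.Element("tokens")          # hypothetical element name
    for kind, text in tokens:
        el = ET.SubElement(root, "token", {"kind": kind})
        el.text = text
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")

xml_text = tokens_to_xml([("COMMAND", "M"), ("NUMBER", "10")])
```

The result can then be queried with XPath or loaded into a DOM parser like any other XML.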
Scratch that. I just read the spec.
You're going to need a lexer and a parser.
These rules are suitable for putting in the lexer:
nonnegative-number: integer-constant | floating-point-constant
number: sign? integer-constant | sign? floating-point-constant
flag: "0" | "1"
comma: ","
integer-constant: digit-sequence
floating-point-constant: fractional-constant exponent? | digit-sequence exponent
fractional-constant: digit-sequence? "." digit-sequence | digit-sequence "."
exponent: ( "e" | "E" ) sign? digit-sequence
sign: "+" | "-"
digit-sequence: digit | digit digit-sequence
digit: "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
wsp: (#x20 | #x9 | #xD | #xA)
The rest should go in the parser.
It isn't clear where "comma-wsp: (wsp+ comma? wsp*) | (comma wsp*)"
should go, but probably in the parser.
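For anyone who wants to try the lexer by hand rather than with flex, the rules quoted above collapse into a few regular expressions. The following Python sketch is one possible reading, not a reference implementation. The token names are invented for the example, and the command letters are taken from the SVG path commands rather than from the rules listed here. It also makes two simplifying assumptions: whitespace is discarded in the lexer (so comma-wsp reduces to an optional COMMA for the parser), and "flag" is not a separate token, since "0" and "1" are lexically just numbers and only the parser knows when a flag is expected.

```python
import re

# number: sign? (fractional-constant exponent? | digit-sequence exponent
#                | digit-sequence)
NUMBER = (
    r"[+-]?(?:"
    r"(?:\d*\.\d+|\d+\.)(?:[eE][+-]?\d+)?"   # fractional-constant exponent?
    r"|\d+[eE][+-]?\d+"                      # digit-sequence exponent
    r"|\d+"                                  # integer-constant
    r")"
)

TOKEN_RE = re.compile(
    r"(?P<NUMBER>" + NUMBER + r")"
    r"|(?P<COMMAND>[MmZzLlHhVvCcSsQqTtAa])"  # assumed: SVG path command letters
    r"|(?P<COMMA>,)"
    r"|(?P<WSP>[\x20\x09\x0D\x0A]+)"         # wsp: #x20 | #x9 | #xD | #xA
)

def tokenize(path_data):
    """Split path data into (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(path_data):
        m = TOKEN_RE.match(path_data, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {path_data[pos]!r} at offset {pos}")
        if m.lastgroup != "WSP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("M 10,20.5L-3 .5e2z")
```

Because flags are lexed as ordinary numbers, input that runs arc flags together without separators would need extra handling in the parser.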
You could do this with flex & bison, or with antlr or several other
tools.
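If a generator feels like overkill, the parsing side can also be written by hand. The Python sketch below only groups each command with the numbers that follow it. It assumes tokens arrive as (kind, text) pairs with the hypothetical kinds COMMAND, NUMBER and COMMA, and it does not check the per-command argument counts or comma placement that the full grammar imposes, so treat it as a starting point rather than a conforming parser.

```python
def parse_commands(tokens):
    """Group a flat token list into (command, [arguments]) pairs."""
    commands = []
    current = None
    for kind, text in tokens:
        if kind == "COMMAND":
            current = (text, [])
            commands.append(current)
        elif kind == "NUMBER":
            if current is None:
                raise ValueError("number before any command")
            current[1].append(float(text))
        elif kind == "COMMA":
            continue  # separators carry no meaning in this simplified sketch
        else:
            raise ValueError(f"unknown token kind {kind!r}")
    return commands

parsed = parse_commands([
    ("COMMAND", "M"), ("NUMBER", "10"), ("COMMA", ","), ("NUMBER", "20.5"),
    ("COMMAND", "z"),
])
```

Each (command, arguments) pair maps naturally onto one XML element if that is the target format.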
Lexer generators are fairly easy to learn, but parser generators are
(necessarily) hairy, hence their names.
I must say I think this bit of the SVG spec is stupid. XML provides
them with a language (albeit not that nice a language) and the SVG spec
goes and implants another inside it. And that language isn't very nice
either!