Re: can I know how to write a html parser in C
From: Walter Roberson (roberson_at_ibd.nrc-cnrc.gc.ca)
Date: 02/24/05
- In reply to: WUV999U: "Re: can I know how to write a html parser in C"
Date: 23 Feb 2005 23:02:22 GMT
In article <1109197203.199764.207570@z14g2000cwz.googlegroups.com>,
WUV999U <usbharath.ganesh@gmail.com> wrote:
:that was a great suggestion Jarmo. As you said, I can use Perl.
:But m afraid m not used to it.
Too bad, it'd be faster to write the program.
:I need to get this done in a day or so..
:If I use C, how do I go about it?
State machine.
set state = 0
On each iteration of the loop, fetch one character
State 0: if the character is < then set matchlen = 0 and transit to state 1
otherwise discard the character and stay in state 0
State 1: if tolower(character) is "img"[matchlen] then matchlen++;
if matchlen is now 3 then transit to state 2, else stay in state 1.
Otherwise, if the character is ! and matchlen is still 0, transit to
state 3; else transit to state 4
State 2: recognize and discard whitespace (including newline).
When you get the first non-whitespace character, then if you had
no whitespace or if tolower(character) is not 's' then transit to state 4
else transit to state 5
State 3: you might be in a comment. Do what you need to do to figure out
if you have a valid start of comment. When you have determined that you
do, go to state 6; if you don't, go to state 4
State 4: you are either not in an IMG tag or you are recovering from
an error. In either case, you are not presently in quotes. Accept and
discard characters until you either get a '>' or you hit quotes; at '>'
go back to state 0; if you hit quotes, transit to a quote-absorption state
State 5: you have recognized up to "<img s". Recognize and accept
characters that match "rc=\"" and then enter URL acceptance mode;
if you hit something else, go to state 4
State 6: you are in a comment. Accept and discard all characters until
you find an end-of-comment marker or you find quotes. At end of comment
go back to state 0; at quotes, go to state 7; otherwise stay in state 6
State 7: you are inside quotes inside a comment. Accept and discard
all characters until you find an unescaped end of quote. When you
do, go back to state 6; until then stay in state 7
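The states above can be sketched in C roughly as follows. This is only a minimal sketch under some assumptions, not a finished parser: the function name img_srcs, the IMG_URL_MAX limit and the state names are made up for illustration; it looks for the src attribute, since that is where an IMG element carries its URL, and only when src is the first attribute after the tag name; comments simply end at the first "-->", so the quotes-inside-comment state is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>   /* size_t; strcmp() for callers */

#define IMG_URL_MAX 256   /* hypothetical cap on a stored URL */

/* The first seven states follow the outline (0-6); S_SKIPQ and S_URL are
   assumed names for the quote-absorption and URL acceptance modes. */
enum { S_TEXT, S_NAME, S_SPACE, S_BANG, S_SKIP, S_ATTR, S_COMMENT,
       S_SKIPQ, S_URL };

/* Hypothetical helper: copies up to max_out URLs into out, returns the count. */
size_t img_srcs(const char *html, char out[][IMG_URL_MAX], size_t max_out)
{
    int state = S_TEXT, c, saw_space = 0, dashes = 0;
    size_t matchlen = 0, n = 0, len = 0;
    char quote = 0;
    const char *p;

    assert(html != NULL && out != NULL);
    for (p = html; *p != '\0'; p++) {
        c = (unsigned char)*p;
        switch (state) {
        case S_TEXT:                       /* state 0: wait for '<' */
            if (c == '<') { matchlen = 0; state = S_NAME; }
            break;
        case S_NAME:                       /* state 1: match "img" */
            if (tolower(c) == "img"[matchlen]) {
                if (++matchlen == 3) { saw_space = 0; state = S_SPACE; }
            } else if (c == '!' && matchlen == 0) {
                state = S_BANG;
            } else {
                state = S_SKIP; p--;       /* let state 4 look at c again */
            }
            break;
        case S_SPACE:                      /* state 2: whitespace, then 's' */
            if (isspace(c)) saw_space = 1;
            else if (saw_space && tolower(c) == 's') { matchlen = 1; state = S_ATTR; }
            else { state = S_SKIP; p--; }
            break;
        case S_BANG:                       /* state 3: "<!" needs "--" next */
            if (c == '-') {
                if (++matchlen == 2) { dashes = 0; state = S_COMMENT; }
            } else { state = S_SKIP; p--; }
            break;
        case S_SKIP:                       /* state 4: discard up to '>' */
            if (c == '>') state = S_TEXT;
            else if (c == '"' || c == '\'') { quote = (char)c; state = S_SKIPQ; }
            break;
        case S_SKIPQ:                      /* quoted value: '>' is ignored */
            if (c == quote) state = S_SKIP;
            break;
        case S_ATTR:                       /* state 5: rest of src=" */
            if (tolower(c) == "src=\""[matchlen]) {
                if (++matchlen == 5) { len = 0; state = S_URL; }
            } else { state = S_SKIP; p--; }
            break;
        case S_URL:                        /* copy until the closing quote */
            if (c == '"') {
                if (n < max_out) out[n++][len] = '\0';
                state = S_SKIP;
            } else if (n < max_out && len + 1 < IMG_URL_MAX) {
                out[n][len++] = (char)c;   /* over-long URLs are truncated */
            }
            break;
        case S_COMMENT:                    /* state 6: look for "-->" */
            if (c == '-') dashes++;
            else if (c == '>' && dashes >= 2) state = S_TEXT;
            else dashes = 0;
            break;
        }
    }
    return n;
}
```

The "p--" trick hands the current character to state 4 instead of swallowing it, so that something like "<p>" or "<img>" drops back to state 0 at its own '>'. Given a caller-supplied array such as char urls[4][IMG_URL_MAX], a tag like <IMG  SRC="a.png" alt='>'> stores a.png, and the quoted '>' does not end the tag early.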
And so on. You can see the general outline -- and you can see some
of the complications. You must account for comments! You must account
for the possibility that what looks like the end of a comment is in
the middle of a quoted string! You should probably take into account
whether you are in an OBJECT or a SCRIPT, since any IMG in those is
not necessarily going to be shown. You should probably take into
account that if you are within a LAYER, the layer might not be
visible. You should probably take into account that if you are
inside a FRAMESET section, nothing there is displayed directly:
a FRAMESET can only hold references to the frame files it imports.
You should probably also take into account that within a FRAMESET
section you should be chasing the URLs named there, because
images referenced in those files will be shown. You should probably take
into account that an IMG reference in a HEAD section will not be
displayed. And probably two or three fortnights' worth of further complications.
Frankly, if your C and programming experience is such that you
didn't know how to go about starting this, then there is
virtually no chance that you can properly implement it in C within
your "day or two" timeframe. HTML parsing has lots of gotchas.
It would probably be faster for you to learn the rudiments of Perl and
call upon the LWP module to extract the IMG tags for you, than it would
be for you to write the parser in C.
-- If a troll and a half can hook a reader and a half in a posting and a half, how many readers can six trolls hook in six postings?