Re: can I know how to write a html parser in C

From: Walter Roberson (roberson_at_ibd.nrc-cnrc.gc.ca)
Date: 02/24/05


Date: 23 Feb 2005 23:02:22 GMT

In article <1109197203.199764.207570@z14g2000cwz.googlegroups.com>,
WUV999U <usbharath.ganesh@gmail.com> wrote:
:that was a great suggestion Jarmo. As you said, I can use Perl.
:But m afraid m not used to it.

Too bad, it'd be faster to write the program.

:I need to get this done in a day or so..
:If I use C, how do I go about it?

State machine.

set state = 0
On each iteration of the loop, fetch one character

State 0: if the character is < then set matchlen = 0 and transit to state 1
otherwise discard the character and stay in state 0

State 1: if tolower(character) is "img"[matchlen] then matchlen++;
if matchlen is now 3 then transit to state 2, else stay in state 1.
If the character did not match, then if it is ! transit to state 3,
else transit to state 4

State 2: recognize and discard whitespace (including newline).
When you get the first non-whitespace character, then if you had
no whitespace or if tolower(character) is not 'h' then transit to state 4
else transit to state 5

State 3: you might be in a comment. Do what you need to to figure out
if you have a valid start of comment. When you have determined that you
do, go to state 6; if you don't, go to state 4

State 4: you are either not in an IMG tag or you are recovering from
an error. In either case, you are not presently in quotes. accept and
discard characters until you either get a '>' or you hit quotes; if you
hit quotes, transit to a quote-absorption state

State 5: you have recognized up to "<img h". recognize and accept
characters that match "ref=\"" and then enter url acceptance mode;
if you hit something else, go to state 4

State 6: you are in a comment. accept and discard all characters until
you find an end-of-comment marker or you find quotes. At end of comment
go back to state 0; at quotes, go to state 7; otherwise stay in state 6

State 7: you are inside quotes inside a comment. accept and discard
all characters until you find an unescaped end of quote. When you
do, go back to state 6; until then stay in state 7
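
Here is a rough sketch of how states 0, 1, 2 and 4 plus the
quote-absorption state might look in C. It is only an illustration under
some assumptions: it reads from a string instead of fetching characters
from a file, it only counts tags named "img" rather than pulling out a
URL, and it leaves out the comment states (3, 6 and 7) entirely. The
names count_img_tags, skip_step and the S_* states are made up for this
example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical state names; the numbers refer to the outline above. */
enum state {
    S_TEXT,        /* state 0: outside any tag */
    S_NAME,        /* state 1: matching "img" after '<' */
    S_AFTER_NAME,  /* state 2 (simplified): just matched "img" */
    S_SKIP,        /* state 4: discarding the rest of a tag */
    S_QUOTE        /* quote absorption inside a tag */
};

/* One step of state 4: stop at '>', or remember which quote was opened. */
static enum state skip_step(int c, int *quote)
{
    if (c == '>')
        return S_TEXT;
    if (c == '"' || c == '\'') {
        *quote = c;
        return S_QUOTE;
    }
    return S_SKIP;
}

/* Counts tags whose name is exactly "img", case-insensitively.
 * Comments are NOT handled in this sketch. */
size_t count_img_tags(const char *html)
{
    static const char name[] = "img";
    enum state st = S_TEXT;
    size_t matchlen = 0, count = 0;
    int quote = 0;

    assert(html != NULL);
    for (const unsigned char *p = (const unsigned char *)html; *p; p++) {
        int c = *p;
        switch (st) {
        case S_TEXT:
            if (c == '<') {
                matchlen = 0;
                st = S_NAME;
            }
            break;
        case S_NAME:
            if (tolower(c) == name[matchlen]) {
                if (++matchlen == sizeof name - 1)
                    st = S_AFTER_NAME;
            } else {
                st = skip_step(c, &quote);   /* some other tag */
            }
            break;
        case S_AFTER_NAME:
            /* "img" must be followed by whitespace, '/' or '>' */
            if (isspace(c) || c == '/' || c == '>')
                count++;
            st = skip_step(c, &quote);
            break;
        case S_SKIP:
            st = skip_step(c, &quote);
            break;
        case S_QUOTE:
            if (c == quote)
                st = S_SKIP;                 /* closing quote found */
            break;
        }
    }
    return count;
}
```

With this sketch, count_img_tags("<IMG SRC=\"a.png\">") gives 1, while
an <img> that only appears inside a quoted attribute value of another
tag is not counted. Each complication listed below would mean more
states on top of this.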

And so on. You can see the general outline -- and you can see some
of the complications. You must account for comments! You must account
for the possibility that what looks like the end of a comment is in
the middle of a quoted string! You should probably take into account
whether you are in an OBJECT or JavaScript, since any IMG in those is
not necessarily going to be shown. You should probably take into
account that if you are within a LAYER that the layer might not be
visible. You should probably take into account that if you are
inside a FRAMES section that nothing there will ever be displayed:
FRAMES sections can only have references to the frame files they import.
You should probably take into account that if you are within a
FRAMES section that you should be chasing the URLs named there because
images referenced in them will be shown. You should probably take
into account that an IMG reference in a HEAD section will not be
displayed. And probably two or three fortnights' worth of further complications.

Frankly, if your C and programming experience is not strong enough
that you didn't know how to go about starting this, then there is
virtually no chance that you can properly implement it in C within
your "day or two" timeframe. HTML parsing has lots of Gotcha!'s.

It would probably be faster for you to learn the rudiments of Perl and
call upon the LWP module to extract the IMG tags for you, than it would
be for you to write the parser in C.

-- 
If a troll and a half can hook a reader and a half in a posting and a half, 
how many readers can six trolls hook in six postings?

