Re: Stripping html



Bill Latvin (in 448d0fe5.1062342515@xxxxxxxxxxxxxxxx) said:

| On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey"
| <mrdovey@xxxxxxxx> wrote:
|
|| Medros (in 1150079409.565305.268380@xxxxxxxxxxxxxxxxxxxxxxxxxxxx)
|| said:
||
||| I understand that you can strip HTML out of a text file so that
||| all that is left is the visible information that is needed (e.g.
||| everything that has < > around it is gone). My question is that
||| I have a table of information that I need to feed into a program
||| as is. That is, I need the program to read it just as you would
||| on paper and be able to use that information as if it had been
||| entered. I am unsure how to strip so much away to leave just the
||| information I want and then use it the way I want. Any help?
||
|| Start with a simple program that reads and saves one character at a
|| time looking for a '<' character. When it finds a '<', it should
|| throw it (and following characters) away until it finds a '>'.
|| When the program reaches end-of-file, hopefully it's saved what
|| you want to keep.
||
| I remember starting with a simple program like that, and finding to
| my dismay that between the "script" and "/script" tags the '<' and
| '>' characters are used not as tag delimiters but as "greater than"
| and "less than" comparison operators. I had to check for those
| particular tags and discard everything between them, and not let
| the presence of a lone unbalanced '<' in the script cause my logic
| to miss finding the "/script" tag.

Welcome to the club. It's because of things like that that I added my
second paragraph:

"You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;), but those can wait on
getting the initial version working."
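
For what it's worth, here is a rough sketch of that starting point with
the entity refinement bolted on. It works on a string rather than a
file to keep it short. The function name strip_html and the four-entry
entity table are my own inventions for illustration, nothing standard,
and it deliberately ignores the <script> problem described above.

```c
#include <string.h>

/* Hypothetical helper: copy src to dst, discarding everything from a
   '<' through the next '>' and decoding a handful of entities.
   dst must have room for at least strlen(src) + 1 bytes. */
void strip_html(const char *src, char *dst)
{
    /* Small, deliberately incomplete entity table (an assumption). */
    static const struct { const char *name; char ch; } ents[] = {
        { "&nbsp;", ' ' }, { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' }
    };
    const size_t nents = sizeof ents / sizeof ents[0];
    int in_tag = 0;                 /* nonzero while between '<' and '>' */

    while (*src != '\0') {
        if (in_tag) {
            if (*src == '>')
                in_tag = 0;
            src++;                  /* throw the tag character away */
        } else if (*src == '<') {
            in_tag = 1;
            src++;
        } else if (*src == '&') {
            size_t i;
            for (i = 0; i < nents; i++) {
                size_t len = strlen(ents[i].name);
                if (strncmp(src, ents[i].name, len) == 0) {
                    *dst++ = ents[i].ch;
                    src += len;
                    break;
                }
            }
            if (i == nents)         /* unknown entity: keep the '&' */
                *dst++ = *src++;
        } else {
            *dst++ = *src++;        /* ordinary text: keep it */
        }
    }
    *dst = '\0';
}
```

Calling strip_html("<b>Hi</b> &lt;there&gt;", buf) should leave
"Hi <there>" in buf. The output is never longer than the input, so a
buffer the size of the input is enough.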

The refinements will depend on whether the OP wants a general solution
or just enough to extract data from one particular page. On
re-reading, I'd guess that handling <table>, <tr>, and <td> tags may
be his first refinement, but the question indicated that he'll
probably need to start at the most basic level.
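
If the table refinement does turn out to be what's wanted, one possible
shape for it is sketched below: drop every tag as before, but put a tab
between cells and a newline at the end of each row, so the result can
be read back a line at a time. The name table_to_text is hypothetical,
and the sketch assumes reasonably tidy markup (rows closed with </tr>,
no nested tables, no entities).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: copy src to dst, turning table markup into
   tab-separated lines. All other tags are simply dropped.
   dst must have room for at least strlen(src) + 1 bytes. */
void table_to_text(const char *src, char *dst)
{
    int cells = 0;                  /* cells seen so far in this row */

    while (*src != '\0') {
        char name[8];               /* tag name, lower-cased, truncated */
        size_t n = 0;

        if (*src != '<') {
            *dst++ = *src++;        /* ordinary text: keep it */
            continue;
        }
        src++;                      /* skip the '<' */

        /* Collect the tag name, including any leading '/'. */
        while (*src != '\0' && *src != '>' &&
               !isspace((unsigned char)*src)) {
            if (n < sizeof name - 1)
                name[n++] = (char)tolower((unsigned char)*src);
            src++;
        }
        name[n] = '\0';

        /* Skip any attributes, up to and including the '>'. */
        while (*src != '\0' && *src != '>')
            src++;
        if (*src == '>')
            src++;

        if (strcmp(name, "tr") == 0) {
            cells = 0;              /* new row */
        } else if (strcmp(name, "td") == 0 || strcmp(name, "th") == 0) {
            if (cells++ > 0)
                *dst++ = '\t';      /* separator before 2nd, 3rd, ... cell */
        } else if (strcmp(name, "/tr") == 0) {
            *dst++ = '\n';          /* end of row */
        }
    }
    *dst = '\0';
}
```

Given "<tr><td>Name</td><td>Qty</td></tr>" this should produce
"Name", a tab, "Qty", and a newline, which fgets() and strtok() or
sscanf() can then pick apart.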

--
Morris Dovey
DeSoto Solar
DeSoto, Iowa USA
http://www.iedu.com/DeSoto

