Re: Stripping html
- From: "Morris Dovey" <mrdovey@xxxxxxxx>
- Date: Mon, 12 Jun 2006 06:07:10 -0500
Bill Latvin (in 448d0fe5.1062342515@xxxxxxxxxxxxxxxx) said:
| On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey"
| <mrdovey@xxxxxxxx> wrote:
|
|| Medros (in 1150079409.565305.268380@xxxxxxxxxxxxxxxxxxxxxxxxxxxx)
|| said:
||
||| I understand that you can strip html out of a txt file so that all
||| that is left is the visible information that is needed
||| (e.g. everything that has < > around it is gone). My question is that
||| I have a table of information that I need to be fed into a program
||| as such. Well, kind of: I need the program to read it just as you
||| would on paper and be able to use that information as if it had been
||| entered. I am unsure how to strip so much away to leave me with just
||| the information I want and then use it the way I want. Any help?
||
|| Start with a simple program that reads and saves one character at a
|| time looking for a '<' character. When it finds a '<', it should
|| throw it (and following characters) away until it finds a '>'.
|| When the program reaches end-of-file, hopefully it's saved what
|| you want to keep.
||
| I remember starting with a simple program like that, and finding to
| my dismay that between the "script" and "/script" tags the '<' and
| '>' characters are used not as tag delimiters but as "greater than"
| and "less than" comparison operators. I had to check for those
| particular tags and discard everything between them, and not let
| the presence of a lone unbalanced '<' in the script cause my logic
| to miss finding the "/script" tag.
Welcome to the club. It's because of things like that that I added my
second paragraph:
"You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;) - but those can wait on
getting the initial version working."
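
For anyone who wants to see the idea in code, here is a minimal sketch in Python (my choice of language for illustration; nobody in the thread has named one). The function names strip_tags, drop_scripts and strip_html are hypothetical. It shows the basic one-character-at-a-time version, then the two refinements mentioned so far: discarding everything between the script tags, and decoding entities such as &nbsp; and &lt;. It is not a general HTML parser; comments, '>' inside attribute values and similar cases are ignored.

```python
import html
import re

def strip_tags(text):
    """Copy characters, discarding everything from '<' through the next '>'."""
    kept = []
    in_tag = False
    for ch in text:
        if in_tag:
            if ch == '>':
                in_tag = False      # tag finished; resume keeping characters
        elif ch == '<':
            in_tag = True           # start discarding
        else:
            kept.append(ch)
    return ''.join(kept)

def drop_scripts(text):
    """Refinement: remove whole <script ...> ... </script> blocks first,
    so a lone '<' used as a comparison operator cannot confuse strip_tags."""
    return re.sub(r'(?is)<script\b[^>]*>.*?</script\s*>', '', text)

def strip_html(text):
    """Scripts out, tags out, then decode entities such as &lt; and &nbsp;."""
    # Entities are decoded last so a decoded '<' is never mistaken for a tag.
    return html.unescape(strip_tags(drop_scripts(text)))
```

Calling strip_html on the text of a saved page returns the visible text as a single string.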
The refinements will depend on whether the OP wants a general solution
or just enough to extract data from one particular page. On
re-reading, I'd guess that <table>, <tr>, and <td> tags may be his
first refinement - but the question indicated that he'll probably need
to start at the most basic level.
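
If the table refinement does turn out to be the one that matters, the following is one possible sketch, again in Python and again only an illustration under assumptions: it swaps the hand-written character loop for the standard library's html.parser.HTMLParser, the names TableReader and read_table are hypothetical, and it assumes a single simple table with no nesting. Each <tr> becomes a list, and each <td> or <th> becomes one string in that list.

```python
from html.parser import HTMLParser

class TableReader(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities decoded for us
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside <tr>
        self._cell = None   # text pieces of the current cell, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append(''.join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._close_cell()          # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:      # keep text only inside a cell
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._close_cell()
        elif tag == 'tr' and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

def read_table(page):
    reader = TableReader()
    reader.feed(page)
    reader.close()
    return reader.rows
```

The result is a list of rows that a program can index much as one would read the table on paper, for example rows[1][0] for the first cell of the second row.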
--
Morris Dovey
DeSoto Solar
DeSoto, Iowa USA
http://www.iedu.com/DeSoto