Re: HTML tag Parsing and extracting data.
- From: chihung@xxxxxxxxxxxxxx
- Date: Tue, 30 Oct 2007 17:58:03 -0700
On Oct 29, 9:45 pm, schl...@xxxxxxxxxxxxxxxx wrote:
majidkha...@xxxxxxxxx wrote:
Hi,
I am new to Tcl. Let me explain what I want to do; so far I have tried
and failed.
I would like to parse an HTML page, for example:
http://www.cmcelectronics.ca/En/Careers/job_display_en.php?JOB_ID=511
or
http://www.sanjel.com/careers/jobDesc.cfm?numJobBoardID=440
and I want to extract the text under headings such as "Duties &
Responsibilities:", "Description:", "Summary" or "Responsibilities".
So I am looking for code that is generic enough that if we pass
"descriptions", "description", any of the keywords mentioned above, or
any other keyword, it finds and extracts the information under that
keyword or heading.
A truly 'generic' solution is hard here because most HTML is tag soup
with no consistent annotation (and harder still if JavaScript is
involved).
The simple sledgehammer is regexp; a little more sophisticated is
something like tcllib's htmlparse or tDOM in HTML mode. Even tclwebtest
might be helpful.
See: http://wiki.tcl.tk/2204
http://wiki.tcl.tk/tdom
Michael
I think tDOM is the best tool for web scraping. The best part of tDOM
is that it can handle well-formed XML as well as HTML.
My blog has some sample code that you can use as a guideline:
http://chihungchan.blogspot.com/search/label/Web%20Scraping
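
As a rough illustration of the tDOM approach, here is a minimal sketch. It
assumes the section heading sits in a heading or bold element and that the
wanted text is in the next sibling element, which will not hold for every
page. The proc name sectionText and the sample HTML are made up for this
example; a real page would first be fetched, e.g. with the http package.

```tcl
package require tdom

# Hypothetical sample page standing in for a fetched job posting.
set html {
<html><body>
<h3>Summary</h3>
<p>Builds avionics software.</p>
<h3>Duties &amp; Responsibilities:</h3>
<ul><li>Write code</li><li>Review designs</li></ul>
</body></html>
}

# Return the text of the element that follows the first heading-like
# element whose text contains $keyword (case-insensitive), or "" if
# nothing matches. The list of heading tags is an assumption.
proc sectionText {html keyword} {
    set doc  [dom parse -html [string trim $html]]
    set root [$doc documentElement]
    set kw   [string tolower $keyword]
    set result {}
    foreach node [$root selectNodes {//h1|//h2|//h3|//h4|//b|//strong}] {
        if {[string first $kw [string tolower [$node asText]]] < 0} {
            continue
        }
        # Skip over text nodes until the next element sibling.
        set sib [$node nextSibling]
        while {$sib ne "" && [$sib nodeType] ne "ELEMENT_NODE"} {
            set sib [$sib nextSibling]
        }
        if {$sib ne ""} {
            set result [string trim [$sib asText]]
        }
        break
    }
    $doc delete
    return $result
}

puts [sectionText $html "summary"]
puts [sectionText $html "responsibilities"]
```

Pages that put the heading and the body text in table cells or nested
divs would need a different XPath expression, so treat the sibling walk
above as a starting point rather than a generic solution.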