Re: HTML tag Parsing and extracting data.



On Oct 29, 9:45 pm, schl...@xxxxxxxxxxxxxxxx wrote:
majidkha...@xxxxxxxxx wrote:
Hi,

I am new to Tcl. Let me explain what I want to do; I have been trying
so far without success.

I would like to parse an HTML page, for example:

http://www.cmcelectronics.ca/En/Careers/job_display_en.php?JOB_ID=511
or
http://www.sanjel.com/careers/jobDesc.cfm?numJobBoardID=440

and I want to extract the data under headings such as "Duties &
Responsibilities:", "Description:", "Summary" or "Responsibilities".
So I am looking for code that is generic enough that, if we pass it
"descriptions" or "description" or any of the keywords I mentioned
above (or any other keyword), it finds and extracts the information
related to that keyword or heading.

Basically, 'generic' is hard in this regard because HTML has turned
into tag soup and isn't properly annotated at all (even harder if
JavaScript is involved).

A simple sledgehammer would be regexp; a little more sophisticated
would be something like tcllib htmlparse or tDOM in HTML mode. Even
tclwebtest might be helpful.
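
To make the sledgehammer option concrete, here is a minimal sketch
using only built-in regexp/regsub. The proc name extractSection, the
sample HTML and the rule "a section ends at the next <b>, <strong> or
<h1>..<h6> tag" are assumptions made for illustration; they are not
taken from the pages linked above, and real tag soup will need
adjusting.

```tcl
# Hypothetical helper: return the text that follows a heading keyword,
# up to the next bold or heading tag (assumed to start the next section).
proc extractSection {html keyword} {
    # Backslash-escape non-word characters so the keyword matches literally.
    regsub -all {\W} $keyword {\\&} kw
    # Locate the keyword, ignoring case; give up if it is absent.
    if {![regexp -nocase -indices -- $kw $html idx]} {
        return ""
    }
    set rest [string range $html [expr {[lindex $idx 1] + 1}] end]
    # Cut at the next opening <b>, <strong> or <h1>..<h6> tag, if any.
    if {[regexp -nocase -indices {<(b|strong|h[1-6])[\s>]} $rest cut]} {
        set rest [string range $rest 0 [expr {[lindex $cut 0] - 1}]]
    }
    # Replace remaining tags with spaces and collapse whitespace.
    regsub -all {<[^>]*>} $rest { } text
    regsub -all {\s+} $text { } text
    return [string trim $text " :"]
}

# Made-up sample page, for illustration only.
set html {<html><body><h2>Job</h2><b>Description:</b><br>Design and test avionics software.<p><b>Summary</b> Full-time position.</p></body></html>}

puts [extractSection $html "description"]
# prints: Design and test avionics software.
```

This only works when the keyword appears literally in the markup and
the next section really does start with one of those tags, which is
why it is a sledgehammer.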

See: http://wiki.tcl.tk/2204
http://wiki.tcl.tk/tdom

Michael

I think tDOM is the best tool for web scraping. The best part of tDOM
is that it can handle well-formed XML as well as HTML.

My blog has some sample code that you can use as a guideline:
http://chihungchan.blogspot.com/search/label/Web%20Scraping
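
As a rough idea of the tDOM approach, here is a minimal sketch. It
assumes the tdom package is installed, that section headings are
<b>, <strong> or <h1>..<h3> elements with lower-case tag names in the
parsed tree, and that the section body consists of the heading's
following siblings. The proc name sectionText and the sample HTML are
made up for illustration, and the keyword must not contain a single
quote because it is pasted into the XPath expression.

```tcl
package require tdom

# Hypothetical helper: collect the text of the siblings that follow the
# first heading containing the keyword, stopping at the next heading.
proc sectionText {html keyword} {
    set doc  [dom parse -html $html]
    set root [$doc documentElement]
    set upper ABCDEFGHIJKLMNOPQRSTUVWXYZ
    set lower abcdefghijklmnopqrstuvwxyz
    set kw [string tolower $keyword]
    # XPath 1.0 has no lower-case(), so translate() folds ASCII case.
    set xp "//*\[self::b or self::strong or self::h1 or self::h2 or self::h3\]\[contains(translate(., '$upper', '$lower'), '$kw')\]"
    set heading [lindex [$root selectNodes $xp] 0]
    set parts {}
    if {$heading ne ""} {
        for {set n [$heading nextSibling]} {$n ne ""} {set n [$n nextSibling]} {
            if {[$n nodeType] eq "ELEMENT_NODE"} {
                # Another heading is assumed to start the next section.
                if {[string tolower [$n nodeName]] in {b strong h1 h2 h3}} break
                lappend parts [$n asText]
            } else {
                lappend parts [$n nodeValue]
            }
        }
    }
    $doc delete
    # Collapse runs of whitespace left over from the markup.
    return [string trim [regsub -all {\s+} [join $parts " "] " "]]
}

# Made-up sample page, for illustration only.
set page {<html><body><h2>Description</h2><p>Design and test avionics software.</p><h2>Summary</h2><p>Full-time position.</p></body></html>}
puts [sectionText $page "description"]
```

Pages that nest the heading inside a table cell or paragraph need a
different walk (for example starting from the heading's parent), so
treat this as a starting point only.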

