Re: HTML tag Parsing and extracting data.




majidkha...@xxxxxxxxx wrote:
Hi,

I am new to Tcl. Let me explain what I want to do; so far I have tried
and failed.

I would like to parse an HTML page, for example:

http://www.cmcelectronics.ca/En/Careers/job_display_en.php?JOB_ID=511
or
http://www.sanjel.com/careers/jobDesc.cfm?numJobBoardID=440

and I want to extract the data under headings such as "Duties &
Responsibilities:", "Description:", "Summary" or "Responsibilities".
So I am looking for code that is generic enough that if we pass it
"descriptions" or "description", or any of the keywords I mentioned
above (or any other), it finds and extracts the information related to
that keyword or heading.

Basically, 'generic' is hard in this regard because HTML in the wild has
turned into tag soup and isn't properly annotated at all (even harder if
JavaScript is involved).

A simple sledgehammer would be regexp; a little more sophisticated would
be something like tcllib's htmlparse or tDOM in HTML mode. Even
tclwebtest might be helpful.
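
To give a rough idea of the sledgehammer route, here is a minimal sketch
in plain Tcl (no extra packages). The proc names htmlLines and
extractSection are made up for this example, and so is the sample page.
It assumes that a section heading is any text line containing the
keyword and that the section ends at the next line ending in a colon;
real pages will often break both assumptions.

```tcl
# Hypothetical helper: flatten HTML into a list of non-empty text lines.
proc htmlLines {html} {
    # Replace every tag with a newline so block boundaries survive.
    regsub -all {<[^>]+>} $html "\n" text
    # Decode a few common entities; this list is far from complete.
    set text [string map {&nbsp; " " &lt; < &gt; > &quot; \" &amp; &} $text]
    set lines {}
    foreach line [split $text "\n"] {
        set line [string trim $line]
        if {$line ne ""} { lappend lines $line }
    }
    return $lines
}

# Hypothetical helper: return the text lines that follow the first line
# containing $keyword (case-insensitive), up to the next line that ends
# in ":" (assumed to be the next heading) or the end of the page.
proc extractSection {html keyword} {
    set want [string tolower $keyword]
    set body {}
    set inSection 0
    foreach line [htmlLines $html] {
        if {!$inSection} {
            # Plain substring test, so the keyword needs no regexp escaping.
            if {[string first $want [string tolower $line]] >= 0} {
                set inSection 1
            }
        } elseif {[string index $line end] eq ":"} {
            break
        } else {
            lappend body $line
        }
    }
    return [join $body "\n"]
}

# Made-up sample page, only for illustration.
set page {<h3>Summary:</h3><p>Maintain avionics test rigs.</p>
<h3>Duties &amp; Responsibilities:</h3><ul><li>Write test scripts</li><li>Review designs</li></ul>
<h3>Qualifications:</h3><p>Tcl experience</p>}

puts [extractSection $page "responsibilities"]
```

With the sample page this prints "Write test scripts" and "Review
designs" on two lines. Text that sits on the same line as the heading is
dropped, and anything inside script or style tags is not filtered out,
which is part of why a real parser such as htmlparse or tDOM is the
better tool once the pages get messy.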

See: http://wiki.tcl.tk/2204
http://wiki.tcl.tk/tdom

Michael



