Re: Extracting bolds and italics from HTML



"Ezee" <ezeeonthegunzdarling@xxxxxxxxx> writes:

> Hi,
>
> I am trying to make a web crawler which will be topic focused. For
> this, I have to make some calculations on the contents of url before
> adding that url into my database.
> I had found a very useful program of Word Count from sun java forum,
> but its problem is that it also includes the HTML tags in calculation.
> Can anybody please tell me is there any Java api or online help
> available for
>
> i) A program which counts words in HTML file but doesnt include HTML
> tags.

With http://www.ebi.ac.uk/~kirsch/monq-doc/monq/programs/Grep.html
you can do things like

java monq.programs.Grep '<[^>]+>' '' '[A-Za-z]+' '%0\n' <yourhtml.html

on the command line to fetch all words that do not belong to a
tag. The mechanism behind it is
http://www.ebi.ac.uk/~kirsch/monq-doc/monq/jfa/Nfa.html which you can
use programmatically.
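
For readers who want to try the idea without downloading anything, here
is a minimal sketch that uses the same two patterns as the command line
above. It relies on the standard java.util.regex package, not on the
monq Nfa API. The class and method names (HtmlWordCount, countWords) are
made up for illustration, and replacing each tag with a space is an
assumption made here so that words on either side of a tag stay apart.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlWordCount {
    // The same two patterns as in the Grep command line:
    // a tag, and a run of ASCII letters.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Counts runs of ASCII letters that lie outside of <...> tags. */
    static int countWords(String html) {
        // Assumption: swap each tag for a space so that the words on
        // either side of it are not glued together.
        String text = TAG.matcher(html).replaceAll(" ");
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>bold</b> and <em>italic</em> world</p>";
        System.out.println(countWords(html)); // prints 5
    }
}
```

Like the command line, this is regex-level tag stripping, so comments,
script blocks and a '>' inside an attribute value will confuse it.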

> ii) A program which counts only Bolds and Italics in HTML file.

This would require looking for `<b>' and `<em>' tags, which can easily
be added as pattern/action pairs to the Nfa doing the word counting.
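
A rough sketch of the tag counting, again with plain java.util.regex
and not with the monq pattern/action API. It reads "counts Bolds and
Italics" as counting opening tags, which is one possible reading of the
question. Treating `<strong>' as bold and `<i>' as italic, and the names
EmphasisCount, countBold and countItalic, are assumptions made for this
example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmphasisCount {
    // Opening tags only. The optional group allows attributes, and the
    // required whitespace before them keeps <body>, <br> and <img>
    // from being counted.
    private static final Pattern BOLD =
        Pattern.compile("<(b|strong)(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ITALIC =
        Pattern.compile("<(i|em)(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    private static int count(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Number of opening b or strong tags. */
    static int countBold(String html) {
        return count(BOLD, html);
    }

    /** Number of opening i or em tags. */
    static int countItalic(String html) {
        return count(ITALIC, html);
    }

    public static void main(String[] args) {
        String html = "<body><b>one</b> <STRONG>two</STRONG> "
                    + "<i>three</i> <em class=\"x\">four</em><br></body>";
        System.out.println(countBold(html));   // prints 2
        System.out.println(countItalic(html)); // prints 2
    }
}
```

If the goal is to count the words inside bold or italic text, not the
tags themselves, the code would also need to track the closing tags.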

I am off to the pub now, otherwise I would've written the class, max
20 lines:-) To download the software see signature.

Harald.

--
---------------------+---------------------------------------------
Harald Kirsch (@home)|
Java Text Crunching: http://www.ebi.ac.uk/Rebholz-srv/whatizit/software