Re: Extracting bolds and italics from HTML
- From: Harald <pifpafpuf@xxxxxx>
- Date: Tue, 26 Jul 2005 18:28:36 GMT
"Ezee" <ezeeonthegunzdarling@xxxxxxxxx> writes:
> Hi,
>
> I am trying to make a web crawler which will be topic focused. For
> this, I have to make some calculations on the contents of url before
> adding that url into my database.
> I had found a very useful program of Word Count from sun java forum,
> but its problem is that it also includes the HTML tags in calculation.
> Can anybody please tell me whether there is any Java API or online help
> available for
>
> i) A program which counts words in an HTML file but doesn't include HTML
> tags.
With http://www.ebi.ac.uk/~kirsch/monq-doc/monq/programs/Grep.html
you can do things like
  java monq.programs.Grep '<[^>]+>' '' '[A-Za-z]+' '%0\n' <yourhtml.html
on the command line to fetch all words that do not belong to a
tag. The mechanism behind it is
http://www.ebi.ac.uk/~kirsch/monq-doc/monq/jfa/Nfa.html which you can
use programmatically.
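For comparison, here is a rough sketch of the same idea using only
java.util.regex from the standard library instead of monq. It reuses the
two patterns from the command line above; the class and method names are
made up for illustration, and the tag pattern is a heuristic, not a real
HTML parser (comments, scripts and a ">" inside an attribute value will
confuse it).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of monq: counts words outside of HTML tags.
class HtmlWordCount {
    // Same two patterns as in the Grep command line.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Counts runs of ASCII letters that are not part of a tag. */
    static int countWords(String html) {
        // Replace each tag with a space so words on either side stay separate.
        String text = TAG.matcher(html).replaceAll(" ");
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample input only; read the real page from a file or URL instead.
        System.out.println(countWords("<p>Hello <b>big</b> world</p>"));
    }
}
```

Note that '[A-Za-z]+' only matches ASCII letters, so "don't" counts as
two words and accented characters split a word. Adjust the word pattern
if that matters for your pages.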
> ii) A program which counts only Bolds and Italics in HTML file.
This would require looking for `<b>' and `<em>' tags, which can easily be
added as pattern/action pairs to the Nfa doing the word counting.
I am off to the pub now, otherwise I would've written the class, max
20 lines:-) To download the software see signature.
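A class along those lines might look roughly like the sketch below. It
uses plain java.util.regex rather than the monq Nfa, so it is not the
pattern/action version described above, and the class and method names
are invented for the example. It counts opening `<b>' and `<em>' tags
only; if the pages also use `<i>' or `<strong>', those would have to be
added to the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of monq: counts opening <b> and <em> tags.
class BoldItalicCount {
    // Opening <b> or <em> tag with optional attributes, case-insensitive.
    // Requiring whitespace or '>' after the name keeps <br>, <body>, <embed> out.
    private static final Pattern OPEN_TAG =
        Pattern.compile("<(b|em)(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Returns a two-element array: {number of <b> tags, number of <em> tags}. */
    static int[] count(String html) {
        int bold = 0;
        int em = 0;
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase("b")) {
                bold++;
            } else {
                em++;
            }
        }
        return new int[] { bold, em };
    }

    public static void main(String[] args) {
        int[] c = count("<p><b>one</b> <em>two</em> <B class=\"x\">three</B></p>");
        System.out.println("b=" + c[0] + " em=" + c[1]); // prints b=2 em=1
    }
}
```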
Harald.
--
---------------------+---------------------------------------------
Harald Kirsch (@home)|
Java Text Crunching: http://www.ebi.ac.uk/Rebholz-srv/whatizit/software