Re: building a meta search engine



Thanks for the reply. I am aware of Java's capabilities and of how this
task is done in theory. I am more interested in performance and
practical issues. Parsing HTML is not a lightweight task, so languages
like PHP or Perl are hardly suitable for it. I don't want to touch C, so
Java seems to me like the best option... Another thing is that writing
a parser from scratch seems like an awful waste of time to me,
considering that languages like Python come with a basic XML/HTML parser
out of the box. Are there any good Java parsers out there? Things like
POST/GET requests, support for multiple results pages, and a high degree
of customization would be nice.


Roman

deep wrote:
Java is perfectly suitable for designing a meta search engine.
Read the HTML page, match its content against the search item, and
fetch each URL it contains; then open the page at that URL, read it, and
match again... repeat this in a loop...
and finally list the matching URLs.

There is no standard API for this process.
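
For what it's worth, the loop described above could be sketched roughly as
follows. This is only an illustration under stated assumptions: the class
name LinkLoopSketch, the method names, and the maxPages limit are made up
for the example, the page fetcher is passed in as a function so the loop
itself stays independent of how HTTP is done, and the regex only catches
double-quoted href attributes (a real HTML parser would be more robust).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "read, match, follow links, repeat" loop.
class LinkLoopSketch {
    // Naive pattern: only handles href="..." with double quotes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect every href value found in the page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Visit pages breadth-first from startUrl and return the URLs whose
    // content contains the search term (case-insensitive).
    // fetcher maps a URL to its HTML, or null if the page is unavailable.
    static List<String> findMatchingPages(String startUrl, String term,
                                          Function<String, String> fetcher,
                                          int maxPages) {
        List<String> hits = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        seen.add(startUrl);
        String needle = term.toLowerCase();
        int visited = 0;
        while (!queue.isEmpty() && visited < maxPages) {
            String url = queue.poll();
            visited++;
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // page could not be read, skip it
            }
            if (html.toLowerCase().contains(needle)) {
                hits.add(url);
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // true only for links not seen before
                    queue.add(link);
                }
            }
        }
        return hits;
    }
}
```

The seen set keeps the loop from revisiting pages that link back to each
other, and maxPages puts an upper bound on how far it wanders.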

RoS wrote:
Hello there,

I am building a web application, which involves submitting search
queries to a number of sites, processing and parsing search results and
returning them in an organized way. Basically, a meta search engine. As
there are no search APIs for those sites, nor can I access their
databases, I'll have to process raw HTML files and build a unique
parser for each site. As an underlying platform I use J2EE, Servlets and
Tomcat.

- Are there any ready-made Java open-source packages that would deal
with the task of handling POST/GET requests, parsing HTML and organizing
data?
- Is Java a suitable choice for this task? I was originally planning to
use PHP (mostly because I'd like to learn it), but considering this task
is quite CPU intensive, I opted for Java. Python is another viable option.
- Does parsing HTML files seem feasible at all? Considering a single
change in the target site search page structure would require changes to
its parser, this approach looks painful. On the other hand, I have no
idea about an alternative solution, other than asking site owners to
grant database access or build a simple search API (on second thought,
that approach seems even more painful).
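
One way to keep the "one parser per site" idea manageable might look
like the sketch below. It is not a known library API: SiteParser,
RegexSiteParser, SearchResult and MetaSearchSketch are hypothetical names,
and the regex-based extraction is just a stand-in for whatever HTML parser
ends up being used. The point is only that each site's markup knowledge
lives in one small object, so a layout change on a site touches one parser
and nothing else.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One normalized search hit, independent of the site it came from.
final class SearchResult {
    final String title;
    final String url;

    SearchResult(String title, String url) {
        this.title = title;
        this.url = url;
    }

    @Override
    public String toString() {
        return title + " -> " + url;
    }
}

// Each target site gets its own implementation of this interface.
interface SiteParser {
    List<SearchResult> parse(String html);
}

// Simple regex-driven parser: group 1 must capture the URL, group 2 the title.
final class RegexSiteParser implements SiteParser {
    private final Pattern pattern;

    RegexSiteParser(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public List<SearchResult> parse(String html) {
        List<SearchResult> out = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            out.add(new SearchResult(m.group(2).trim(), m.group(1)));
        }
        return out;
    }
}

// Looks up the parser registered for each site and merges all results.
final class MetaSearchSketch {
    private final Map<String, SiteParser> parsers = new LinkedHashMap<>();

    void register(String site, SiteParser parser) {
        parsers.put(site, parser);
    }

    List<SearchResult> merge(Map<String, String> htmlBySite) {
        List<SearchResult> all = new ArrayList<>();
        for (Map.Entry<String, String> e : htmlBySite.entrySet()) {
            SiteParser p = parsers.get(e.getKey());
            if (p != null) { // sites without a registered parser are skipped
                all.addAll(p.parse(e.getValue()));
            }
        }
        return all;
    }
}
```

When a site changes its results page, only the pattern (or the parser
class) registered for that site would need updating.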

Any thoughts/comments on the subject are greatly appreciated.


Cheers,
Roman
