Re: command-line search engine query

From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 07/23/04


Date: Fri, 23 Jul 2004 10:15:57 -0400 (EDT)


On Fri, 23 Jul 2004, igthibau wrote:
>
> Ok, I wish to know, without GUI (under linux), how to query a given search
> engine to know the position of a web site.
> Example :
> query google for www.nasa.org -> reply 4
> so google has www.nasa.org as its 4th entry

   You realize, of course, that Google doesn't "rank" sites this way.
You have to actually enter a search term or phrase in the box, and then
Google will give you a list of sites containing that term. You can't
"query Google for www.nasa.org" (unless I've been missing something!).

   But I'll assume Jens is right and you meant to query for, let's say,
"saturn," and see where NASA came up in the results.

> Also, no point sending pieces of code since these are only useful in
> a very specific environment (yours).

   Well, if you don't want code, and you don't want to learn network
programming yourself, we're kind of in a bind, aren't we? :) But
luckily we /know/ what platform you use: Linux. So you can use any
source code that's guaranteed to work on Linux. For example, this
ksh one-liner:

% lynx -source 'http://www.google.com/search?q=saturn&num=100' |
    grep '^<br><font ' | grep -n 'nasa' | pr -W 72 -Tt
5:<br><font color=#008000>www.<b>saturn</b>.de/ - 3k - </font><nobr> <
6:<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/home/index.cfm - 4
7:<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/index.cfm - 43k -
16:<br><font color=#008000>www.dpo.uab.edu/~moudry/ - 8k - </font><nobr
17:<br><font color=#008000>pds.jpl.nasa.gov/planets/welcome/mars.htm -
18:<br><font color=#008000>pds.jpl.nasa.gov/planets/choices/<b>saturn</b
45:<br><font color=#008000>www.esa.int/SPECIALS/Cassini-Huygens/ - 42k
46:<br><font color=#008000>www.nasa.gov/mission_pages/cassini/main/ - 4
47:<br><font color=#008000>nssdc.gsfc.nasa.gov/photo_gallery/ photogalle
48:<br><font color=#008000>nssdc.gsfc.nasa.gov/planetary/planets/<b>satu
51:<br><font color=#008000>www.kuro5hin.org/story/2004/7/1/93459/66714 -
52:<br><font color=#008000>ringmaster.arc.nasa.gov/<b>saturn</b>/<b>satu
59:<br><font color=#008000>www.mindspring.com/~dhanon/home.htm - 3k - <
60:<br><font color=#008000>science.nasa.gov/headlines/y2004/09jul_hailst
61:<br><font color=#008000>science.nasa.gov/headlines/y2002/13dec_<b>sat
62:<br><font color=#008000>starchild.gsfc.nasa.gov/docs/ StarChild/solar
71:<br><font color=#008000>www.cosmicelk.co.uk/Saturn.htm - 1k - Jul 2
%

   I'll break it down for you.

     lynx -source 'http://www.google.com/search?q=saturn&num=100'

This fetches the Google results page for a query on "saturn". Replace
"saturn" with your own search phrase. Separate the words with plusses
for an ordinary multi-word search, or with periods for a quoted phrase:
searching on foo would be "q=foo", searching on foo bar would be
"q=foo+bar", and searching on "foo bar" (in quotes) would be
"q=foo.bar". The "num=100" part asks for the top 100 results.
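
As a rough sketch of the word-separator rule, the URL could be built by a
small shell function. The name google_url is made up for illustration, and
it only turns spaces into plusses; any other special characters in the
phrase would still need proper URL escaping.

```shell
# Hypothetical helper (not part of any tool): build the results URL.
#   $1 = search phrase, $2 = number of results (defaults to 100)
# Assumption: only spaces need converting; nothing else is escaped.
google_url() {
    phrase=$1
    num=${2:-100}
    printf 'http://www.google.com/search?q=%s&num=%s\n' \
        "$(printf '%s' "$phrase" | tr ' ' '+')" "$num"
}
```

It would then be used as: lynx -source "$(google_url 'foo bar')"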

     | grep '^<br><font '

This looks for a special combination of tags that we're going to
assume Google prepends to all search results. This is ad-hoc and
not guaranteed to work (it's certainly not documented anywhere!),
but it works this week. ;) Might require some hacking in a month
or two, but oh well. If you want robustness, you're just going to
have to use an official Google API, or learn Perl.

     | grep -n 'nasa'

This takes all those search result lines and finds the ones containing
the word 'nasa' (in your case, perhaps 'www.nasa.org' would do better;
plain 'nasa' also returns results from e.g. JPL). It returns those lines,
and also prepends their line numbers (1 for the first line of the
Google results, 2 for the second, and so on). NB: If a part of the
'nasa' string is the same as or similar to a part of the 'saturn'
string, beware! You'll need to account for the fact that Google
marks up parts of its output like this: 'www.<b>saturn</b>.de'. I'll
leave that to the regex gurus.
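
One possible way around the markup problem, as a sketch only: strip every
HTML tag from the result lines before matching, so that
'www.<b>saturn</b>.de' becomes 'www.saturn.de'. The function name
match_plain is invented here, and the sed expression assumes no tag spans
more than one line. The -F flag makes grep treat the site name as a plain
string, so the dots are not regex wildcards.

```shell
# Hypothetical helper: read result lines on stdin, drop all HTML tags,
# then print "position:line" for every line containing the string $1.
# Assumption: each tag opens and closes on the same line.
match_plain() {
    sed 's/<[^>]*>//g' | grep -nF -- "$1"
}
```

It would replace the "grep -n 'nasa'" stage, e.g.
"... | grep '^<br><font ' | match_plain saturn.jpl.nasa.gov".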

     | pr -W 72 -Tt

This takes all /those/ lines and truncates them to 72 characters,
making the output easier to read on a terminal display. Change the
'72' to something bigger if you like.
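
As far as I know, -W and -T are GNU extensions to pr, so on a system
without GNU pr a plain cut should do the same truncation. A minimal
sketch, with truncate_lines being a made-up name:

```shell
# Hypothetical helper: keep only the first N characters of each input line.
#   $1 = width (defaults to 72)
truncate_lines() {
    cut -c "1-${1:-72}"
}
```

Used as the last stage of the pipeline: "... | grep -n 'nasa' | truncate_lines 72".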

   Now, you'll notice that this still returns a bunch of false positives,
but (1) using 'www.nasa.org' instead of just 'nasa' will partially fix
that, and (2) you'll probably be having a human look at this anyway,
won't you? And (3) learn Perl.

> Now, this is pretty simple stuff (for computer oriented people), it must be,
> and so should the answer. What I would expect would be along the lines of :
> run such command : command www.google.com www.nasa.org > output.txt
> then search through output.txt for www.nasa.org, the line it is on is its
> position.

   Two words: Shell script. I'm no Linux expert; I'll leave that to
someone else too.
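
For what it's worth, such a script might look roughly like the sketch
below. It is untested against the live site and rests on the same
assumption as the one-liner: that every result line starts with
'<br><font '. The function name first_position is invented for
illustration. It reads a results page on standard input and prints the
position of the first result mentioning the given site, or nothing at all
if the site does not appear.

```shell
# Hypothetical helper: position of the first result line mentioning $1.
# Reads the raw results page on stdin; prints a 1-based number or nothing.
# Assumption: result lines start with '<br><font ' (undocumented, may change).
first_position() {
    grep '^<br><font ' |        # keep only the result lines
        sed 's/<[^>]*>//g' |    # drop HTML tags such as <b>...</b>
        grep -nF -- "$1" |      # number the lines that mention the site
        head -n 1 |             # first hit only
        cut -d: -f1             # keep just the line number
}
```

Usage would be something like:
lynx -source 'http://www.google.com/search?q=saturn&num=100' | first_position www.nasa.org > output.txt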

HTH,
-Arthur


