Re: command-line search engine query
From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 07/23/04
Date: Fri, 23 Jul 2004 10:15:57 -0400 (EDT)
On Fri, 23 Jul 2004, igthibau wrote:
>
> Ok, I wish to know, without GUI (under linux), how to query a given search
> engine to know the position of a web site.
> Example :
> query google for www.nasa.org -> reply 4
> so google has www.nasa.org as its 4th entry
You realize, of course, that Google doesn't "rank" sites this way.
You have to actually enter a search term or phrase in the box, and then
Google will give you a list of sites containing that term. You can't
"query Google for www.nasa.org" (unless I've been missing something!).
But I'll assume Jens is right and you meant to query for, let's say,
"saturn," and see where NASA came up in the results.
> Also, no point sending pieces of code since these are only useful in
> a very specific environment (yours).
Well, if you don't want code, and you don't want to learn network
programming yourself, we're kind of in a bind, aren't we? :) But
luckily we /know/ what platform you use: Linux. So you can use any
source code that's guaranteed to work on Linux. For example, this
ksh one-liner:
% lynx -source 'http://www.google.com/search?q=saturn&num=100' |
    grep '^<br><font ' | grep -n 'nasa' | pr -W 72 -Tt
5:<br><font color=#008000>www.<b>saturn</b>.de/ - 3k - </font><nobr> <
6:<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/home/index.cfm - 4
7:<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/index.cfm - 43k -
16:<br><font color=#008000>www.dpo.uab.edu/~moudry/ - 8k - </font><nobr
17:<br><font color=#008000>pds.jpl.nasa.gov/planets/welcome/mars.htm -
18:<br><font color=#008000>pds.jpl.nasa.gov/planets/choices/<b>saturn</b
45:<br><font color=#008000>www.esa.int/SPECIALS/Cassini-Huygens/ - 42k
46:<br><font color=#008000>www.nasa.gov/mission_pages/cassini/main/ - 4
47:<br><font color=#008000>nssdc.gsfc.nasa.gov/photo_gallery/ photogalle
48:<br><font color=#008000>nssdc.gsfc.nasa.gov/planetary/planets/<b>satu
51:<br><font color=#008000>www.kuro5hin.org/story/2004/7/1/93459/66714 -
52:<br><font color=#008000>ringmaster.arc.nasa.gov/<b>saturn</b>/<b>satu
59:<br><font color=#008000>www.mindspring.com/~dhanon/home.htm - 3k - <
60:<br><font color=#008000>science.nasa.gov/headlines/y2004/09jul_hailst
61:<br><font color=#008000>science.nasa.gov/headlines/y2002/13dec_<b>sat
62:<br><font color=#008000>starchild.gsfc.nasa.gov/docs/ StarChild/solar
71:<br><font color=#008000>www.cosmicelk.co.uk/Saturn.htm - 1k - Jul 2
%
I'll break it down for you.
lynx -source 'http://www.google.com/search?q=saturn&num=100'
This fetches the Google results page for a query on "saturn". Replace
"saturn" with your own search phrase, joining words with plusses for an
ordinary search or with periods for a quoted phrase: searching on foo
would be "q=foo", searching on foo bar would be "q=foo+bar", and
searching on the exact phrase "foo bar" would be "q=foo.bar". The
"num=100" parameter asks for the top 100 results.
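If you'd rather not edit the URL by hand, the "q=foo+bar" convention can be sketched as a tiny shell function. The name build_query_url is made up for illustration, and the sketch assumes the phrase holds plain words separated by single spaces, with nothing that would need percent-encoding:

```shell
# Hypothetical helper: turn a search phrase into a query URL of the
# form used above. Assumes plain words separated by single spaces.
build_query_url() {
    phrase=$1
    count=${2:-100}    # number of results to ask for; defaults to 100
    # Replace each space with '+', as in the "q=foo+bar" example.
    terms=$(printf '%s' "$phrase" | tr ' ' '+')
    printf 'http://www.google.com/search?q=%s&num=%s\n' "$terms" "$count"
}

build_query_url 'foo bar'
# prints http://www.google.com/search?q=foo+bar&num=100
```

The result can be handed straight to lynx -source in place of the literal URL.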
| grep '^<br><font '
This looks for a special combination of tags that we're going to
assume Google prepends to all search results. This is ad-hoc and
not guaranteed to work (it's certainly not documented anywhere!),
but it works this week. ;) Might require some hacking in a month
or two, but oh well. If you want robustness, you're just going to
have to use an official Google API, or learn Perl.
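Since the tag pattern is only a guess, it may be worth checking that it still matches anything before trusting the line numbers. Here's a rough sketch that works on a saved copy of the results page; the function name count_result_lines and the wording of the warning are my own inventions:

```shell
# Hypothetical helper: count the lines in a saved results page that
# match the '^<br><font ' pattern, and complain if there are none.
count_result_lines() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # hence the "|| true".
    count=$(grep -c '^<br><font ' "$1" || true)
    if [ "${count:-0}" -eq 0 ]; then
        echo "no result lines found; the page markup has probably changed" >&2
        return 1
    fi
    echo "$count"
}
```

Save the page first (lynx -source '...' > output.txt) and run count_result_lines output.txt; a non-zero exit status means the pattern needs that hacking.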
| grep -n 'nasa'
This takes all those search result lines and finds the ones containing
the word 'nasa' (in your case, perhaps 'www.nasa.org' would do better;
plain 'nasa' also matches results from e.g. JPL). It returns those lines,
and also prepends their line numbers (1 for the first line of the
Google results, 2 for the second, and so on). NB: If a part of the
'nasa' string is the same as or similar to a part of the 'saturn'
string, beware! You'll need to account for the fact that Google
marks up parts of its output like this: 'www.<b>saturn</b>.de'. I'll
leave that to the regex gurus.
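One crude way around the markup problem, short of real regex wizardry, is to strip everything that looks like a tag before the second grep. This is only a sketch: strip_tags is a made-up name, and the sed expression assumes no tag is split across lines and no stray '<' or '>' appears in the text:

```shell
# Hypothetical helper: delete anything that looks like an HTML tag.
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Two sample lines in the format shown in the output above.
printf '%s\n' \
    '<br><font color=#008000>www.<b>saturn</b>.de/ - 3k - </font>' \
    '<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/home/index.cfm - 4k' |
    strip_tags | grep -n 'saturn\.jpl'
# prints 2:saturn.jpl.nasa.gov/home/index.cfm - 4k
```

Without strip_tags the same grep finds nothing, because the raw line reads '<b>saturn</b>.jpl' rather than 'saturn.jpl'.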
| pr -W 72 -Tt
This takes all /those/ lines and truncates them to 72 characters,
making the output easier to read on a terminal display. Change the
'72' to something bigger if you like.
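If your pr doesn't accept those options, cut can do a similar truncation. Note this swaps in a different tool from the one in the pipeline above; truncate_lines is a made-up name, and the sketch assumes plain ASCII input:

```shell
# Hypothetical helper: keep only the first N characters of each line
# (72 by default). Assumes single-byte (ASCII) input.
truncate_lines() {
    cut -c "1-${1:-72}"
}

printf '%s\n' 'abcdefghij' | truncate_lines 5
# prints abcde
```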
Now, you'll notice that this still returns a bunch of false positives,
but (1) using 'www.nasa.org' instead of just 'nasa' will partially fix
that, and (2) you'll probably be having a human look at this anyway,
won't you? And (3) learn Perl.
> Now, this is pretty simple stuff (for computer oriented people), it must be,
> and so should the answer. What I would expect would be along the lines of :
> run such command : command www.google.com www.nasa.org > output.txt
> then search thought output.txt for www.nasa.org, the line it on is its
> position.
Two words: Shell script. I'm no Linux expert; I'll leave that to
someone else too.
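That said, here's roughly what such a script might look like, as a sketch only. It assumes the results page has already been saved to a file, relies on the same undocumented '^<br><font ' pattern as the one-liner, and uses a function name (site_position) I made up:

```shell
#!/bin/sh
# Hypothetical helper: print the 1-based position of the first result
# line in a saved results page that mentions the given site.
# Prints nothing if the site does not appear.
site_position() {
    file=$1
    site=$2
    grep '^<br><font ' "$file" |     # keep only the result lines
        sed 's/<[^>]*>//g' |         # drop the <b>...</b> markup
        grep -n -F -- "$site" |      # number the lines that mention the site
        head -n 1 |                  # first match only
        cut -d: -f1                  # keep just the line number
}

# Example use (the fetch needs network access, so it is left commented out):
#   lynx -source 'http://www.google.com/search?q=saturn&num=100' > output.txt
#   site_position output.txt www.nasa.org
```

The -F makes grep treat the site name as a literal string, so the dots in 'www.nasa.org' don't act as regex wildcards.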
HTH,
-Arthur