Re: How to extract links

From: Aleksander Strączek (olek_at_plusnet.pl)
Date: 01/30/05


Date: Sun, 30 Jan 2005 03:08:27 +0000 (UTC)

In article <UNGdnXQS1YUwomHcRVn-iA@comcast.com>, hilz wrote:

>> Hi,
>>
>> I am trying to extract all the hyperlinks in a Google result page to a
>> file, using Java.
>>
>> I.e., when I search for, say, "JAVA" in Google, I have to capture all the
>> resulting links for "JAVA" into a file, using Java.
>>
>> Can anyone help me with this, and tell me how to start and how to do
>> the extraction?
>>
>> This is very important to me.
>> Your help is highly appreciated.
>> Thanks in advance.
>
> To start, look at the java.net package, specifically the URL class and the
> URLConnection class. These will help you connect to the URL and get the
> text of the page.
> You will probably also need the java.util.regex package to find the links
> in that page.
>
> If you show your code, and have more specific questions, you will get better
> answers.
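
As a rough illustration of the java.util.regex suggestion quoted above, here is a minimal sketch. It assumes the HTML of the page is already in a String; the class name RegexLinkSketch, the method extractLinks and the pattern itself are my own illustrative choices. The pattern only handles quoted href values, so treat it as a starting point and not as a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class RegexLinkSketch {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Simplification: unquoted attribute values are not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values of all anchors found in the given HTML text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch.
        String html = "<p><a href=\"http://java.sun.com/\">Java</a> "
                + "<A class='x' HREF='/search?q=JAVA&start=10'>Next</A></p>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Fetching the page text is a separate step and is not shown here.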

The URL class from the java.net package doesn't work for me (Google answers
with 403, its protection against automated requests).
I suggest using HttpClient to get the results from Google,
then HtmlUnit to easily extract the links (http://htmlunit.sourceforge.net).

Here is a working sample
(to run it, see http://htmlunit.sourceforge.net/dependencies.html for the
required libraries):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.util.Iterator;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkExtract {

    private static final String FILE_NAME = "results.txt";

    private static final String QUERY = "JAVA";

    public static void main(String[] args) {

        PrintWriter printWriter = null;
        try {
            WebClient wc = new WebClient();
            URL url = new URL("http://www.google.com/search?q=" + QUERY);
            HtmlPage page = (HtmlPage) wc.getPage(url);
            printWriter = new PrintWriter(new FileWriter(FILE_NAME));
            List anchors = page.getAnchors();
            for (Iterator iter = anchors.iterator(); iter.hasNext();) {
                HtmlAnchor anchor = (HtmlAnchor) iter.next();
                if (isSkipLink(anchor)) {
                    continue;
                }
                printWriter.println(anchor.getHrefAttribute());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (printWriter != null) {
                printWriter.close();
            }
        }
    }

    /**
     * Decides whether a link should be skipped.
     *
     * @param anchor the link to check
     * @return true if the link is to be omitted (a site-relative Google link
     *         or a link to Google's cache), false if it is to be processed
     */
    private static boolean isSkipLink(HtmlAnchor anchor) {
        String href = anchor.getHrefAttribute();
        // indexOf returns -1 when the substring is absent, so test with >= 0
        return href.startsWith("/")
                || href.indexOf("/search?q=cache:") >= 0;
    }
}
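
One thing the sample glosses over: QUERY is appended to the URL as-is, which is fine for "JAVA" but not for queries containing spaces or characters such as "&". A minimal sketch of encoding the query first, using java.net.URLEncoder from the standard library (the class name SearchUrlSketch and the method buildSearchUrl are just names I made up for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of HtmlUnit.
class SearchUrlSketch {

    // Builds the search URL used in the sample above, with the query encoded
    // so that spaces and reserved characters survive the trip.
    static String buildSearchUrl(String query) {
        try {
            return "http://www.google.com/search?q="
                    + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("JAVA"));
        // prints http://www.google.com/search?q=java+link+extraction+%26+parsing
        System.out.println(buildSearchUrl("java link extraction & parsing"));
    }
}
```

The resulting string can then be passed to new URL(...) exactly as in the sample.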

-- 
HTH, Olek

