Re: How to extract links
From: Aleksander Strączek (olek_at_plusnet.pl)
Date: 01/30/05
- Next message: Bjorn Abelli: "Re: Java and SOAP"
- Previous message: hilz: "Re: How to extract links"
- In reply to: hilz: "Re: How to extract links"
- Next in thread: Andrey Kuznetsov: "Re: How to extract links"
- Reply: Andrey Kuznetsov: "Re: How to extract links"
Date: Sun, 30 Jan 2005 03:08:27 +0000 (UTC)
In article <UNGdnXQS1YUwomHcRVn-iA@comcast.com>, hilz wrote:
>> Hi,
>>
>> I am trying to extract all the hyperlinks in a Google result page to a
>> file, using Java.
>>
>> I.e., when I search for, say, "JAVA" in Google, I have to capture all the
>> resulting links for "JAVA" into a file, using Java.
>>
>> Can anyone help me with this, and tell me how to start and how to do
>> the extraction?
>>
>> This is very important to me.
>> Your help is highly appreciated.
>> Thanks in advance.
>
> To start, look at the java.net package, specifically the URL class and the
> URLConnection class. These will help you connect to the URL and get the
> text of the page.
> You will probably also need the java.util.regex package to find the links
> in that page.
>
> If you show your code, and have more specific questions, you will get better
> answers.
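The regex step suggested above can be tried offline, on HTML text that is already in a String. This is only a minimal sketch: the class name HrefRegexSketch, the method extractHrefs and the sample HTML are made up for illustration, and the pattern only handles quoted href values inside <a> tags, so it is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HrefRegexSketch {

    // Matches the quoted href value of an <a ...> tag, case-insensitively.
    // Group 1 is the quote character, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values in the order they appear in the HTML text.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input invented for the demonstration.
        String html = "<p><a href=\"http://java.sun.com/\">Java</a> "
                + "<A class='x' HREF='/search?q=JAVA&start=10'>Next</A></p>";
        for (String link : extractHrefs(html)) {
            System.out.println(link);
        }
    }
}
```

Unquoted attribute values and links built by scripts are missed by this pattern, which is one reason a real HTML parser is usually the safer choice.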
The URL class from the java.net package doesn't work for me (403 - Google protection).
I suggest using HttpClient to get the results from Google,
then HtmlUnit to extract the links easily (http://htmlunit.sourceforge.net).
Here is a working sample
(to run it, see http://htmlunit.sourceforge.net/dependencies.html):
import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.util.Iterator;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkExtract {

    private static final String FILE_NAME = "results.txt";
    private static final String QUERY = "JAVA";

    public static void main(String[] args) {
        PrintWriter printWriter = null;
        try {
            // Fetch the result page through HtmlUnit.
            WebClient wc = new WebClient();
            URL url = new URL("http://www.google.com/search?q=" + QUERY);
            HtmlPage page = (HtmlPage) wc.getPage(url);

            // Write one link per line, skipping Google's own links.
            printWriter = new PrintWriter(new FileWriter(FILE_NAME));
            List anchors = page.getAnchors();
            for (Iterator iter = anchors.iterator(); iter.hasNext();) {
                HtmlAnchor anchor = (HtmlAnchor) iter.next();
                if (isSkipLink(anchor)) {
                    continue;
                }
                printWriter.println(anchor.getHrefAttribute());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (printWriter != null) {
                printWriter.close();
            }
        }
    }

    /**
     * Decides whether a link should be skipped: site-relative links
     * (starting with "/") and links to Google's cached copies.
     *
     * @param anchor
     *            the link to check
     * @return true if the link is to be omitted, false if it is to be processed
     */
    private static boolean isSkipLink(HtmlAnchor anchor) {
        return anchor.getHrefAttribute().startsWith("/")
                || anchor.getHrefAttribute().indexOf("/search?q=cache:") > 0;
    }
}
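One caveat about the sample: QUERY is appended to the URL as-is, which is fine for "JAVA" but not for a query containing spaces or characters such as "&". A minimal sketch of encoding the query first, using java.net.URLEncoder from the standard library; the class name SearchUrlSketch and the method buildSearchUrl are made up for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the sample above.
class SearchUrlSketch {

    // Builds the search URL with the query form-encoded as UTF-8.
    static String buildSearchUrl(String query) {
        try {
            return "http://www.google.com/search?q="
                    + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("JAVA"));
        System.out.println(buildSearchUrl("java & soap"));
    }
}
```

The returned string could then be passed to new URL(...) in place of the plain concatenation.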
-- HTH, Olek