URLConnection and Cookies (googled already but still cannot solve)?

From: Kaidi (kaidizhao_at_yahoo.com.sg)
Date: 01/16/04


Date: 16 Jan 2004 01:43:34 -0800

Hi,
(I googled this topic but still cannot solve my problem. :-( )

My problem is basically this:
I am writing a crawler in Java, and some sites use cookies. Since
Java does not handle cookies automatically, I cannot access
some pages.
I read some articles, such as:
http://martin.nobilitas.com/java/cookies.html
http://www.informit.com/isapi/product_id~%7B1DF8B22B-055F-48DB-BD36-20B8017E9956%7D/content/index.asp
From these I understand that what we need to do is read the Set-Cookie
header from the response, then send the cookie back in a Cookie header
on later requests when needed.
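
A minimal sketch of that round trip, assuming only the name=value part of each Set-Cookie value matters and ignoring attributes such as domain, path, and expiry. The class name CookieJarSketch and its methods are hypothetical and are not taken from the linked articles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it ignores domain, path and expiry.
class CookieJarSketch {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    /** Remembers the name=value pair from each Set-Cookie header value. */
    void store(List<String> setCookieValues) {
        for (String header : setCookieValues) {
            // Everything after the first ';' is attributes (Path, Expires, ...).
            String pair = header.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
    }

    /** Builds the value for the request's Cookie header, or "" if nothing is stored. */
    String toCookieHeader() {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            parts.add(e.getKey() + "=" + e.getValue());
        }
        return String.join("; ", parts);
    }
}
```

With a URLConnection, the stored value would be passed to setRequestProperty("Cookie", ...) before the connection is opened, and the Set-Cookie values would come from the response headers of the previous request.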

However, when I tested this on bestbuy's home page, it did not seem to
work well.
Some pages do not seem to ask the client to store cookies, yet they
cannot be accessed without a cookie. One example is:
http://www.bestbuy.com/site/olspage.jsp?j=1&id=cat12074&type=page&categoryRep=cat02000

When I try to crawl this page with my Java program, it only returns a
page saying that my browser does not support cookies. :-(
(IE can access it properly. I deleted the cookies in IE's options
before trying the above page, and it still works.)

Does anyone have any idea about this? Thanks a lot.
PS: the code I am using is from the end of this page:
http://www.hccp.org/java-net-cookie-how-to.html
http://www.hccp.org/cvs/org/hccp/net/CookieManager.java
In the above code, I added a print line in storeCookies so that I can
see all the headers:
........
        for (int i=1; (headerName = conn.getHeaderFieldKey(i)) != null; i++) {
          System.out.println("In storeCookies, "+headerName+"-->"+conn.getHeaderField(i));
........
The only headers I can see are:

In storeCookies, Server-->Apache
In storeCookies, Last-Modified-->Mon, 24 Nov 2003 15:19:52 GMT
In storeCookies, ETag-->"b0da7d-14ee-3fc22198"
In storeCookies, Accept-Ranges-->bytes
In storeCookies, Content-Length-->5358
In storeCookies, Content-Type-->text/html
In storeCookies, Date-->Fri, 16 Jan 2004 09:37:10 GMT
In storeCookies, Connection-->keep-alive
{bestbuy.com={}}

So, since the response does not set any cookies, why can't my Java
program crawl the page?
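
One thing that may be worth checking, as an assumption that the header dump above neither confirms nor rules out: HttpURLConnection follows HTTP redirects by default, so if the server answered the first request with a redirect carrying a Set-Cookie header, only the headers of the final response would be visible. The sketch below turns automatic redirects off and collects Set-Cookie values at every hop. The class and method names are hypothetical, and it assumes an http or https URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper; it only observes cookies, it does not send them back.
class RedirectCookieSketch {
    /** Returns all Set-Cookie values in a header map, matching the name case-insensitively. */
    static List<String> setCookieValues(Map<String, List<String>> headers) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            // The status line is stored under a null key, so skip it.
            if (e.getKey() != null && e.getKey().equalsIgnoreCase("Set-Cookie")) {
                out.addAll(e.getValue());
            }
        }
        return out;
    }

    /** Follows up to maxHops redirects by hand, collecting Set-Cookie values on the way. */
    static List<String> cookiesAlongRedirects(String start, int maxHops) throws IOException {
        List<String> seen = new ArrayList<>();
        URL url = new URL(start);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false); // keep each response visible
            int status = conn.getResponseCode();
            seen.addAll(setCookieValues(conn.getHeaderFields()));
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (status < 300 || status >= 400 || location == null) {
                break; // not a redirect, so this was the final response
            }
            url = new URL(url, location); // resolves relative Location values
        }
        return seen;
    }
}
```

If the returned list is empty for the page in question, redirects are not the explanation and the cause lies elsewhere.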

For page crawling, I am using this code:
--------------
    try {
      // try opening the URL
      URL url = new URL(url_string);
      URLConnection urlConnection = url.openConnection();
      urlConnection.setAllowUserInteraction(false);
      InputStream urlStream = url.openStream();
      // search the input stream for links
      // first, read in the entire URL
      byte b[] = new byte[1000];
      int numRead = urlStream.read(b);
      String content;
      if (numRead > 0)
        content = new String(b, 0, numRead);
      else
        content = new String("");
      // String content = new String(b, 0, numRead);
      while ((numRead != -1) && (content.length() < MAXSIZE)) {
        numRead = urlStream.read(b);
        if (numRead != -1) {
          String newContent = new String(b, 0, numRead);
          content += newContent;
        }
      }
      return content;
--------------
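
Note that url.openStream() in the snippet above opens a second, separate connection, so nothing configured on urlConnection (including any Cookie header) is sent with the request that is actually read. A minimal sketch of reading from the same connection the header was set on follows. The names FetchSketch, readAll, and fetch are hypothetical, and decoding as ISO-8859-1 is an assumption; the original code uses the platform default charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class FetchSketch {
    /** Reads at most maxSize bytes from the stream and decodes them as ISO-8859-1. */
    static String readAll(InputStream in, int maxSize) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1000];
        int numRead;
        while (buffer.size() < maxSize && (numRead = in.read(chunk)) != -1) {
            // Never keep more than maxSize bytes in total.
            buffer.write(chunk, 0, Math.min(numRead, maxSize - buffer.size()));
        }
        // Decode once at the end so multi-byte sequences are not split across chunks.
        return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Fetches a page, sending cookieHeader (if any) on the connection that is read. */
    static String fetch(String urlString, String cookieHeader, int maxSize) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        conn.setAllowUserInteraction(false);
        if (cookieHeader != null && !cookieHeader.isEmpty()) {
            conn.setRequestProperty("Cookie", cookieHeader); // must be set before connecting
        }
        try (InputStream in = conn.getInputStream()) { // same connection, not url.openStream()
            return readAll(in, maxSize);
        }
    }
}
```

Accumulating bytes and decoding once also avoids the repeated string concatenation in the loop above.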


