Is there a module to spider Google usenet archives yet?

From: ~greg (g_m_at_(remove_to_reply)comcast.net)
Date: 02/12/04

    Date: Wed, 11 Feb 2004 23:47:47 -0500
    
    

    Hello.
    Is there a module to spider Google usenet archives yet?

    I believe that subject line question
    has been asked before, and answered
    in the negative, which is why I say "yet".

    I just want to get to all the posts to a
    particular usenet group, in their original
    usenet formats. For personal use only.
    And respecting all robot rules -if possible ;)
    In any case, there's no hurry. Just a few
    hundred per day with plenty of delays
    between requests is fine with me, if that's
    how they (the Google team) prefer it be
    done.
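
    A minimal sketch, in Python, of the two constraints above: checking
    a URL against robots rules and spacing requests evenly through the
    day. The helper names, the user-agent string and the sample rules
    are assumptions made for illustration; nothing here is fetched
    from the network.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the lines of an already-downloaded robots.txt
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def seconds_between_requests(requests_per_day):
    """Spread a daily request budget evenly over 24 hours."""
    return 86400 / requests_per_day


# Hypothetical rules and bot name, for illustration only
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(SAMPLE_RULES, "my-archiver", "http://example.com/private/x"))
print(seconds_between_requests(300))
```

    A real harvester would sleep for roughly that many seconds between
    requests, and skip any URL the rules disallow.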

    I routinely write hacks to harvest sets of pages.
    However, my knowledge of true spidering
    is very limited.

    The only way I can think of doing it manually
    is like this:

       Starting from Google Advanced Groups Search,
    set the number of messages to 100 and sort by date.

    Name the newsgroup.

    Then set reasonable date limits.
    (The date limits will be incremented in
    overlapping steps, and the redundant links
    returned will later be eliminated by script.)
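
    The overlapping date steps and the later elimination of redundant
    links could be sketched in Python roughly as follows. The function
    names, the 7-day step and the 1-day overlap are assumptions chosen
    for illustration, not anything the search form prescribes.

```python
from datetime import date, timedelta


def overlapping_windows(start, end, step_days=7, overlap_days=1):
    """Yield (first_day, last_day) pairs covering start..end inclusive.

    Each window shares overlap_days days with the one before it.
    """
    if not 0 <= overlap_days < step_days:
        raise ValueError("overlap must be non-negative and smaller than the step")
    current = start
    while True:
        last = min(current + timedelta(days=step_days - 1), end)
        yield current, last
        if last >= end:
            break
        # Step back so the next window re-covers the overlap
        current = last - timedelta(days=overlap_days - 1)


def dedupe(links):
    """Drop repeated links while keeping first-seen order."""
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique


for first, last in overlapping_windows(date(2004, 1, 1), date(2004, 1, 15)):
    print(first, last)
```

    Links collected from every window would be concatenated and passed
    through dedupe() once at the end.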

    The first returned page seems to list
    just thread-starting posts.
    (And even to get all of them you need to
    follow the "Next # threads" links.)

    Clicking on a thread-starting post's
    link has different consequences,
    depending on whether there is one
    or more than one post in the thread.

    If there is more than one post in the thread,
    then the link leads to a frameset, with a
    tree structure on the left listing the posts
    in the thread, and one particular post,
    in html, on the right.

    From the html of the post you next have
    to click "View This Article Only".

    And then, finally, you click on the
    "Original Format" link to get to the
    usenet format, which is the objective.
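
    The "click the link labelled X" steps above come down to finding an
    anchor by its visible text and following its href. A minimal Python
    sketch using only the standard library is shown here; the class
    name, the helper name and the sample markup are hypothetical, and
    the real pages may be structured differently.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> whose visible text equals wanted_text."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted = " ".join(wanted_text.split()).lower()
        self.href = None      # href of the <a> currently open, if any
        self._text = []       # text fragments seen inside that <a>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self.href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            # Normalise whitespace and case before comparing
            text = " ".join("".join(self._text).split()).lower()
            if text == self.wanted:
                self.matches.append(self.href)
            self.href = None


def find_link(html, link_text):
    """Return the first href whose anchor text matches link_text, else None."""
    finder = LinkTextFinder(link_text)
    finder.feed(html)
    finder.close()
    return finder.matches[0] if finder.matches else None


# Made-up markup standing in for a fetched page
SAMPLE = ('<p><a href="/post/42">View This Article Only</a> '
          '<a href="/post/42/raw">Original Format</a></p>')
print(find_link(SAMPLE, "Original Format"))
```

    Applied in sequence to each fetched page ("View This Article Only",
    then "Original Format"), this would walk the same chain of clicks.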

    The question is: how can all that be automated?
    Or is there a better way? Or is there a module
    (yet) that does it?

    (Incidentally, I was one of the -probably many
    -who wrote to Google requesting that they
    provide access to the original usenet format,
    at a time when they didn't. They eventually
    wrote back to me: "Thanks for the suggestion!
    We really appreciate thoughtful feedback
    from our users, and we'll keep it in mind
    as we grow and evolve. Regards,
    The Google Team". And the access was
    provided very shortly after that!)

    ~Greg.

