url question - extracting (2 types of) domains



Hi,
I'm trying to extract the domain name from a URL. Let's say I call
the two forms full_domain and significant_domain (which is the homepage domain).

E.g.: url=http://en.wikipedia.org/wiki/IPod,
full_domain=en.wikipedia.org, significant_domain=wikipedia.org

Using urlsplit (from the urlparse module), I can get the
full_domain, but I'm wondering how to get the significant_domain. I
can't rely on something like counting the number of dots, etc.
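
For reference, the full_domain step could look roughly like the sketch below. It assumes Python 3, where urlsplit lives in urllib.parse (in Python 2 it is in the urlparse module); the helper name full_domain is only illustrative.

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def full_domain(url):
    """Return the lowercased host of url, without port or credentials."""
    # .hostname is None when the URL has no network location part.
    return urlsplit(url).hostname or ""


print(full_domain("http://en.wikipedia.org/wiki/IPod"))
```

Using .hostname rather than .netloc drops any port or user:password part, so only the host name itself is left.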

Some domains may be like foo.bar.co.in (where significant_domain=
bar.co.in).
I have a list of around 40M URLs. It's OK if I get it wrong in a few (< 1%) cases,
although I agree that it's not clear how to measure this error rate;
maybe just based on intuition.

Does anybody have pointers to existing URL parsers in Python that do this?
Searching online didn't turn up much other than
the urlparse/urllib modules.

The worst case is to build a table of domain
categories (like .com, .co.il, etc.), look for one of them as the suffix rather
than counting dots, and then extract the suffix together with the label just before it,
but I'm afraid that if I do this, I might miss some domain category.
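
A minimal sketch of that table-lookup idea, to make it concrete. The table KNOWN_SUFFIXES and the function name significant_domain are hypothetical, and the table is deliberately tiny and incomplete; any host whose suffix is missing from it falls through as None.

```python
# Hypothetical, incomplete suffix table, for illustration only.
KNOWN_SUFFIXES = {"com", "org", "net", "in", "il", "co.in", "co.il", "co.uk"}


def significant_domain(full_domain):
    """Return the longest known suffix plus the one label before it."""
    labels = full_domain.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.in" is matched before the shorter "in".
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[i - 1:])
    return None  # suffix not in the table


print(significant_domain("en.wikipedia.org"))
print(significant_domain("foo.bar.co.in"))
```

Each lookup is a handful of set membership tests, so the cost over a 40M list should be dominated by reading the URLs. Counting the None results gives a rough lower bound on how many hosts the table fails to cover.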
