Re: Question concerning this list
- From: Marc 'BlackJack' Rintsch <bj_666@xxxxxxx>
- Date: Sun, 31 Dec 2006 11:30:05 +0100
In <mailman.2166.1167535289.32031.python-list@xxxxxxxxxx>, Thomas Ploch
wrote:
> Alright, my prof said '... to process documents written in structural
> markup languages using regular expressions is a no-no.' (Because of
> nested Elements? Can't remember) So I think he wants us to use regexes
> to learn them. He is pointing to HTMLParser though.
The problem is that much of the HTML in the wild is nominally written in
a structured markup language but is in many cases broken. If you just
search for some words or patterns that appear somewhere in the documents,
then regular expressions are good enough. If you want to actually *parse*
HTML "from the wild", better use the BeautifulSoup_ parser.
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
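A rough sketch of the difference between searching and parsing, not a
definitive implementation. It assumes a modern Python 3, where the
HTMLParser class mentioned above lives in `html.parser`, and it uses that
standard-library parser instead of BeautifulSoup only so the example stays
self-contained. The sample markup, the e-mail pattern, and the
`LinkCollector` class are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately broken markup: unclosed <p> and <b>,
# an unquoted attribute value, and no closing </body> or </html>.
BROKEN_HTML = (
    "<html><body><p>Contact: <b>spam@example.com"
    "<p><a href=/next.html>next</a>"
)

# Searching for a pattern somewhere in the text: a regex is good enough.
# (Simplified e-mail pattern, assumed for this example only.)
emails = re.findall(r"[\w.+-]+@[\w.-]+\.\w+", BROKEN_HTML)


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags by reacting to parser events."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(BROKEN_HTML)
collector.close()

print(emails)
print(collector.links)
```

The regex never needs to know about tags at all, while the link
extraction relies on the parser to make sense of the attribute syntax.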
> You are probably right. For me it boils down to these problems:
> - Implementing a stack for large queues of documents which is faster
> than list.pop(index) (Is there a lib for this?)
If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
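
A minimal sketch of the `collections.deque` suggestion, assuming the
crawler keeps a first-in, first-out list of URLs still to fetch. The URLs
and the name `pending` are invented for the example.

```python
from collections import deque

# Hypothetical queue of documents that still have to be fetched.
pending = deque(["http://example.com/a", "http://example.com/b"])

# New work goes on the right end ...
pending.append("http://example.com/c")

# ... and the oldest entry comes off the left end.
first = pending.popleft()

print(first)
print(list(pending))
```

`list.pop(0)` has to shift every remaining element, so it takes time
proportional to the length of the list, whereas `deque.popleft()` takes
constant time. If several threads share the queue, the `Queue` module
(named `queue` in Python 3) adds the necessary locking.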
Ciao,
Marc 'BlackJack' Rintsch