Re: Question concerning this list



In <mailman.2166.1167535289.32031.python-list@xxxxxxxxxx>, Thomas Ploch
wrote:

> Alright, my prof said '... to process documents written in structural
> markup languages using regular expressions is a no-no.' (Because of
> nested Elements? Can't remember) So I think he wants us to use regexes
> to learn them. He is pointing to HTMLParser though.

The problem is that much of the HTML in the wild is nominally written in
a structured markup language but is in many cases broken. If you just
search for some words or patterns that appear somewhere in the documents,
then regular expressions are good enough. If you want to actually *parse*
HTML "from the wild", better use the BeautifulSoup_ parser.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
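
A rough sketch of the difference between searching and parsing, assuming
Python 3, where the HTMLParser module lives in `html.parser`. It sticks to
the standard library so it runs without installing BeautifulSoup; the sample
markup and the `LinkCollector` class are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up sample: the <p> and <b> tags are never closed properly.
BROKEN_HTML = (
    '<html><body><p>See <a href="http://example.com/a">first'
    '<p>and <b><a href="/b">second</a></body>'
)

# Searching for a word somewhere in the text: a regex is good enough.
words = re.findall(r"\bsecond\b", BROKEN_HTML)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(BROKEN_HTML)
collector.close()
print(words, collector.links)
```

The regex only tells you that the word occurs somewhere; the parser hands you
tags and attributes as separate events, which is what you want once the
structure of the document matters.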

> You are probably right. For me it boils down to these problems:
> - Implementing a stack for large queues of documents which is faster
> than list.pop(index) (Is there a lib for this?)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
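
A minimal sketch with `collections.deque`, assuming the documents are
identified by plain strings (the names below are made up). Removing from the
front of a list with `list.pop(0)` has to shift every remaining element,
while a deque can append and pop at both ends in constant time.

```python
from collections import deque

# Hypothetical document names standing in for the real work queue.
pending = deque(["doc1.html", "doc2.html", "doc3.html"])

pending.append("doc4.html")  # enqueue at the right end
first = pending.popleft()    # dequeue from the left end (queue behaviour)
last = pending.pop()         # pop from the right end (stack behaviour)

print(first, last, list(pending))
```

If several threads feed or consume the documents, the `Queue` module (named
`queue` in Python 3) adds the locking on top of the same idea.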

Ciao,
Marc 'BlackJack' Rintsch


