Re: Question concerning this list [WebCrawler]
- From: Thomas Ploch <Thomas.Ploch@xxxxxxx>
- Date: Sun, 31 Dec 2006 12:15:05 +0100
Marc 'BlackJack' Rintsch wrote:
In <mailman.2166.1167535289.32031.python-list@xxxxxxxxxx>, Thomas Ploch
wrote:
Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.' (Because of
nested Elements? Can't remember) So I think he wants us to use regexes
to learn them. He is pointing to HTMLParser though.
The problem is that much of the HTML in the wild is nominally written in
a structured markup language but is in many cases broken. If you just
search for some words or patterns that appear somewhere in the documents,
then regular expressions are good enough. If you want to actually *parse*
HTML "from the wild", better use the BeautifulSoup_ parser.
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
Yes, I know about BeautifulSoup. But as I said, it should be done with
regexes. I want to extract tags and their attributes as a dictionary of
name/value pairs. I know that most of the HTML out there is *not*
validated and is bollocks.
This is what my regexes look like:
import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)
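To make the goal concrete, here is a minimal sketch of how the two patterns above could be combined to get tag names with their attributes as a dictionary. The function name `extract_tags` and the sample string are made up for illustration and are not part of the class above. It only handles opening tags with quoted attribute values, and a `>` inside an attribute value will cut the tag short, which is one of the reasons regexes are fragile for this job.

```python
import re

# Same patterns as in the class above, pulled out as module-level names.
NAME = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME)
ATTR = re.compile(
    r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')" % NAME)


def extract_tags(source):
    """Return a list of (tag name, attribute dict) pairs for opening tags.

    Hypothetical helper: closing tags are skipped because the name
    pattern cannot match the leading slash.
    """
    result = []
    for tag in TAG.finditer(source):
        attrs = {}
        for attr in ATTR.finditer(tag.group("attr")):
            # Drop the surrounding quote characters from the value.
            attrs[attr.group("attrName")] = attr.group("value")[1:-1]
        result.append((tag.group("name"), attrs))
    return result


# Made-up sample input.
html = '<a href="http://example.com/" class=\'ext\'>link</a><br>'
tags = extract_tags(html)
```

Unquoted attribute values (`<td width=100>`) are silently ignored by the attribute pattern, so real-world pages will lose some attributes with this approach.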
You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)
If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
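As a rough sketch of the difference between the two: `collections.deque` gives constant-time appends and pops at both ends, whereas `list.pop(0)` has to shift every remaining element. The `Queue` module (named `queue` in Python 3, which is what the code below assumes) adds locking on top, so it is mainly of interest when several threads share the work list. The URLs are placeholders.

```python
from collections import deque
import queue  # this module is called `Queue` in Python 2

# Single-threaded crawler: a deque used as a FIFO work list.
pending = deque(["http://example.com/a", "http://example.com/b"])
pending.append("http://example.com/c")   # enqueue at the right end
first = pending.popleft()                 # dequeue from the left end, O(1)

# Multi-threaded crawler: a thread-safe FIFO queue.
shared = queue.Queue()
for url in pending:
    shared.put(url)
second = shared.get()    # blocks until an item is available
shared.task_done()       # tell the queue this item has been processed
```

With a single thread the plain deque avoids the locking overhead; with worker threads the blocking `get()` of the queue class saves writing that synchronisation by hand.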
Which of the two would you recommend for handling large queues with fast
response times?
Thomas