Re: Question concerning this list [WebCrawler]
- From: Marc 'BlackJack' Rintsch <bj_666@xxxxxxx>
- Date: Sun, 31 Dec 2006 13:54:37 +0100
In <mailman.2169.1167563637.32031.python-list@xxxxxxxxxx>, Thomas Ploch
wrote:
This is what my regexes look like:
import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)
Have you tested this with tags inside comments?
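To make the concern concrete, here is a minimal sketch that runs the two patterns above over a snippet containing a commented-out link. The sample HTML, the `hrefs` helper and the `COMMENT` pattern are assumptions made up for illustration; they are not part of the original class. Removing comments with a regex is itself only an approximation, and a real parser (`html.parser` in Python 3, `HTMLParser` in Python 2) is likely the more robust choice.

```python
import re

# Same patterns as in the quoted class, written as module-level constants.
NAME = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME)
ATTR = re.compile(
    r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')" % NAME)

# Hypothetical helper pattern: a simple, non-greedy match for HTML comments.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Made-up sample document with one link hidden inside a comment.
html = ('<p>text</p><!-- <a href="old.html">old</a> -->'
        '<a href="new.html">new</a>')


def hrefs(text):
    """Collect href values from every tag the tag pattern finds."""
    found = []
    for tag in TAG.finditer(text):
        for attr in ATTR.finditer(tag.group("attr")):
            if attr.group("attrName") == "href":
                # Drop the surrounding quotes captured by the value group.
                found.append(attr.group("value").strip("\"'"))
    return found


# The tag pattern knows nothing about comments, so the hidden link shows up.
print(hrefs(html))

# Stripping comments first keeps the crawler away from the dead link.
print(hrefs(COMMENT.sub("", html)))
```

The first call reports both `old.html` and `new.html`; after the comments are stripped only `new.html` remains.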
You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)
If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
Which of the two would you recommend for handling large queues with fast
response times?
`Queue.Queue` builds on `collections.deque` and is thread-safe. Speed-wise
I don't think it makes a difference, as most of the time is spent on I/O
and parsing. So if you make your spider multi-threaded to gain some speed,
go with `Queue.Queue`.
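
A rough sketch of both options, written for Python 3, where the `Queue` module is named `queue`. The URLs, the `worker` and `crawl` functions and the `len(url)` stand-in for fetching and parsing are placeholders invented for illustration, not code from this thread.

```python
import queue
import threading
from collections import deque

# Single-threaded frontier: deque appends and pops at either end in O(1),
# whereas list.pop(0) has to shift every remaining element.
frontier = deque(["http://example.com/a", "http://example.com/b"])
frontier.append("http://example.com/c")
first = frontier.popleft()  # oldest URL comes out first (FIFO)


def worker(tasks, results):
    """Take URLs off the shared queue until a None sentinel arrives."""
    while True:
        url = tasks.get()
        if url is None:
            tasks.task_done()
            break
        # Placeholder for the real fetch-and-parse step.
        results.put((url, len(url)))
        tasks.task_done()


def crawl(urls, num_workers=2):
    """Process urls with a small pool of threads sharing one queue.Queue."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()         # block until every item is marked done
    for t in threads:
        t.join()
    # All workers have exited, so draining without locking is safe here.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

The queue does the locking, so the workers need no explicit synchronisation of their own. In a single-threaded spider the plain `deque` should be enough.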
Ciao,
Marc 'BlackJack' Rintsch