Re: WebCrawler (was: 'Question concerning this list')
- From: Thomas Ploch <Thomas.Ploch@xxxxxxx>
- Date: Sun, 31 Dec 2006 14:30:58 +0100
Marc 'BlackJack' Rintsch wrote:
In <mailman.2169.1167563637.32031.python-list@xxxxxxxxxx>, Thomas Ploch wrote:
This is what my regexes look like:
import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)
Have you tested this with tags inside comments?
No, but I already see your point that it will parse _all_ tags, even if
they are commented out. I am thinking about how to solve this. I will
probably just take the chunks between comments and feed them to the
regular expressions.
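A minimal sketch of the "chunks between comments" idea, assuming comments are well-formed `<!-- ... -->` blocks that are never nested. The names `COMMENT` and `iter_tags` are made up for illustration and are not part of the class above; the tag pattern is the one quoted earlier. An unterminated comment would not be skipped by this approach.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME)

# Assumption: comments are well-formed and never nested.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def iter_tags(source):
    """Yield (name, attr) for every opening tag outside of HTML comments."""
    # split() returns only the text between comments, because the
    # COMMENT pattern contains no capturing groups.
    for chunk in COMMENT.split(source):
        for match in TAG.finditer(chunk):
            yield match.group("name"), match.group("attr")


html = '<a href="x.html">x</a> <!-- <img src="y.png"> --> <b>z</b>'
tags = [name for name, _ in iter_tags(html)]
```

Here the commented-out `img` tag is skipped, while `a` and `b` are still found.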
You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
than list.pop(index) (Is there a lib for this?)
If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
Which of the two would you recommend for handling large queues with fast
response times?
`Queue.Queue` builds on `collections.deque` and is thread safe. Speedwise
I don't think this makes a difference, as most of the time is spent on IO
and parsing. So if you make your spider multi-threaded to gain some speed,
go with `Queue.Queue`.
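For comparison, a small sketch of both options side by side. It is written for Python 3, where the `Queue` module is named `queue`; the URLs are placeholders.

```python
from collections import deque
import queue  # this module is called `Queue` in Python 2

# Single-threaded use: deque appends and pops at either end in O(1),
# unlike list.pop(0), which has to shift every remaining element.
pending = deque(["http://example.com/a", "http://example.com/b"])
pending.append("http://example.com/c")
first = pending.popleft()

# Multi-threaded use: queue.Queue adds locking and blocking put()/get().
shared = queue.Queue()
shared.put("http://example.com/a")
item = shared.get()
shared.task_done()  # tells the queue this item has been processed
```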
I think I will go for collections.deque (since I have no intention of
making it multi-threaded) and keep a list of queues, one per server, so
that the crawler finishes one server before moving on to the next
(Is this a good approach?).
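One way this per-server layout could look, as a rough sketch only: the class name `ServerQueues` and its methods are hypothetical, and it assumes Python 3 (`urllib.parse`) and that the host part of the URL is enough to identify a server.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class ServerQueues:
    """One FIFO queue of URLs per server, drained one server at a time."""

    def __init__(self):
        # host -> deque of URLs, kept in the order the hosts were first seen
        self._queues = OrderedDict()

    def add(self, url):
        host = urlsplit(url).netloc
        self._queues.setdefault(host, deque()).append(url)

    def next_url(self):
        """Return the next URL, staying on the current server until it is empty."""
        while self._queues:
            host = next(iter(self._queues))  # oldest host still present
            urls = self._queues[host]
            if urls:
                return urls.popleft()
            del self._queues[host]  # this server is finished, move on
        return None


q = ServerQueues()
q.add("http://a.example/1")
q.add("http://b.example/1")
q.add("http://a.example/2")
order = [q.next_url() for _ in range(4)]
```

Both `a.example` URLs come out before the `b.example` one, and `None` signals that nothing is left. Note that draining one host back to back sends its requests in quick succession, so a delay between fetches may be worth adding.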
Thanks a lot,
Thomas