Re: WebCrawler (was: 'Question concerning this list')



Marc 'BlackJack' Rintsch wrote:
In <mailman.2169.1167563637.32031.python-list@xxxxxxxxxx>, Thomas Ploch
wrote:

This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

Have you tested this with tags inside comments?

No, but I already see your point: it will parse _all_ tags, even those
that are commented out. I am thinking about how to solve this. I will
probably just take the chunks between comments and feed them to the
regular expressions.
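
A minimal sketch of that idea, assuming comments are well-formed
`<!-- ... -->` blocks and that none of them sit inside scripts or attribute
values. `strip_comments` and `find_tag_names` are hypothetical helper names
for illustration, not part of the class above; the tag pattern is the one
quoted earlier in the thread:

```python
import re

# Same name and tag patterns as in the quoted class.
NAME_PATTERN = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG_PATTERN = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME_PATTERN)

# Assumption: a comment runs from "<!--" to the nearest "-->".
# re.DOTALL lets "." match newlines so multi-line comments are covered.
COMMENT_PATTERN = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(source):
    """Return the source with all comment blocks removed."""
    return COMMENT_PATTERN.sub("", source)


def find_tag_names(source):
    """Return the names of opening tags that are not commented out."""
    cleaned = strip_comments(source)
    return [match.group("name") for match in TAG_PATTERN.finditer(cleaned)]
```

Removing the comments first is equivalent to feeding only the chunks between
comments to the regexes, as long as no tag is split across a comment.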

You are probably right. For me it boils down to these problems:
- Implementing a stack for large queues of documents which is faster
  than list.pop(index) (Is there a lib for this?)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.

Which of the two would you recommend for handling large queues with fast
response times?

`Queue.Queue` builds on `collections.deque` and is thread-safe. Speed-wise
I don't think it makes a difference, as most of the time is spent on I/O
and parsing. So if you make your spider multi-threaded to gain some speed,
go with `Queue.Queue`.
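
For comparison, a short sketch of both options. It assumes Python 3, where
the `Queue` module is named `queue`; the URLs are placeholders made up for
the example:

```python
import queue
from collections import deque

# Single-threaded FIFO: deque.popleft() is O(1), whereas list.pop(0)
# has to shift every remaining element.
pending = deque(["http://example.com/a", "http://example.com/b"])
pending.append("http://example.com/c")
first = pending.popleft()

# Thread-safe FIFO: put() and get() take care of the locking, so several
# worker threads can share the same queue.
shared = queue.Queue()
shared.put("http://example.com/a")
shared.put("http://example.com/b")
first_shared = shared.get()
```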

I think I will go for collections.deque (since I have no intention of
making it multi-threaded) and keep several queues in a list, one for each
server, so that one server is finished completely before the crawler
moves on to the next one (Is this a good approach?).
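
One way the one-queue-per-server idea could look, as a rough sketch under
stated assumptions: Python 3.7+ (so plain dicts keep insertion order), and
the host taken from `urllib.parse.urlsplit`. `HostQueues` and its methods
are hypothetical names; a dict keyed by host is used here in place of a
list of queues so a URL's queue can be found directly:

```python
from collections import deque
from urllib.parse import urlsplit


class HostQueues:
    """Keep one FIFO queue per host and drain the oldest host first."""

    def __init__(self):
        # host -> deque of URLs; dict order is insertion order.
        self._queues = {}

    def add(self, url):
        """Append the URL to the queue of its host."""
        host = urlsplit(url).netloc
        self._queues.setdefault(host, deque()).append(url)

    def next_url(self):
        """Return the next URL, or None when every queue is empty."""
        while self._queues:
            host = next(iter(self._queues))
            urls = self._queues[host]
            if urls:
                return urls.popleft()
            # This host is finished; drop it and move on to the next one.
            del self._queues[host]
        return None
```

With URLs added in the order `a/1`, `b/1`, `a/2`, this hands out both URLs
for host `a` before the one for host `b`.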

Thanks a lot,
Thomas

