Re: python fast HTML data extraction library
- From: Filip <pinkeen@xxxxxxxxx>
- Date: Sun, 26 Jul 2009 16:44:54 -0700 (PDT)
On Jul 23, 3:53 am, Paul McGuire <pt...@xxxxxxxxxxxxx> wrote:
# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal). Raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.
Thanks, I didn't know about that; I've updated my code.
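To make the raw-string advice concrete, here is a minimal sketch (written for Python 3; the pattern and the sample text are made up for illustration and are not taken from the library). It compiles the same expression once with doubled backslashes and once as a raw literal:

```python
import re

# Without a raw literal, every backslash meant for the regex engine
# has to be doubled so that Python passes it through unchanged.
plain = re.compile('\\d+\\s*=\\s*\\w+')

# With a raw literal, the pattern reads exactly as the regex engine sees it.
raw = re.compile(r'\d+\s*=\s*\w+')

# Both literals produce the same pattern string.
same_pattern = plain.pattern == raw.pattern

match = raw.search('width 42 = wide')
matched_text = match.group(0) if match else None
```

Both objects hold the same pattern; the raw form is simply easier to read and harder to get wrong.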
# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL |
re.UNICODE | re.IGNORECASE)
Of course, you mean that an attribute's *value* can be enclosed in
single or double quotes?
To be honest, I haven't seen the single-quote variant in HTML lately,
but I checked and it is in the specs, and it can even be quite useful
(you learn something every day).
Thank you for pointing that one out; I updated the code accordingly
(and just realized that the condition-check REs need an update too :/).
As for unquoted values, I am not so sure I need to support them. It
would significantly obfuscate my REs, and the practice is rather
deprecated, considered unsafe, and I've seen it only on very old
websites.
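For reference, one way the three quoting styles could be covered in a single expression is sketched below. This is not the library's actual code: the verbose layout, the unquoted branch, and the helper name parse_attrs are assumptions made for illustration (Python 3).

```python
import re

# Hypothetical variant of attr_re: the value may be double-quoted,
# single-quoted, or (third branch) not quoted at all.
attr_re = re.compile(
    r"""([\da-z]+)\s*=\s*      # attribute name, then the equals sign
        (?: "([^"]*)"          # double-quoted value
          | '([^']*)'          # single-quoted value
          | ([^\s"'>]+)        # unquoted value
        )""",
    re.VERBOSE | re.IGNORECASE,
)

def parse_attrs(tag_text):
    """Return a {name: value} dict for the attributes found in tag_text."""
    attrs = {}
    for name, dq, sq, bare in attr_re.findall(tag_text):
        # Exactly one of the three value groups can be non-empty.
        attrs[name.lower()] = dq or sq or bare
    return attrs

example = parse_attrs('<td class="a" id=\'b\' width=10>')
```

Using re.VERBOSE keeps the alternation readable, which may ease the obfuscation concern; dropping the third branch leaves only the quoted forms.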
How would you extract data from a table? For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ? This would be a good example
snippet for your module documentation.
This really seems like a nice example. I'll surely explain it in my
docs (examples are surely needed there ;)).
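As a point of comparison for such a docs example, here is a rough sketch of table extraction using only the standard library's html.parser (Python 3 module name). It does not use this library's API, and the SAMPLE markup is invented; it is not the content of the NIST page.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Invented sample data, mixing lower- and upper-case tags.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Address</th></tr>
  <tr><td>server-a.example</td><td>192.0.2.1</td></tr>
  <TR><TD>server-b.example</TD><TD>192.0.2.2</TD></TR>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows
```

After the run, rows holds one list of cell strings per table row, header row included.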
Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.
The library was used in my humble production environment, processing
over a few hundred thousand pages and producing about 10000 SQL
records, so it does work quite well for a simple task like extracting
all links. However, I can't really say that the task introduced enough
diversity (there were only 9 different page templates) to call the
library 'tested'...
On Jul 26, 5:51 pm, John Machin <sjmac...@xxxxxxxxxxx> wrote:
On Jul 23, 11:53 am, Paul McGuire <pt...@xxxxxxxxxxxxx> wrote:
On Jul 22, 5:43 pm, Filip <pink...@xxxxxxxxx> wrote:
# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)
Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for <br />
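Putting the three remarks about the line-break pattern together (case, attributes, and the XHTML form), a possible expression is sketched below. It is an illustration only, not the fix that went into the library, and it assumes no attribute value contains a literal '>'.

```python
import re

# \b stops '<brx>' from matching; [^>]* swallows an optional slash,
# whitespace, and any attributes up to the closing '>'.
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

samples = ['<br>', '<br/>', '<br />', '<BR CLEAR="ALL">', '<brx>', '<b>']
matches = [bool(line_break_re.fullmatch(s)) for s in samples]
```

The first four samples match and the last two do not.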
# what about HTML entities defined using hex syntax, such as &#xxxx;
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
but also &#160;
Also, entity names can contain digits e.g. &sup1; &frac34;
Thanks for pointing this out; I fixed that, although it has very
little impact on how the library performs its main task (I'd like to
see some comments on that ;)).
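For anyone following along, a pattern that treats named, decimal, and hex references alike might look like the sketch below. It is not necessarily the fix that was applied; the helper escape_bare_ampersands and its replacement of bare ampersands with &amp; are assumptions made for illustration (Python 3).

```python
import re

# Match an '&' that does NOT start a character reference. The lookahead
# accepts three forms: a name that may contain digits (&sup1;),
# a decimal number (&#160;), or a hex number (&#xa0;).
amp_re = re.compile(
    r'&(?!(?:[a-z][a-z\d]*|#\d+|#x[\da-f]+);)',
    re.IGNORECASE,
)

def escape_bare_ampersands(text):
    """Replace every ampersand that is not part of a reference with &amp;."""
    return amp_re.sub('&amp;', text)

escaped = escape_bare_ampersands(
    'a & b &nbsp; &#160; &#xA0; &sup1; &frac34; &bogus'
)
```

The pattern only checks the shape of a reference, not whether the name is actually defined, so something like &foo; is left alone.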