Re: python fast HTML data extraction library



On Jul 23, 3:53 am, Paul McGuire <pt...@xxxxxxxxxxxxx> wrote:
# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal).  raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.

Thanks, I didn't know about that, updated my code.
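
For anyone else following the thread, here is a minimal sketch of the difference. The pattern is only an illustration, not one taken from my library:

```python
import re

# Without a raw string, every backslash meant for the regex engine
# has to be doubled so that Python does not interpret it first.
plain = re.compile('\\bhref\\s*=')

# With a raw string, the literal reads exactly as the regex engine sees it.
raw = re.compile(r'\bhref\s*=')

# Both literals produce the same pattern text.
print(plain.pattern == raw.pattern)                # True
print(raw.search('<a href = "x">') is not None)    # True
```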

# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL |
                     re.UNICODE | re.IGNORECASE)

Of course, you mean that an attribute's *value* can be enclosed in
single or double quotes?
To be honest, I haven't seen the single-quote variant in HTML lately,
but I checked and it is in the specs, and it can even be quite useful
(you learn something every day).
Thank you for pointing that out. I updated the code accordingly
(and just realized that the condition-check REs need an update too :/).

As far as unquoted values are concerned, I am not so sure I need to
support them. It would significantly obfuscate my REs, and the practice
is rather deprecated, considered unsafe, and I've seen it only on very
old websites.
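
That said, here is a rough sketch of how all three forms could be covered with a single alternation, in case it turns out to be needed. ATTR_RE and parse_attrs are names made up for this example and are not the library's actual code. The name part is kept as in attr_re, so hyphenated attribute names are not handled, and Python 3 syntax is assumed:

```python
import re

# One alternative per quoting style: "double", 'single', or unquoted.
# An unquoted value ends at whitespace, a quote character, or '>'.
ATTR_RE = re.compile(
    r'''([\da-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)

def parse_attrs(tag_text):
    """Return a dict of attribute name -> value found in tag_text."""
    attrs = {}
    for m in ATTR_RE.finditer(tag_text):
        name = m.group(1).lower()
        # Exactly one of the three value groups takes part in a match.
        value = next(g for g in m.groups()[1:] if g is not None)
        attrs[name] = value
    return attrs

print(parse_attrs('<td class="a b" id=\'x\' width=10>'))
# {'class': 'a b', 'id': 'x', 'width': '10'}
```

The extra alternatives cost one longer line, but each branch stays simple because the value is matched with a negated character class instead of a non-greedy .*? under re.DOTALL.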

How would you extract data from a table?  For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ?  This would be a good example
snippet for your module documentation.

This really is a nice example. I'll certainly cover it in my docs
(examples are definitely needed there ;)).
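
Until the docs are written, here is a sketch of the general idea. Note that it uses the standard library's html.parser (Python 3) rather than my library, and it is fed a small made-up table instead of the live NIST page. TableRows and the sample hosts are hypothetical, and nested tables are not handled:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row being read, or None
        self._cell = None    # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. newlines between rows) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up sample data (documentation-only host names and addresses).
SAMPLE = """
<table>
<tr><th>Name</th><th>IP Address</th></tr>
<tr><td>host-a.example</td><td>192.0.2.1</td></tr>
<tr><td>host-b.example</td><td>192.0.2.2</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
for row in parser.rows:
    print(row)
# ['Name', 'IP Address']
# ['host-a.example', '192.0.2.1']
# ['host-b.example', '192.0.2.2']
```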

Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.

The library was used in my humble production environment, processing a
few hundred thousand+ pages and producing about 10,000 SQL records, so
it does work quite well for a simple task like extracting all links.
However, I can't really say that the task introduced enough diversity
(there were only 9 different page templates) to call the library
'tested'...

On Jul 26, 5:51 pm, John Machin <sjmac...@xxxxxxxxxxx> wrote:
On Jul 23, 11:53 am, Paul McGuire <pt...@xxxxxxxxxxxxx> wrote:

On Jul 22, 5:43 pm, Filip <pink...@xxxxxxxxx> wrote:

# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)

Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for <br />

# what about HTML entities defined using hex syntax, such as &#xxxx;
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
but also &#160;

Also, entity names can contain digits e.g. &sup1; &frac34;

Thanks for pointing this out; I fixed that, although it has very
little impact on how the library performs its main task (I'd like to
see some comments on that ;)).
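
For reference, here is a sketch of patterns that would cover the cases raised above. It is not necessarily the exact code now in the library. LINE_BREAK_RE, BARE_AMP_RE and escape_bare_amps are names made up for this example, and it assumes that amp_re is meant to find ampersands that do not start an entity:

```python
import re

# <br>, <br/>, <br /> and <BR CLEAR="ALL">, in any letter case.
# \b keeps tags such as <broken> from matching. An attribute value
# containing '>' would still confuse this simple pattern.
LINE_BREAK_RE = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

# A bare '&' is one that does not begin a named entity (&nbsp;, &sup1;),
# a decimal one (&#160;) or a hexadecimal one (&#xa0;).
BARE_AMP_RE = re.compile(
    r'&(?!(?:[a-z][a-z\d]*|#\d+|#x[\da-f]+);)',
    re.IGNORECASE,
)

def escape_bare_amps(text):
    """Replace ampersands that do not start an entity with &amp;."""
    return BARE_AMP_RE.sub('&amp;', text)

print(LINE_BREAK_RE.sub('|', 'a<br>b<br/>c<br />d<BR CLEAR="ALL">e'))
# a|b|c|d|e
print(escape_bare_amps('AT&T &nbsp; &#160; &#xa0; &sup1; a & b'))
# AT&amp;T &nbsp; &#160; &#xa0; &sup1; a &amp; b
```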