Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)
From: John Dunlop (john+usenet_at_johndunlop.info)
Date: 03/05/04
- Next message: Annie: "Re: Session_ID()"
- Previous message: Xenophobe: "Re: Upload File Type Problem"
- In reply to: Pedro Graca: "Re: Help with a regular expression"
- Next in thread: Pedro Graca: "Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)"
- Reply: Pedro Graca: "Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Fri, 5 Mar 2004 16:49:44 -0000
Pedro Graca wrote:
> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)
I believe the principal reason why pre-written parsers are suggested
and recommended instead of impromptu regular expression "one-liners"
is that the gurus who've written and developed the parsers are
usually aware of and understand the rules; the "one-line" regex
implementors, on the other hand -- with all due respect -- generally
aren't and don't. I'm not going to pretend I understand everything
SGML; I certainly don't; I'm far too young for starters.
I'd like to pass a few comments, nevertheless, which might change
your mind about regular expressions for parsing (X)HTML. They
changed my mind, anyway. You'll understand though, hopefully, why I
haven't offered any regular expression in place of yours (no, it's
not because I couldn't be bothered :-)).
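For comparison, here is a rough sketch of what the parser route might look like in PHP. It assumes the DOM extension (built on libxml2) is available; the sample markup and the variable names are made up for illustration, and this is not anybody's recommended implementation, just one way to do it.

```php
<?php
// Sketch using PHP's DOM extension (ext/dom, built on libxml2).
// The markup below is an invented sample.
$html = '<form><INPUT title=">" name="q"><INPUT></form>';

libxml_use_internal_errors(true);   // collect parse warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);

$titles = array();
foreach ($doc->getElementsByTagName('input') as $input) {
    // getAttribute() returns "" when the attribute is absent.
    $titles[] = $input->getAttribute('title');
}

print_r($titles);   // the first entry is ">", the second is ""
```

The parser already knows where a tag ends and how attribute values are quoted, so none of that has to be re-encoded in a pattern.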
(Trying to cope with shorthand markup when using regexes would be a
nightmare. Unlike proper parsers, I'm going to act like a browser
and ignore shorthand markup, for the time being, as it'd complicate
matters even more.)
> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);
There's the standard mistake: the next occurrence of ">" does not
necessarily mark the end of the tag. In HTML, a ">" can appear in
*quoted* attribute values; it cannot appear in unquoted attribute
values. This, for example, is a valid INPUT element (I make no
claims to its logicality!)
<INPUT title=">">
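To see the mistake in action, here is a small sketch that runs the quoted pattern over an INPUT like the one above. The extra name attribute and the variable $truncated are mine, added only for illustration.

```php
<?php
// The pattern quoted above, applied to a valid INPUT whose quoted
// attribute value contains ">".
$contents = '<INPUT title=">" name="q">';

preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

// [^>]+ cannot cross the ">" inside the quotes, so the match stops there
// and the name attribute is lost.
$truncated = $inputs[1][0];
echo $truncated, "\n";   // <INPUT title=">
```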
Also, INPUTs have no required attributes (that is, "<INPUT>" is
valid), but the "+" quantifier matches *one* or more of whatever came
before. To over-simplistically match INPUTs, I'd substitute "*" for
"+". Since you're only wanting to match those INPUTs with explicit
type, name and value attributes, however, that's inconsequential.
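A quick sketch of the difference between the two quantifiers, using a bare INPUT. The variable names are illustrative, and the "*" version is still over-simplistic: it would happily match "<inputfoo>" as well.

```php
<?php
// "<INPUT>" is valid HTML 4.01, but "+" demands at least one character
// between "<input" and ">".
$bare = '<INPUT>';

$withPlus = preg_match('@<input[^>]+>@i', $bare);
$withStar = preg_match('@<input[^>]*>@i', $bare);

echo $withPlus, ' ', $withStar, "\n";   // 0 1
```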
> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
An SGML name begins with a name start character and is followed by
zero or more name characters. You'd match a name, for HTML4.01, with
the pattern
[a-zA-Z][a-zA-Z0-9._:-]*
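As a sketch, the name rule can be tested against whole strings like this. The helper is_sgml_name() is hypothetical, and the hyphen is placed last in the character class so that PCRE reads it as a literal hyphen rather than as a range between its neighbours.

```php
<?php
// Hypothetical helper: true if $s is a name as described above
// (a letter, then zero or more letters, digits, ".", "-", "_", ":").
function is_sgml_name($s)
{
    // \A and \z anchor the whole string; the hyphen comes last in the
    // class so that it is a literal, not a range.
    return preg_match('/\A[a-zA-Z][a-zA-Z0-9._:-]*\z/', $s) === 1;
}

var_dump(is_sgml_name('type'));        // bool(true)
var_dump(is_sgml_name('http-equiv'));  // bool(true)
var_dump(is_sgml_name('xml:lang'));    // bool(true)
var_dump(is_sgml_name('1st'));         // bool(false)
var_dump(is_sgml_name('a=b'));         // bool(false)
```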
An attribute value may be of length zero, so, again, the quantifier
"*" ought to be used. And inside quoted attribute values, both "<"
and ">" can appear. Alvaro G Vicario has just pointed this out too,
in an article in the thread "php sticky forms",
<news:1qih21wt0xy4e$.1f5ehf0s1tf5a$.dlg@40tude.net>.
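Both points can be checked with a short sketch. The sample INPUT and the loosened pattern are assumptions of mine, not a drop-in replacement: the loosened pattern still knows nothing about single quotes, shorthand markup or text that merely looks like an attribute.

```php
<?php
// Two valid attributes that the quoted double-quote pattern skips:
// an empty value, and a value containing ">".
$input = '<INPUT name="q" value="" title="a > b">';

// Pattern as quoted above.
preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $strict);

// Loosened sketch: SGML-style name, then any run (possibly empty) of
// characters other than the closing double quote.
preg_match_all('@([a-zA-Z][a-zA-Z0-9._:-]*)\s*=\s*"([^"]*)"@', $input, $loose);

print_r($strict[2]);  // only "name"
print_r($loose[1]);   // "name", "value", "title"
print_r($loose[2]);   // "q", "", "a > b"
```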
> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
Ditto.
> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
Unquoted attribute values may only contain name characters. In
HTML4.01, the pattern
[a-zA-Z0-9._:-]*
matches name characters.
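A sketch of checking whether a value could be left unquoted. The helper may_be_unquoted() is hypothetical, and it uses "+" rather than "*" on the assumption that an unquoted value needs at least one character.

```php
<?php
// Hypothetical helper: may $value be written without quotes in HTML 4.01?
// Assumption: an unquoted value needs at least one character, hence "+".
function may_be_unquoted($value)
{
    // Hyphen last in the class so that it is a literal, not a range.
    return preg_match('/\A[a-zA-Z0-9._:-]+\z/', $value) === 1;
}

var_dump(may_be_unquoted('text'));      // bool(true)
var_dump(may_be_unquoted('20'));        // bool(true)
var_dump(may_be_unquoted('a/b.html'));  // bool(false): "/" must be quoted
var_dump(may_be_unquoted('50%'));       // bool(false): "%" must be quoted
```

Note that the quoted pattern for unquoted values, with its class [^\s<>"\']+, would accept both of the last two values.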
Phew!
Refs.:
http://www.w3.org/TR/html401/sgml/sgmldecl.html
http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm
-- Jock