Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

From: John Dunlop (john+usenet_at_johndunlop.info)
Date: 03/05/04


Date: Fri, 5 Mar 2004 16:49:44 -0000

Pedro Graca wrote:

> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)

I believe the principal reason why pre-written parsers are suggested
and recommended instead of impromptu regular expression "one-liners"
is that the gurus who've written and developed the parsers are
usually aware of and understand the rules; the "one-line" regex
implementors, on the other hand -- with all due respect -- generally
aren't and don't. I'm not going to pretend I understand everything
about SGML; I certainly don't. I'm far too young, for starters.

I'd like to pass a few comments, nevertheless, which might change
your mind about regular expressions for parsing (X)HTML. They
changed my mind, anyway. You'll understand though, hopefully, why I
haven't offered any regular expression in place of yours (no, it's
not because I couldn't be bothered :-)).

(Trying to cope with shorthand markup when using regexes would be a
nightmare. Unlike proper parsers, I'm going to act like a browser
and ignore shorthand markup, for the time being, as it'd complicate
matters even more.)

> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

There's the standard mistake: the next occurrence of ">" does not
necessarily mark the end of the tag. In HTML, a ">" can appear in
*quoted* attribute values; it cannot appear in unquoted attribute
values. This, for example, is a valid INPUT element (I make no
claims to its logicality!):

<INPUT title=">">
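
To see the failure concretely, here is a minimal sketch in Python
(its re module standing in for PHP's preg functions; that swap is
mine, not part of the quoted code). NAIVE_INPUT is an illustrative
name, and the "U" modifier is approximated with a lazy quantifier.

```python
import re

# The quoted pattern carried over to Python: the "i" modifier becomes
# re.IGNORECASE, and "U" (ungreedy) is approximated by "+?".
NAIVE_INPUT = re.compile(r'(<input[^>]+?>)', re.IGNORECASE)

html = '<INPUT title=">">'

# The match ends at the first ">", which sits inside the quoted value.
naive_match = NAIVE_INPUT.search(html).group(1)
print(naive_match)         # the truncated tag
print(naive_match == html) # False: the real end of the tag was missed
```

The match stops at the ">" inside the title value, so the closing
quote and the true end of the tag are lost.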

Also, INPUTs have no required attributes (that is, "<INPUT>" is
valid), but the "+" quantifier matches *one* or more of whatever came
before. To over-simplistically match INPUTs, I'd substitute "*" for
"+". Since you're only wanting to match those INPUTs with explicit
type, name and value attributes, however, that's inconsequential.
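
The difference between the two quantifiers can be sketched as
follows, again in Python rather than PHP (my substitution). PLUS and
STAR are illustrative names, and both patterns remain deliberately
over-simple.

```python
import re

# "+" demands at least one character between "<input" and ">".
PLUS = re.compile(r'<input[^>]+>', re.IGNORECASE)
# "*" also accepts a bare tag with no attributes at all.
STAR = re.compile(r'<input[^>]*>', re.IGNORECASE)

bare = '<INPUT>'

plus_result = PLUS.search(bare)  # no match: nothing for "+" to consume
star_result = STAR.search(bare)  # matches the whole bare tag

print(plus_result)
print(star_result.group(0))
```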

> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);

An SGML name begins with a name start character and is followed by
zero or more name characters. You'd match a name, for HTML4.01, with
the pattern

[a-zA-Z][a-zA-Z0-9._:-]*
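
As a rough check of the name rule, here is a small Python sketch
(Python's re module rather than PHP; SGML_NAME and is_name are
hypothetical names of my own). One assumption worth stating loudly:
the hyphen goes last in the character class so that it is read
literally, since a hyphen between two other characters defines a
range.

```python
import re

# Name start character (a letter), then zero or more name characters.
# The hyphen is placed last so it is taken literally.
SGML_NAME = re.compile(r'[a-zA-Z][a-zA-Z0-9._:-]*')

def is_name(candidate: str) -> bool:
    """True if the whole string is a name under the pattern above."""
    return SGML_NAME.fullmatch(candidate) is not None

# Written as ".-_", the hyphen would instead define a range from "."
# to "_", which takes in characters such as "=" and ">".
range_accepts_equals = re.fullmatch(r'[.-_]', '=') is not None

for sample in ('type', 'http-equiv', 'xml:lang', '1st', 'a=b'):
    print(sample, is_name(sample))
print(range_accepts_equals)
```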

An attribute value may be of length zero, so, again, the quantifier
"*" ought to be used. And inside quoted attribute values, both "<"
and ">" can appear. Alvaro G Vicario has just pointed this out too,
in an article in the thread "php sticky forms",

<news:1qih21wt0xy4e$.1f5ehf0s1tf5a$.dlg@40tude.net>.

> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);

Ditto.
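
Both quoted forms might be sketched like this, in Python rather than
PHP (my substitution). NAME, DOUBLE_QUOTED and SINGLE_QUOTED are
illustrative names; the value part uses "*" and excludes only the
closing quote character, so empty values and values containing "<"
or ">" are accepted.

```python
import re

# Name pattern for HTML 4.01, hyphen last so it is literal.
NAME = r'[a-zA-Z][a-zA-Z0-9._:-]*'

# The value may be empty and may contain "<" and ">"; only the
# delimiting quote character is excluded.
DOUBLE_QUOTED = re.compile(rf'({NAME})\s*=\s*"([^"]*)"')
SINGLE_QUOTED = re.compile(rf"({NAME})\s*=\s*'([^']*)'")

tag = '''<INPUT title=">" value="" alt='a < b'>'''

double_pairs = DOUBLE_QUOTED.findall(tag)
single_pairs = SINGLE_QUOTED.findall(tag)

print(double_pairs)
print(single_pairs)
```

Running the two patterns independently is itself a simplification: a
double-quoted value that happens to contain something like x='y'
would also be picked up by the single-quoted pass. That sort of
interaction is exactly what a real parser handles for you.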

> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);

Unquoted attribute values may only contain name characters. In
HTML4.01, the pattern

[a-zA-Z0-9._:-]*

matches name characters.
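
A sketch of the unquoted case, once more in Python rather than PHP
(my substitution; NAME and UNQUOTED are illustrative names). Two
assumptions of mine: the value uses "+" because an unquoted value
needs at least one character to exist at all, and a lookahead
requires whitespace, ">" or the end of the string after the value.

```python
import re

# Name pattern for HTML 4.01, hyphen last so it is literal.
NAME = r'[a-zA-Z][a-zA-Z0-9._:-]*'

# An unquoted value is a run of name characters only. The lookahead
# insists that the value is followed by whitespace, ">" or the end.
UNQUOTED = re.compile(rf'({NAME})\s*=\s*([a-zA-Z0-9._:-]+)(?=[\s>]|$)')

tag = '<INPUT type=text name=user.name size=20>'

unquoted_pairs = UNQUOTED.findall(tag)
print(unquoted_pairs)
```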

Phew!

Refs.:

 http://www.w3.org/TR/html401/sgml/sgmldecl.html
 http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm

-- 
Jock

