Re: "negative" regexp



"PV" == Petr Vileta <stoupa@xxxxxxxxxxxxx> writes:

PV> No, I mean it is not ideal for universal use. I have a concrete goal and
PV> I use as few resources as possible. For example, if I want to extract
PV> clickable email addresses from HTML source, I only need to extract all
PV> /href=['"]*mailto:\s*(.+?)['"\s>/
PV> matches.

besides the typo (no close ] on the right), that wouldn't always
work. it allows for an open ' and a closing " which is wrong. it doesn't
handle html comments which shouldn't be parsed for email
addresses. there are other problems with it that i can't get into. so
even a 'simple' thing like that is much harder to extract with a regex
than you think. use a module designed and tested to parse html and email
addresses. it is actually simpler coding from your point of view and
correct as well! and correct beats efficient every day.
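
a minimal sketch of the first two problems, using only core perl. it
assumes the intended pattern is the quoted one with the missing ] put
back, and the two sample strings are made up for illustration. it only
demonstrates the failure; it is not a suggested way to extract addresses.

```perl
use strict;
use warnings;

# The quoted pattern with its missing "]" restored. The restored form is an
# assumption about what was intended.
my $re = qr/href=['"]*mailto:\s*(.+?)['"\s>]/;

# Hypothetical sample inputs, invented for this illustration.
my %sample = (
    mismatched_quotes => q{<a href='mailto:a@example.com">a</a>},
    inside_comment    => q{<!-- <a href="mailto:old@example.com">old</a> -->},
);

# Collect every capture the pattern produces for each sample.
my %found;
for my $name (sort keys %sample) {
    my @hits = $sample{$name} =~ /$re/g;
    $found{$name} = \@hits;
    print "$name: @hits\n";
}
```

both samples produce a match: the pattern accepts the attribute that
opens with ' and closes with ", and it also reports the address that only
exists inside an html comment.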

uri

PV> HTML::Parser and WWW::Mechanize are good modules but in many cases they
PV> are "too big a gun" :-)

better a big accurate gun than a tiny pistol with no accuracy. you might
even shoot your eye out!

uri

--
Uri Guttman ------ uri@xxxxxxxxxxxxxxx -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------