Re: "negative" regexp



"PV" == Petr Vileta <stoupa@xxxxxxxxxxxxx> writes:

PV> No, I mean not ideal for universal use. I have a concrete goal and I
PV> use as few resources as possible. For example, if I want to extract
PV> clickable email addresses from HTML source, I need to extract all
PV> /href=['"]*mailto:\s*(.+?)['"\s>/
PV> only.

besides the typo (no closing ] on the final character class), that
wouldn't always work. it allows an opening ' to be paired with a closing
", which is wrong. it doesn't handle html comments, which shouldn't be
parsed for email addresses. there are other problems with it that i
can't get into. so even a 'simple' thing like that is much harder to
extract with a regex than you think. use a module designed and tested to
parse html and email addresses. it is actually simpler coding from your
point of view and correct as well! and correct beats efficient every day.
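
to illustrate the module route, here is a minimal sketch (not code from
this thread) assuming HTML::Parser is installed from CPAN and its
version 3 handler API is used. the sub name extract_mailto and the
sample addresses are made up for the example. it only looks at start
tags, so links inside html comments are never seen, and the quoting of
the attribute value is the parser's problem rather than the regex's.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (name invented for this sketch): return the
# addresses found in <a href="mailto:..."> links of an HTML string.
sub extract_mailto {
    my ($html) = @_;
    my @found;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Only start tags are inspected. No comment handler is registered,
        # so markup inside <!-- ... --> is never reported to this code.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'a' && defined $attr->{href};
                # Keep the address and drop any ?subject=... query part.
                push @found, $1
                    if $attr->{href} =~ /^\s*mailto:\s*([^?\s]+)/i;
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return @found;
}

my $html = <<'HTML';
<a href="mailto:alice@example.com">Alice</a>
<!-- <a href="mailto:old@example.com">retired</a> -->
<a href='mailto:bob@example.com?subject=hi'>Bob</a>
HTML

print "$_\n" for extract_mailto($html);
```

with the sample input above, the commented-out link should be skipped
and the other two addresses printed. this sketch does not validate the
addresses themselves; a dedicated email address module would be the
place for that.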

uri

PV> HTML::Parser and WWW::Mechanize are good modules but in many cases these
PV> are "too big gun" :-)

better a big accurate gun than a tiny pistol with no accuracy. you might
even shoot your eye out!

uri

--
Uri Guttman ------ uri@xxxxxxxxxxxxxxx -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------