Re: "negative" regexp



"PV" == Petr Vileta <stoupa@xxxxxxxxxxxxx> writes:

PV> No, I mean not ideal for universal use. I have a concrete goal and I
PV> use as few resources as possible. For example, if I want to extract
PV> clickable email addresses from HTML source I need to extract all
PV> /href=['"]*mailto:\s*(.+?)['"\s>/
PV> only.

besides the typo (no close ] on the right), that wouldn't always
work. it allows for an open ' and a closing " which is wrong. it doesn't
handle html comments which shouldn't be parsed for email
addresses. there are other problems with it that i can't get into. so
even a 'simple' thing like that is much harder to extract with a regex
than you think. use a module designed and tested to parse html and email
addresses. it is actually simpler coding from your point of view and
correct as well! and correct beats efficient every day.
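
To make the objections above concrete, here is a minimal sketch that runs the quoted pattern against a few inputs. It assumes the only repair is restoring the missing ']'. The sample HTML strings and the extract() helper are made up for illustration. The last two cases (uppercase attribute name, query string) are extra examples of this kind of failure, not ones named in the post.

```perl
use strict;
use warnings;

# The quoted pattern with the missing ']' restored; nothing else is changed.
my $re = qr/href=['"]*mailto:\s*(.+?)['"\s>]/;

# Hypothetical inputs, made up for this illustration.
my %html = (
    good       => q{<a href="mailto:alice@example.com">mail</a>},
    mismatched => q{<a href='mailto:bob@example.com">mail</a>},
    commented  => q{<!-- <a href="mailto:old@example.com">gone</a> -->},
    uppercase  => q{<A HREF="mailto:carol@example.com">mail</A>},
    query      => q{<a href="mailto:dave@example.com?subject=hi">mail</a>},
);

# Return the captured text, or undef when the pattern does not match.
sub extract {
    my ($text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $name (sort keys %html) {
    my $got = extract($html{$name});
    printf "%-10s => %s\n", $name, defined $got ? $got : '(no match)';
}
```

The mismatched-quote and commented-out links are both accepted, the uppercase HREF is missed entirely, and the query case returns the address with "?subject=hi" still attached.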

uri

PV> HTML::Parser and WWW::Mechanize are good modules but in many cases these
PV> are "too big a gun" :-)

better a big accurate gun than a tiny pistol with no accuracy. you might
even shoot your eye out!
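
For a sense of scale, here is a rough sketch of the module route using HTML::Parser (a CPAN module, so it has to be installed). The mailto_links() helper, the sample page, and the choice to cut the address at the first '?' are assumptions made for this illustration. It does not validate the address itself, which would be a job for a dedicated email module.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not part of core Perl

# Collect the address from every <a> start tag whose href uses mailto:.
sub mailto_links {
    my ($html) = @_;
    my @found;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each start tag; tag and attribute names arrive lowercased.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'a';
                my $href = $attr->{href};
                return unless defined $href;
                # Keep the part after "mailto:" up to any query string.
                push @found, $1 if $href =~ /^\s*mailto:\s*([^?\s]+)/i;
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Hypothetical input, made up for this illustration.
my $page = <<'HTML';
<!-- <a href="mailto:old@example.com">gone</a> -->
<A HREF='mailto:carol@example.com?subject=hi'>mail</A>
<a href="http://example.com/">not mail</a>
HTML

print "$_\n" for mailto_links($page);
```

The parser takes care of quoting, attribute case, and comments, so the only pattern left is the small one applied to a value that is already known to be an href. With the sample page above, only the carol address should come back.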

uri

--
Uri Guttman ------ uri@xxxxxxxxxxxxxxx -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------