Re: Regular Expression
- From: "Michael A. Cleverly" <michael@xxxxxxxxxxxx>
- Date: Tue, 2 Jan 2007 19:29:20 -0700
On Tue, 2 Jan 2007, zowtar wrote:
I have the matches below... I want a regular expression that extracts only
the URL...
MATCH:
href="http://site.com/0,,NEWS39104-EI8090,00.html"
href="javascript:abre('http://site.com/0,,NEWS39104-EI8090,00.html','Gallery39104','660','500','no');"
CODE 01:
href="(?:.*)?((?:ftp|http|https)://(?:[^:/]+)(?::[0-9]{1,5})?(?:/.*)?.+)"
Changing your final .+ to [^'"]+ could help...
return
OK - http://site.com/0,,NEWS39104-EI8090,00.html
ERROR -
http://site.com/0,,NEWS39104-EI8090,00.html','Gallery39104','660','500','no');
CODE 02:
href="(?:.*)?((?:ftp|http|https)://(?:[^:/]+)(?::[0-9]{1,5})?(?:/.*)?.+?)(?:\',\'.*\',\'.*\',\'.*\',\'.*\'\);)?"
You seem to be using a comma for alternation, that is, to specify one of
several alternatives (since I can't imagine any HTML fragment that would
list them all in that order separated by commas). However, alternation in
a regular expression is written with | rather than a comma.
return
OK - http://site.com/0,,NEWS39104-EI8090,00.html
ERROR -
http://site.com/0,,NEWS39104-EI8090,00.html','Gallery39104','660','500','no');
If I were writing a regular expression to pluck out the URLs in your
example I'd use:
set RE {(?xi)                      # an expanded, case-insensitive regexp
    (?:https?|ftp)://              # protocol
    [^"'/"]+                       # host, possibly with a port, or user:pass@ for ftp
    (?:/~?[a-z%0-9,._+?&=/-]+)?    # path; other chars should be urlencoded (%## ...)
}
Note: I put double quotes in the negated character set [^"'/"] twice only
to make my syntax highlighting editor happy...
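As a rough sketch of how the expression might be applied to your two sample
lines: the proc name extractUrls is made up for this illustration, and the
negated set is written with a single double quote, since the duplicate is
only there for my editor. Because every group is non-capturing,
regexp -all -inline returns just the full matches.

```tcl
# Same pattern as above, written with a single " in the negated set.
set RE {(?xi)                      # expanded, case-insensitive regexp
    (?:https?|ftp)://              # protocol
    [^"'/]+                        # host, possibly with a port or user:pass@
    (?:/~?[a-z%0-9,._+?&=/-]+)?    # optional path
}

# extractUrls is a hypothetical helper, not part of Tcl itself.
# It returns a list of every URL the pattern finds in $text.
proc extractUrls {text} {
    global RE
    return [regexp -all -inline -- $RE $text]
}

# The two sample lines from the original question.
set samples [list \
    {href="http://site.com/0,,NEWS39104-EI8090,00.html"} \
    {href="javascript:abre('http://site.com/0,,NEWS39104-EI8090,00.html','Gallery39104','660','500','no');"}]

foreach line $samples {
    puts [extractUrls $line]
}
```

The match stops at the first quote character after the path, so the
trailing JavaScript arguments in the second sample are left out.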
Michael