Re: Effective/Proper use of "regular expressions"



On 18 mar, 21:37, Richard Owlett <rowl...@xxxxxxxxxxxxx> wrote:
<snip>

> I don't really understand how to parse your second line.
> All I know is "IT WORKS".

So what you are saying is that you don't understand Bill's regular
expression. The explanation can be found in the re_syntax man page
you've been reading, but let's break it down. Here's the expression,
for ease of reference:

([0-9]+\.?[0-9]+)

Parentheses create sub-expressions. By including several
parenthesized elements in your regular expression, you can capture
several different pieces of a larger pattern and store each one in its
own variable, just as Bill has done with the variable "number". In
the expression above, the parentheses are superfluous, because no part
of the regular expression occurs outside of them; therefore, the
result of the sub-expression (stored in the variable "number") will
always be identical to the result of the whole expression (stored in
the variable "mv").
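
If you want to see this for yourself, here is a minimal sketch. The input line is made up for illustration; only the variable names "mv" and "number" come from Bill's post.

```tcl
# Hypothetical input; "mv" receives the whole match, "number" the
# first parenthesized sub-expression.
set line "reading: 12.5 volts"
regexp {([0-9]+\.?[0-9]+)} $line mv number
puts "mv=$mv number=$number"   ;# both variables hold 12.5
```

Since the parentheses wrap the whole pattern, the two variables always end up with the same text.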

Square brackets create "bracket expressions". These denote a set of
characters; the bracket expression will match any single character
from the set (unless the first character in the bracket expression is
"^"--in that case the expression will match any character which is NOT
in the set).

Inside a bracket expression, you can specify a range of characters
using a hyphen. The character before the hyphen is the first
character in the range; the character after the hyphen is the last
character in the range. The range 0-9 is the set of all digits, so
the bracket expression [0-9] will match any single digit.
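
A quick way to try bracket expressions at a tclsh prompt (the test strings are arbitrary examples of mine, not from the original problem):

```tcl
# [regexp] returns 1 if the pattern matches anywhere in the string, else 0.
set hasDigit  [regexp {[0-9]} "abc7"]   ;# 1: the string contains a digit
set hasDigit2 [regexp {[0-9]} "abc"]    ;# 0: no digit anywhere

# A leading ^ inverts the set: match the first character that is NOT a digit.
regexp {[^0-9]} "42x" first
puts "$hasDigit $hasDigit2 $first"      ;# prints: 1 0 x
```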

The plus sign is a quantifier; it tells the regular expression engine
to match one or more occurrences of the previous atom. In this case,
the previous atom is the bracket expression [0-9]; so the expression
[0-9]+ means "match one or more digits."
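
For instance (the sample strings are just illustrations):

```tcl
# [0-9]+ grabs the whole run of digits, not just the first one.
regexp {[0-9]+} "abc 2024 def" run
puts $run                                      ;# prints: 2024

# "One or more" means at least one digit is required.
set found [regexp {[0-9]+} "no digits here"]   ;# 0
```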

A backslash followed by a non-alphanumeric character tells the regular
expression engine to treat the character literally. (A period without
a preceding backslash is a special regular expression symbol, meaning
"match any single character"; a period immediately preceded by a
backslash means "match a literal period".)
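
Here is a small sketch of the difference, using strings I made up. Note that the patterns are written in braces; inside double quotes Tcl itself would process the backslash before the regular expression engine ever saw it.

```tcl
# An unescaped period matches any single character.
set anyHit [regexp {a.c} "abc"]     ;# 1: "." matches the "b"

# An escaped period matches only a literal period.
set litMiss [regexp {a\.c} "abc"]   ;# 0: there is no "." in "abc"
set litHit  [regexp {a\.c} "a.c"]   ;# 1

puts "$anyHit $litMiss $litHit"     ;# prints: 1 0 1
```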

A question mark is a quantifier; it tells the regular expression
engine to match zero or one occurrence of the previous atom. In this
case, the previous atom is \.; so the expression \.? means "match a
single period if one is present." I like to think of the ? as meaning
"optional", but more accurately it means "if the atom is there, match
it; otherwise, match an empty string."
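
To illustrate with a couple of made-up inputs:

```tcl
# The period is matched when it is there...
regexp {[0-9]+\.?} "12.5" a
# ...and silently skipped when it is not.
regexp {[0-9]+\.?} "12" b
puts "a=$a b=$b"   ;# prints: a=12. b=12
```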

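So, the sub-expression [0-9]+\.?[0-9]+ means "match one or more
digits, followed by an 'optional' period, followed by one or more
digits." Note that because at least one digit is required on each side
of the optional period, this expression needs at least two digits. It
will not match a single-digit number such as 7, nor a number that
begins or ends with the decimal point (.5, or 5 followed by a period).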
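
You can check the limits of Bill's original expression with a short loop. The sample values are my own picks, not data from your file:

```tcl
set re {([0-9]+\.?[0-9]+)}
set results {}
foreach s {7 42 3.14 .5 5.} {
    # 1 if the pattern matches somewhere in $s, 0 otherwise
    set ok [regexp $re $s]
    lappend results $ok
    puts "$s -> $ok"
}
# results is now: 0 1 1 0 0
```

Only 42 and 3.14 match, because they are the only values with a digit available on both sides of the optional period.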

The regular expression in my previous post was not a response to
Bill's; we were writing our replies at the same time and I didn't see
his until after I posted mine. Since you are learning about regular
expressions, though, let me explain mine and Bill's response to it.

The expression I proposed was:

\d+(\.\d+)?

\d is a "class-shorthand escape"; it matches any single digit and is
synonymous with Bill's bracket expression [0-9]. We've already
discussed all the other elements in the expression. The subexpression
\.\d+ matches a literal period followed by one or more digits. I
surrounded this in parentheses and quantified it with a question mark,
giving (\.\d+)? , which means "if possible, match one occurrence of the
following pattern: a literal period followed by one or more digits".
The entire expression \d+(\.\d+)? means "match one or more digits, as
well as one occurrence of the following pattern, if possible: a
literal period followed by one or more digits."

Bill pointed out a flaw in my pattern: it won't properly recognize
numbers that begin with a decimal point. Given ".5", for example, it
matches only the "5" and leaves out the decimal point.
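
Here is a sketch that shows both the normal behavior of my expression and the flaw (the sample values are arbitrary):

```tcl
set re {\d+(\.\d+)?}
set results {}
foreach s {7 123.46 .5} {
    regexp $re $s match
    lappend results $match
    puts "$s -> $match"
}
# results is now: 7 123.46 5
# The last value shows the problem: ".5" comes back as "5".
```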

Bill proposes the expression

(\d+(?:\.\d+)?|\d*\.\d+)

to address this issue. This expression includes three elements we
haven't seen before. First is the notation (?:...). This is called a
"non-capturing" set of parentheses. Non-capturing parentheses won't
create a variable containing the match of the sub-expression.
(Remember that the subexpression in Bill's original expression was
captured in the variable "number"; a non-capturing subexpression could
not be captured in this way.) Other than that they behave like
capturing parentheses.
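
One way to see the difference is the -inline switch of [regexp], which returns the whole match followed by each captured sub-expression as a list. The input here is just an example value:

```tcl
# Capturing parentheses: the fractional part is reported separately.
set withCapture [regexp -inline {\d+(\.\d+)?} "123.46"]
puts $withCapture      ;# prints: 123.46 .46

# Non-capturing parentheses: same match, but nothing extra is reported.
set withoutCapture [regexp -inline {\d+(?:\.\d+)?} "123.46"]
puts $withoutCapture   ;# prints: 123.46
```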

The second element we haven't seen before is the asterisk. This is
another quantifier telling the regular expression engine to match zero
or more occurrences of the preceding atom, which in this case is \d.
So, \d* means "match zero or more digits."
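
A short sketch, again with made-up inputs. The last call is worth a look: because "zero or more" is satisfied by nothing at all, \d* on its own matches an empty string.

```tcl
# Zero digits before the period is fine...
regexp {\d*\.\d+} ".75" a
# ...and so is any number of them.
regexp {\d*\.\d+} "12.75" b
puts "a=$a b=$b"               ;# prints: a=.75 b=12.75

# \d* alone succeeds even when there are no digits, matching "".
set n [regexp {\d*} "abc" c]   ;# n is 1, c is the empty string
```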

The final new element is the pipe. A regular expression or
subexpression may contain a number of "branches" separated by pipes;
the regular expression or subexpression will match any of the
branches. The first branch in the above expression is \d+(?:\.\d+)? ,
which means "match one or more digits and one occurrence of the
following if possible: a literal period and one or more digits." The
second branch is \d*\.\d+ , which means "match zero or more digits, a
literal period, and one or more digits." The entire subexpression
\d+(?:\.\d+)?|\d*\.\d+ means "match either of the following patterns:
1) one or more digits and the following if possible: a literal period
and one or more digits; 2) zero or more digits, a literal period, and
one or more digits."
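
Running Bill's two-branch expression over a few sample values (my own picks) shows that the leading decimal point is now handled:

```tcl
set re {\d+(?:\.\d+)?|\d*\.\d+}
set results {}
foreach s {7 123.46 .5} {
    regexp $re $s match
    lappend results $match
    puts "$s -> $match"
}
# results is now: 7 123.46 .5
```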

Again, in this case the parentheses around the entire expression are
superfluous because there is nothing in the expression that is not
also in the subexpression. But you can imagine a situation where the
parentheses would serve a purpose; for example, given a string such as
"{key foo 123.46 bar}", you might want to capture the number and then
the following word. To do this, you might use an expression like
this:

(\d+(?:\.\d+)?|\d*\.\d+)\s(\w+)

\s indicates a single whitespace character; \w indicates a single
alphanumeric character or underscore. The first set of capturing
parentheses captures the number; the second set captures the following
word. You'd use this expression in conjunction with a call to [regexp]
like this:

set string {key foo 123.46 bar}
set re {(\d+(?:\.\d+)?|\d*\.\d+)\s(\w+)}
regexp $re $string entireMatch number word
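
In a real script you would probably also check the return value of [regexp], which is 1 when the pattern matched and 0 when it did not. A sketch of the same call with that check added and the results printed:

```tcl
set string {key foo 123.46 bar}
set re {(\d+(?:\.\d+)?|\d*\.\d+)\s(\w+)}
if {[regexp $re $string entireMatch number word]} {
    puts "entireMatch = '$entireMatch'"   ;# '123.46 bar'
    puts "number      = $number"          ;# 123.46
    puts "word        = $word"            ;# bar
} else {
    puts "no match"
}
```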

I hope all of that helps to demystify regular expressions a bit.

Regards,
Aric