Re: Regex: deleting non-matching words



On Aug 22, 2:06 pm, pete <no_one_you_k...@xxxxxxxxxxxxxxxxxx> wrote:
I have input strings where some words start with an underscore. The plan
is to remove all words that do NOT start with an underscore and simply
keep the rest. So for example starting with
"word1 word2 _word3 word4 word5 _word6 _word7 word8"

I'm trying to end up with
"_word3 _word6 _word7"

The expression I have got so far is s/.*?(_[a-z0-9]+).*?/ $1/gi;
and my understanding is as follows:
The first ".*?" part removes everything up to the first matching RE
The "(_[a-z0-9]+)" matches any letter/number combination that starts
with an underscore [sidenote: yes, I know: \w+]
The final ".*?" removes everything up to the next match, or up to
the end of the string.

Here's how I have the RE in a program
$_=(<>);
s/.*?(_[a-z0-9]+).*?/ $1/gi;
print "Have: $_";

and here's how I run it:
echo "word1 word2 _word3 word4 word5 _word6 _word7 word8" | perl s.pl

and here's the output I get:
Have:  _word3 _word6 _word7 word8

Question: Why didn't "word8" get eaten like all its predecessors? And
what do I have to do to match it for removal?

If you have time, I'm looking for enlightenment more than solutions. I
am obviously missing something crucial, but all the online tutorials
I've found stop short of explaining this sort of thing.

The problem is that once you've matched a
target substring, i.e., _[a-z0-9]+, the
trailing .*? stops as soon as possible, since
.*? says match any character (except newline)
0 or more times minimally (also termed lazily).
Nothing follows it in the pattern, so it
chooses 0 characters and the match completes.

That works for the earlier words, because the
leading .*? of the next match consumes them.
But after the final target, _word7, no further
_[a-z0-9]+ can be found, so no more matches
occur, the rest of the string is never consumed,
and you're left with ' word8'.
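
To see this happen, here is a minimal sketch that lists what each pass of the original pattern consumes. The helper name trace_matches and the extra outer parentheses are additions for illustration only; the pattern inside them is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a description
# of the text consumed by each successive /g match of the original pattern.
sub trace_matches {
    my ($s) = @_;
    my @seen;
    # Outer parentheses added only so the whole match is available as $1;
    # $2 is the underscore word that the substitution would keep.
    while ($s =~ /(.*?(_[a-z0-9]+).*?)/gi) {
        push @seen, "[$1] -> keeps [$2]";
    }
    return @seen;
}

my $input = "word1 word2 _word3 word4 word5 _word6 _word7 word8\n";
print "$_\n" for trace_matches($input);
```

Three matches are reported, the last one ending at _word7. The text ' word8' is never part of any match, so the substitution leaves it alone.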

One way to fix that:

s/ .*? # match minimally
( _[a-z0-9]+ | $ ) # up to target or eol
/ $1/gix;

Now, once _word7 has been matched, the next
attempt tries to match one of two alternatives:

Either: _[a-z0-9]+
or: end-of-line

The former isn't found but the latter is, so
the rest of the string is consumed up to the
end-of-line just before \n.
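
As a rough check, the sketch below applies the alternation fix to the sample line. The ' $1' replacement still leaves stray whitespace (a leading space, and at least one space where ' word8' was), so the sketch trims it afterwards; that trimming step and the sub names are my own additions, not part of the fix above. It also shows a different technique, collecting the underscore words with a list-context match and joining them, which avoids the leftover-text problem altogether.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub keep_with_subst {
    my ($s) = @_;
    $s =~ s/ .*?                  # match minimally
             ( _[a-z0-9]+ | $ )   # up to target or eol
           / $1/gix;
    $s =~ s/^\s+|\s+$//g;         # added: trim stray leading/trailing whitespace
    return $s;
}

# Alternative technique (not the substitution approach): in list context,
# a /g match returns every captured word, which can then be joined.
sub keep_with_match {
    my ($s) = @_;
    return join ' ', $s =~ /(_[a-z0-9]+)/gi;
}

my $input = "word1 word2 _word3 word4 word5 _word6 _word7 word8\n";
print "subst: '", keep_with_subst($input), "'\n";
print "match: '", keep_with_match($input), "'\n";
```

Both calls should yield _word3 _word6 _word7 for the sample input. The match-and-join version states the intent (keep these words) directly instead of describing what to delete.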

--
Charles DeRykus