Re: Teach me how to fish, regexp
From: Martien Verbruggen (mgjv_at_tradingpost.com.au)
Date: 10/08/03
- Next message: Martien Verbruggen: "Re: GD::Graph: "mixed" graph doesn't recognize "area" graph type"
- Previous message: Glen Hendry: "Segmentation Fault - core dumped. Do I have latest version ?"
- In reply to: Henry: "Re: Teach me how to fish, regexp"
- Next in thread: Henry: "Re: Teach me how to fish, regexp"
- Reply: Henry: "Re: Teach me how to fish, regexp"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 08 Oct 2003 02:12:15 GMT
[rewrapped long lines]
On Tue, 07 Oct 2003 19:12:29 GMT,
Henry <henryn@zzzspacebbs.com> wrote:
> Martien Verbruggen:
>
> Thank you for your response to my post:
>
> in article slrnbo4p6t.pv1.mgjv@verbruggen.comdyn.com.au, Martien Verbruggen
> at mgjv@tradingpost.com.au wrote on 10/7/03 12:01 AM:
>
>> On Tue, 07 Oct 2003 04:34:05 GMT, Henry <henryn@zzzspacebbs.com>
>> wrote: Folks:
>>> Seems the best way to deal with this is to slurp, and use "split"
>>> with the appropriate regexp. Wrinkle: I need to retain the
>>> section numbers in the return strings.
>>>
>> I would probably set the input record separator ($/, see perlvar)
>> to "", which will treat two or more consecutive newlines as the
>> record separator. Then each record starts with the number you're
>> interested in.
> Right, that's what I finally did, in effect. (I did something
> similar at the "split".) But this isn't very robust, I think: it
> depends on some typist somewhere _always_ following the rules.
>
> I think you are saying that slurp mode may not be the best choice.
>
> As far as your setting
>
> $/ = "";
>
> This is not exactly intuitive from the point of view of a newcomer.
> Sorry, could you help me understand (or give me a blind rule of
> thumb) how what looks like setting a variable to an empty string
> implies "two or more successive newlines"?
The perlvar documentation explains what $/ (the input record
separator) does, and that it has a "special" setting of the empty
string, which makes it read "paragraphs", i.e. blocks of text
separated by two or more newlines.
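A minimal sketch of paragraph mode, assuming some made-up sample text
held in a string (the variable names and the data are illustrative,
not taken from the original script):

```perl
use strict;
use warnings;

# Hypothetical sample data: two records separated by blank lines.
my $text = "1.1  First record\nsecond line\n\n\n2  Second record\n";

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "";    # paragraph mode: runs of blank lines end a record
    while (my $record = <$fh>) {
        chomp $record;    # in paragraph mode, chomp strips all trailing newlines
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Using local inside a block restores the previous value of $/ when the
block ends, so other reads in the program are not affected.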
> Thanks for taking all the trouble to explain the components in detail:
>> /
>> ^ # from the beginning of the record
>
> Right.
>
>> ( # start capture
>
> Capture? I guess you mean the mysterious "save the stuff you match"
> mechanisms I've found in some perl references. The explanations
> I've found are very short and not very useful. Also: I find it
> hard to discriminate between parens used for operation grouping and
> this use.
Yes. Capturing parentheses "save" whatever is matched between them,
and return it as a result of the operation, as well as in the numbered
variables $1, $2, etc. At the same time they group multiple
characters together to form a single subpattern.
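A small sketch of both ways to get at the captures, assuming a
made-up record and a simplified pattern (not the one from the earlier
post):

```perl
use strict;
use warnings;

my $record = "12345  Some text for paragraph 1";    # hypothetical record

# In list context, m// returns the captured substrings in order.
my ($number, $rest) = $record =~ /^(\d+)\s+(.*)/;
print "list: [$number] [$rest]\n";

# After a successful match, the same substrings are also in $1 and $2.
if ($record =~ /^(\d+)\s+(.*)/) {
    print "vars: [$1] [$2]\n";
}
```

Only rely on $1 and $2 after checking that the match succeeded;
otherwise they still hold values from an earlier successful match.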
There is more information about this in the perlre documentation, as
well as in the perlop documentation under the entry for
"m/PATTERN/cgimosx".
>> (?: # start grouping, but no capturing
>
> Sorry, could you speak more fully about this? Again, I haven't
> found a good reference for this stuff.
If you only want to group some stuff together in a subpattern, but you
don't want the match of that subpattern returned in one of the digit
variables, or in the return list, you use (?:PATTERN). Again, see the
perlre documentation for a full explanation.
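To make the difference visible, here is a sketch that runs the same
hypothetical pattern twice, once with a capturing inner group and once
with (?:...). The record and the pattern are illustrative assumptions:

```perl
use strict;
use warnings;

my $record = "12345.1  Blah, blah";    # hypothetical record

# Inner group captures: the ".1" shows up as an extra list element.
my @with_capture = $record =~ /^(\d+(\.\d+)?)\s+(.*)/;

# Inner group only groups: the list holds just the two outer captures.
my @without_capture = $record =~ /^(\d+(?:\.\d+)?)\s+(.*)/;

print scalar(@with_capture), " elements: @with_capture\n";
print scalar(@without_capture), " elements: @without_capture\n";
```

In both versions the inner group is still needed so that the ? applies
to the whole ".digits" part rather than to a single character.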
>> .\ \ # literal . followed by two spaces
>
> Sorry, I don't get that. Could you explain more fully? I think that I
> understand that a period, unescaped, matches any character, so I would
> expect that you'd have to escape before the period to match a literal
> period/decimal point.
You're right. My mistake in transcribing the regular expression. There
should be a backslash in front of the dot.
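A quick sketch of what the missing backslash changes, using two
made-up patterns and test strings (these are illustrations, not the
full pattern from the earlier post). Under /x, a literal space has to
be written as "\ ":

```perl
use strict;
use warnings;

my $unescaped = qr/^\d+.\ \ /x;     # . matches any character except newline
my $escaped   = qr/^\d+\.\ \ /x;    # \. matches only a literal period

my %result;
for my $string ("12.  text", "12x  text") {
    $result{$string} = {
        unescaped => ($string =~ $unescaped ? 1 : 0),
        escaped   => ($string =~ $escaped   ? 1 : 0),
    };
    print "'$string': unescaped=$result{$string}{unescaped} ",
          "escaped=$result{$string}{escaped}\n";
}
```

The unescaped version accepts both strings; the escaped one accepts
only the string that really has a period after the digits.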
>> (.*) # capture the rest of the record
>
> I think I understand that
>
> .*
>
> means "any character, repeated 0 or more times", but I don't get how the
> parens lead to capture (and not operation grouping, as above) and eventual
> appearance of the captured data somewhere.
They do both. They group, and as a side effect, the matched subpattern
gets captured and returned (in this case as the second element of the
returned list, as well as in $2).
>> The first capturing set of parentheses returns the paragraph
>> number, including the sub-number, if present, and the second
>> capturing parentheses set returns the "Blah, blah.." bit up to the
>> end of the record.
>
> Right, as I said above, I can't figure out how this aspect works.
> This may seem obvious to you but looks like a hidden (or magical)
> side-effect to me.
The fact that those grouped subpattern matches get returned (and saved
in $1, $2...) is more an effect of the m// operator (documented in
perlop) than of regular expressions themselves. However, they do get
captured in regular expressions, and you can refer back to them (with
\1, \2...) inside of the same regular expression.
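As an illustration of a backreference inside the pattern itself, here
is a sketch that finds a doubled word. The sample sentence is made up:

```perl
use strict;
use warnings;

my $text = "this is is a test";    # hypothetical input with a doubled word

my $repeated;
# (\w+) captures a word; \1 then requires the same text to appear again.
if ($text =~ /\b(\w+)\s+\1\b/) {
    $repeated = $1;    # outside the pattern, the capture is available as $1
    print "repeated word: $repeated\n";
}
```

Inside the regular expression you write \1; once the match is done,
the same captured text is read through $1.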
>> Also see the perlvar and perlre documentation for more information.
>
> My desk and my screen are littered with various references. Thanks for
> pointing out these man "subreferences" -- I had not noticed them
man perl gives a rather complete list of all the various other manual
pages that are available.
>> If two newlines is not a record splitter, and you _have_ to use a
>> minimum of three, this won't work.
>
> Sorry, could you speak more fully about this? Is there a
> restriction I'm not seeing?
If, for example, your text is formatted like:

12345 Some text for paragraph 1

Some more text that belongs in paragraph 1


12345.1 This is the second paragraph

Then setting $/ to "" would read the second part of the first
paragraph as a separate record, since there are two newlines between
the first and second bit. If there is text like that in your
documents, you can't use the first bunch of code (with $/ set to ""),
but you have to use the second bunch of code (with $/ set to "\n\n\n"
or possibly even "\n\n\n\n") and do a bit more work in removing
trailing and leading newlines.
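A rough sketch of the three-newline variant, assuming made-up sample
text in a string. It is not the code from the earlier post, just one
way the extra newline cleanup could look:

```perl
use strict;
use warnings;

# Hypothetical data: a blank line inside the first record, and
# two blank lines (three newlines) between the records.
my $text = "12345 First part\n\nSecond part of the same record\n\n\n"
         . "12345.1 Next record\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n\n\n";    # a literal separator, not paragraph mode
    while (my $record = <$fh>) {
        $record =~ s/\A\n+//;    # strip leading newlines left by longer gaps
        $record =~ s/\n+\z//;    # strip the separator and trailing newlines
        next unless length $record;
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
```

Unlike "", the string "\n\n\n" is matched literally, so a gap of four
newlines leaves one newline at the start of the next record. That is
why the sketch trims both ends.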
>> #!/usr/local/bin/perl use warnings; use strict;
That's not what I posted. The newlines are important.
There are also a perlrequick and a perlretut manual page, which are
more gentle introductions to regular expressions than the perlre
reference documentation. You should probably have a bit of a read of
those.
Furthermore: Don't worry too much that some of this stuff looks
magical. It is. Perl is full of things that you just have to learn
about by immersion, and by repeated visits to the same documentation.
It can take a while before some of this stuff becomes automatic.
Martien
--
|
Martien Verbruggen | Unix is user friendly. It's just selective
Trading Post Australia | about its friends.
|