Re: help with a regex and greediness
From: Stuart White (poovite_at_yahoo.com)
Date: Sat, 13 Mar 2004 08:44:40 -0800 (PST)
To: "R. Joseph Newton" <firstname.lastname@example.org>
--- "R. Joseph Newton" <email@example.com> wrote:
> Stuart White wrote:
> > Wow, yeah that helps a lot.
> > Here's a question: If I had:
> > $line = 'Spurs 94, Suns 82, Heat 99, Magic 74';
> > and then did a split on the comma and any spaces
> > around it:
> > @result = split (/\s*,\s*/, $line);
> result? How specific is "result" to the issue at
> hand? Would not $score
This is me just trying to get an understanding of
split. In Learning Perl, there is a good example of
split on a line that is separated by colons. split
puts all the stuff that isn't a colon into an array,
one element at a time.
@fields = split(/:/, $line);
#now @fields is ("merlyn:,
I was having trouble using this example to figure out
why my line wasn't splitting the way I wanted. So I
included this line:
$line = 'Spurs 94, Suns 82, Heat 99, Magic 74'
> > then @result would look like this, right?
> > @result = 'Spurs 94'
> You should know better than this by now, with the
> help you've been getting.
> With that @ symbol, you are referring to a slice--an
> array of one element. ***
> When you are referring to a scalar, use the scalar
> symbol $ ***
Yes, that was an error on my part. You're right, I
know that it should have been $result
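
A minimal sketch of the sigil distinction discussed here, assuming the sample line from earlier in the thread. The names $first and $count are only illustrative. Assigning a single string to @result would store a one-element list in the whole array; reading one element back out uses the $ sigil with an index.

```perl
use strict;
use warnings;

my $line = 'Spurs 94, Suns 82, Heat 99, Magic 74';

# split returns a list, so the whole result goes into an array (@).
my @result = split /\s*,\s*/, $line;

# One element of that array is a scalar, so it is written with $.
my $first = $result[0];

# An array in scalar context gives the number of elements.
my $count = scalar @result;

print "first clause: $first\n";
print "number of clauses: $count\n";
```

Here @result holds all four team-score clauses at once, while $result[0] is just the first of them.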
> > @result = 'Suns 82'
> These first two make sense, pretty much. I think
> this is one place where $team1
> and $team2 might be more sensible, though it is even
> better, if there is some
> order to which team is listed first in the pairing,
> to have your identifier
> reflect that order, say $home_team and $visitor [if
> these are accurate of
I'll have to study the data to make sure that the home
team and visiting teams are consistently in the same
place. That's a good idea too.
> > @result = 'Heat 99'
> Going on to load more elements into the array does
> not make sense. Does your
> data come in one continuous line, just a long string
> of team names separated by
My impression is that it came line by line.
> There would be no sense in
> doing the work of the split only to throw everything
> back in the same pile.
> There are a lot of different things you could do
> here, but the sensible ones
> would indicate that you should do something with the
> stats for each pairing
> before you go on to the next line.
> > @result = 'Magic 74'
> > If I wanted to split on the numbers as well, why
> > doesn't this work:
> > @result = split (/\s*\d*,\s*\d*/, $line);
> The previous post already explained this, and you
> have seen the result of what
> you are trying. You can't do that because the
> information disappears if you use
> it in the split expression.
I see. It seems that I didn't pick up on this
entirely, though I do remember reading it.
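
A small sketch of why the scores vanish, assuming the same sample line. The name @names_only is illustrative. Whatever the split pattern matches is treated as the delimiter and dropped, so digits matched by the pattern never reach the result.

```perl
use strict;
use warnings;

my $line = 'Spurs 94, Suns 82, Heat 99, Magic 74';

# The pattern matches " 94, ", " 82, " and " 99, " as delimiters,
# so those scores are discarded along with the commas.
my @names_only = split /\s*\d*,\s*\d*/, $line;

# The last clause has no comma after it, so the pattern never
# matches there and 'Magic 74' keeps its score.
print join('|', @names_only), "\n";   # Spurs|Suns|Heat|Magic 74
```

Only the final clause survives intact, because the pattern requires a comma and there is none after the last score.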
> Splitting the lines into a pair of team-score
> combinations is one step. It
> deserves a line of its own.
> Extracting the name and score from each team-score
> clause is another step that
> deserves a line or three of its own.
Ok, I didn't know this. I thought I could, and should
do it all in one or two lines. I get confused about
what data $_ has sometimes. After I run the initial
regex, I am usually extracting information from the
backreferences. When those backreferences or $_
contain more info than I want, my solution is to
tighten the original regex. You are suggesting that
instead of that, I ought to just run a second regex on
it, or a split on it in order to take the stress off
of Perl and keep the program efficient, right? Is
that what you are suggesting?
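
The two-step approach described above might look roughly like this. It is only a sketch: the hash name %score_for and the assumption that each clause is a team name followed by whitespace and a numeric score are mine, not taken from the real data.

```perl
use strict;
use warnings;

my $line = 'Spurs 94, Suns 82';

# Step 1: split the line into team-score clauses.
my @clauses = split /,\s*/, $line;

# Step 2: pull the name and score out of each clause with a
# small, separate regex.
my %score_for;
for my $clause (@clauses) {
    if ($clause =~ /^(.+?)\s+(\d+)$/) {
        $score_for{$1} = $2;
    }
}

print "$_: $score_for{$_}\n" for sort keys %score_for;
```

Each step does one job, and neither regex has to describe the whole line at once.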
> > I just had a thought, it would have to look more like:
> > @result = split (/(\s*|\d*),\s*\d*/, $line);
> Unless there is a compelling reason why you must do
> all your regex work for a
> line in one pass, you are better off not doing so.
> Though its Perl implementation is highly efficient,
> the regex process is very
> costly, and the cost rises much more through
> complexity of expression than
> through multiple runs.
Ok, see, I thought that the program would run much
more slowly if I kept running through the data. I
didn't think to use regular expressions in steps.
> Please review
> perldoc -f split
> for a better understanding. The split regex is
> *what gets thrown away*. Do
> not put any data you may need in it.
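
One detail from perldoc -f split is relevant to the parenthesized attempt quoted above: when the pattern contains capturing parentheses, the captured delimiter text is returned as extra fields instead of being thrown away. A brief sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $line = 'Spurs 94, Suns 82';

# No capture group: the delimiter is discarded.
my @plain = split /,\s*/, $line;

# With a capture group: the matched delimiter comes back as a field.
my @kept = split /(,\s*)/, $line;

print scalar(@plain), " fields without capture\n";
print scalar(@kept),  " fields with capture\n";
```

This is why adding parentheses to a split pattern changes the shape of the result rather than simply grouping alternatives.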
The formatting of perldoc from the command line makes
it terribly difficult to read for me. There are huge
tabs between words. Is perldoc available in another
format, say the web? I've gone to the man pages at
perl.org, but I haven't found the equivalent of
perldoc. Perhaps I've passed over it? Having perldoc
in a format such that there is one space between each
word, two between each sentence and either an empty
line or a new line and a tab between paragraphs, would
make it much easier to read. I just have not
found it in such form.
> I think an earlier poster may have confused the
> issue with the zero-or-more
> spaces before the comma. Unless the file format is
> very sloppy, this should not
> be necessary. Assume decent data,
> split /,\s*/, $line;
> should split a line into its comma delimited
> elements. No reason to try to get
> fancy here. Just split on the comma to get two
The file format is not sloppy at all. I was just
confused as to why I couldn't use split on the first
space, score, comma and space, and then the next space
and score. Perhaps I just answered my question right
there, seeing that in the second iteration, there is
no comma and then space.
> Keep it grounded--by choosing identifiers carefully
> to always communicate
> clearly what information they hold
> Keep it simple--most things are, if you let them be.
> Do one thing per line until you are using all of the
> basic constructs fluently.
> Pay close attention to the nature of each thing you
> are using a variable to
> describe, and make sure the containment class symbol [$,
> @, or %] that you use
> accurately reflects whether you are referring to a
> container, or to an element
> held in the container.
I'm trying. Thanks for the advice.