Re: Help needed for perl rookie
From: Bob Walton (see_sig_at_invalid)
Date: 12/29/04
- Next message: Gerald Meazell: "Newbie question on require and semaphores"
- Previous message: George Cox: "Re: Is zero even or odd?"
- In reply to: GRLCOPM: "Re: Help needed for perl rookie"
- Next in thread: Jim Keenan: "Re: Help needed for perl rookie"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 28 Dec 2004 22:08:53 -0500
GRLCOPM wrote:
>>From: Bob Walton <see_sig@invalid>
...
>>GRLCOPM wrote:
>>
>>
>>>I am new to perl, but so far have had decent success in writing/modifying
>>>code to do what I want to do. However I am stuck trying to modify the
>>>following code. I am sure the solution is quite simple, but I can't
>>>completely figure out what this piece of code does. I think it is just
>>>matching up a data pattern but this is an area I am unfamiliar with.
>>>
>>>All I want to do is change the format of the data file from example #1 to
>>>example #2 and need this section of code to work with the new format. I
>>>would be grateful for any help provided in understanding what this piece of
>>>code does and suggestions on the modification needed.
>>>
>>>If more information or a larger chunk of the code is needed please let me
>>>know and I will provide.
>>>
>>>EXAMPLE #1 - Current format of data file:
>>>0000000050 20041227 0000000003 'my-page.shtml'
>>>0000000054 20041227 0000000004 'another-page.shtml'
>>>0000000020 20041227 0000000003 'yet-another-page.shtml'
>>>
>>>EXAMPLE #2 - New format of data file:
>>>0000000050|20041227|0000000003|my-page.shtml
>>>0000000054|20041227|0000000004|another-page.shtml
>>>0000000020|20041227|0000000003|yet-another-page.shtml
>>
>>Your example #2 is in "pipe-delimited" form -- the best way to
>>split it apart is with the split() function, as in:
>>
>>($acc,$day,$dayacc,$uri)=split /\|/,$line;
>>
>>
>>
>>>if (($acc,$day,$dayacc,$uri) = ($line =~ /^(\d+) (\d+) (\d+) '(\S+)'$/)) {
>>
>>
>>--
>>Bob Walton
>>Email: http://bwalton.com/cgi-bin/emailbob.pl
>
>
> Thanks Bob,
>
> I am familiar with the split function and have been looking for a solution
> that utilizes it, but the line you provided does not seem to work as a
> replacement for the line I included. I have been looking through
Here is an example using split:
use warnings;
use strict;
while (my $line = <DATA>) {
    chomp $line;    # remove newline at end of line
    if (my ($acc, $day, $dayacc, $uri) = split /\|/, $line) {
        print "acc=$acc\nday=$day\ndayacc=$dayacc\nuri=$uri\n";
    }
}
__END__
0000000050|20041227|0000000003|my-page.shtml
0000000054|20041227|0000000004|another-page.shtml
0000000020|20041227|0000000003|yet-another-page.shtml
That generates:
D:\junk>perl junk510.pl
acc=0000000050
day=20041227
dayacc=0000000003
uri=my-page.shtml
acc=0000000054
day=20041227
dayacc=0000000004
uri=another-page.shtml
acc=0000000020
day=20041227
dayacc=0000000003
uri=yet-another-page.shtml
D:\junk>
which seems to me to be what you want. If that isn't what you
want, please describe in full detail exactly what it is you do
want. Note that your statement "does not seem to work" doesn't
convey much information. What *exactly* did it do that you
didn't want it to do? What didn't it do that you did want it to
do? Did it generate any error messages? If so, what *exactly*
(copy/pasted, not retyped) were they?
Also note the use of a simplified example code complete with data
(and lacking unrelated obfuscating details) that illustrates the
point and that anyone can copy/paste/execute. Providing such is
good form in this newsgroup.
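One caveat about the example above: the if test is true for any
non-empty line, even one with fewer than four fields, because it
only checks that split returned something. Here is a minimal
sketch that also checks the field count. The helper name
parse_new_format and the second sample line are made up for
illustration; they are not part of your program.

```perl
use strict;
use warnings;

# Hypothetical helper: split a pipe-delimited record and return its
# four fields, or an empty list if the field count is wrong.
sub parse_new_format {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields, so "a|b|c|" has 4.
    my @fields = split /\|/, $line, -1;
    return @fields == 4 ? @fields : ();
}

for my $line ("0000000050|20041227|0000000003|my-page.shtml\n",
              "this line has no pipes\n") {
    if (my ($acc, $day, $dayacc, $uri) = parse_new_format($line)) {
        print "ok: acc=$acc day=$day dayacc=$dayacc uri=$uri\n";
    }
    else {
        print "skipped: $line";
    }
}
```

Whether a short line should be skipped or reported as an error
depends on your data; the sketch simply skips it.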
> documentation including the references you provided, but I am still having a
> hard time with this. I guess what I am looking for is someone to break down
> what is happening in this line so that I can modify it to work as I need it
> to. Here is the section of code in question.
>
> &LockOpen (COUNT,"$AccessFile");
> $location = tell COUNT;
> while ($line = <COUNT>) {
> if (($acc,$day,$dayacc,$uri) = ($line =~ /^(\d+) (\d+) (\d+) '(\S+)'$/)) {
> if ($uri eq $doc_uri) {
> last;
> }
> }
> last if ($uri eq $doc_uri);
> $location = tell COUNT;
> $acc = 0;
> $dayacc = 0;
> }
>
> And here is the specific line:
>
> if (($acc,$day,$dayacc,$uri) = ($line =~ /^(\d+) (\d+) (\d+) '(\S+)'$/)) {
OK, in detail:
The if(expression){block} statement tests an expression for a
true value. In this case the expression is a list assignment
from a regular expression match; evaluated in scalar context, a
list assignment yields the number of elements produced by its
right-hand side (four if the match succeeds, zero if it fails).
If the value is true, the statements in the block are executed
(in this case, another if statement). Otherwise they are not
executed. In the case of a pattern match, it is a *very* good
idea to test for the success of the pattern match before using
the purported results, as you are doing in this if statement.
Now, the expression executed is:
    ($acc,$day,$dayacc,$uri)=($line=~/^(\d+) (\d+) (\d+) '(\S+)'$/)
The left-hand side of the = is a list of four lvalues, which
will be assigned the first four list elements generated by the
right-hand side. The right-hand side is:
    ($line=~/^(\d+) (\d+) (\d+) '(\S+)'$/)
which has an unneeded set of parens around it, so:
    $line=~/^(\d+) (\d+) (\d+) '(\S+)'$/
which is a pattern-matching expression. The left-hand side of
the =~ matching operator designates the source of the string to
be matched. The right-hand side starts with a / , which
indicates to Perl that it is a shortcut for the "m" operator
using /'s as delimiters. Between the matching /'s then is a
regular expression. This regular expression contains many
metacharacters (characters with special meaning inside regular
expressions). Specifically:
   ^ -> start the match with the first character of the string
      (anchored match)
   (\d+) -> a parenthesized group "captures" the portion of the
      string matched by the contents of the parens. Each capture
      generates another element in the list output by the pattern
      match (so there will be a four-element list generated by
      this regexp if it matches).
      \d+ -> The + metacharacter means the regexp element
         immediately to the left of the + is repeated one or more
         times. So in this case, a "\d" will be repeated one or
         more times.
      \d -> This is a shortcut code for "any digit" (or, in other
         words, the character class [0-9]). It matches any single
         digit.
      Thus, we see that \d+ matches any string of one or more
      digits. And (\d+) captures that string of one or more
      digits on the output list.
   space character -> the space character is not a
      metacharacter, and is matched literally. Since it is not
      inside of capturing parentheses, it is not output on the
      output list.
   Three occurrences of "(\d+) " occur, which will match three
   strings of digits, each followed by a space character, and
   capture the three strings of digits in the output list.
   ' -> the apostrophe is not a metacharacter, so it is matched
      literally. It is not captured.
   (\S+) -> captures the results of \S repeated one or more
      times. \S is a shortcut code for any non-whitespace
      character.
   ' -> is a literal apostrophe
   $ -> anchors the trailing end of the match at the end of the
      string. In other words, if the string isn't exhausted at
      the point where the $ metacharacter occurs, the match will
      backtrack and try an alternative or fail if the
      alternatives are exhausted. By default, a trailing newline
      (like what you've got with your data) is permitted on the
      end of the string -- the match will succeed if everything
      up to the newline has been matched.
So in English, your regexp will match a string that starts with
three repetitions of a string of digits followed by a single
space character, followed by ' followed by any string of
non-whitespace characters followed by ' followed by the end of
the string. The three strings of digits and the string of
non-whitespace characters will be captured and, upon match
success, will be returned as the output list of the =~ match
operator. They are also, BTW, available in the special variables
$1, $2, $3 and $4, and various pieces of the match may be
assigned to other builtin variables such as $', $`, $&, @+, @-,
etc. See the docs for details, particularly perldoc perlvar.
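To see the list assignment and the captures in action, here is a
minimal sketch. Only the regexp comes from your code; the helper
name parse_old_format and the non-matching sample line are made
up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the four captured fields, or an empty
# list if the line does not match the old space-and-quote format.
sub parse_old_format {
    my ($line) = @_;
    return $line =~ /^(\d+) (\d+) (\d+) '(\S+)'$/;
}

for my $line ("0000000050 20041227 0000000003 'my-page.shtml'\n",
              "this line does not match\n") {
    # A list assignment in scalar context yields the number of
    # elements on its right-hand side: 4 on a match, 0 on failure.
    my $count = (my ($acc, $day, $dayacc, $uri) = parse_old_format($line));
    if ($count) {
        print "matched $count fields: ",
              "acc=$acc day=$day dayacc=$dayacc uri=$uri\n";
    }
    else {
        print "no match\n";
    }
}
```

Note that the first sample line still carries its newline; the $
anchor permits it, so no chomp is needed for the match to succeed.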
>
> It reads the data file that is in this format:
>
> EXAMPLE #1 - Current format of data file:
> 0000000050 20041227 0000000003 'my-page.shtml'
> 0000000054 20041227 0000000004 'another-page.shtml'
> 0000000020 20041227 0000000003 'yet-another-page.shtml'
>
> I need it to perform the same function on a data file in this format:
>
> EXAMPLE #2 - New format of data file:
> 0000000050|20041227|0000000003|my-page.shtml
> 0000000054|20041227|0000000004|another-page.shtml
> 0000000020|20041227|0000000003|yet-another-page.shtml
>
If you insist on a regexp to match the above, try:
if(($acc,$day,$dayacc,$uri)=
($line=~/^(\d+)\|(\d+)\|(\d+)\|(\S+)$/)) {
Note that | is a regexp metacharacter and thus literal instances
of it must be escaped with the \ metacharacter or equivalent.
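To illustrate why the escape matters, here is a small sketch
using one line of your data (the variable names are only for
illustration). An unescaped | is an alternation between two empty
patterns, which matches the empty string, so split breaks the
line into single characters instead of fields.

```perl
use strict;
use warnings;

my $line = "0000000050|20041227|0000000003|my-page.shtml";

# Escaped: | is a literal pipe, so we get the four fields.
my @good = split /\|/, $line;

# Unescaped: | means "empty pattern or empty pattern", which matches
# between every pair of characters, so we get one piece per character.
my @bad = split /|/, $line;

print scalar(@good), " fields with /\\|/\n";
print scalar(@bad), " pieces with /|/\n";
```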
> Based on the way this program works, my guess is that $uri is being compared
> with the data inside the quotes '(\S+)' taken from the current line of the
> data file. Right?
Yes, if the match succeeds.
>
> I appreciate your help and any further advice you or anyone else can offer.
My advice is: read and study the documentation that is already
on your computer. It is wonderful stuff, and is where all the
answers may be found. And found more quickly than asking on a
newsgroup, where folks are generally not too willing to
regurgitate the docs in specific detail.
>
> - Patrick
>
HTH.
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl