Re: Non-uniform split



thisismyidentity@xxxxxxxxx wrote:
Hi all,
I am writing a Perl script that should parse each line of a file (which
unfortunately I can't modify) and split the line. The main problem is
that the lines (nearly 10000 of them) are not uniform, so there doesn't
seem to be a pattern or a delimiter on which I can simply split every
line in a loop :(.
Here is an example:
========================
A    B   C      D   E
d32  ab  ae99   WB  89
d33  cd  e787   WC  78
d34  ef         WD
d35  gh  ancjd  WT  100
d36  ij         WP
.
.
========================

My main intention is to extract the values in columns A, B, C, ... into an
array, but since on some lines a value under a column may not be
present, I am unable to find a single regex on which I can split all
lines in a loop. I tried the (obvious) \s+ regex for splitting, but
since the empty columns are just spaces, a particular column ends up at
a different index on different lines. I am especially interested in
two columns that are guaranteed to be non-empty on every line
(like A, B, D), but because of the other empty columns I can't get them
at a fixed index of the array returned by split().

I'm just assuming now that column D is always "W" followed by
another capital letter. For my suggestion to work, you need some
unique criterion for column D that lets you anchor your regex there:

my @fields = $line =~ /(\S+)\s+(\S+)\s+(\S*)\s*(W[A-Z])\s*(\S*)$/;

The first two non-whitespace groups should be self-explanatory.
The third group (and the whitespace following it) might be absent,
so both use the asterisk and can match an empty string. Column D is
always present, so we have an anchor there, and the whitespace and
field that follow it may again match empty strings up to the end of
the line.
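
To see how that behaves on the sample rows, here is a minimal sketch. It
still assumes that column D is always "W" plus one capital letter, and
the helper name parse_line and the hard-coded @lines are made up for
illustration only (in the real script the lines would come from the
file):

```perl
use strict;
use warnings;

# Sample rows from the question (header row and separator lines left out).
my @lines = (
    "d32 ab ae99 WB 89",
    "d33 cd e787 WC 78",
    "d34 ef WD",
    "d35 gh ancjd WT 100",
    "d36 ij WP",
);

# parse_line is a hypothetical helper. In list context it returns the
# five columns A..E, with '' for an absent column, or an empty list if
# the line does not match the pattern at all.
sub parse_line {
    my ($line) = @_;
    return $line =~ /(\S+)\s+(\S+)\s+(\S*)\s*(W[A-Z])\s*(\S*)$/;
}

for my $line (@lines) {
    # Skip lines that do not match (assumption: they are not data rows).
    my @fields = parse_line($line) or next;

    # A, B and D now sit at fixed indices 0, 1 and 3 on every line.
    my ($col_a, $col_b, $col_d) = @fields[0, 1, 3];
    print "A=$col_a B=$col_b D=$col_d\n";
}
```

Because the match always returns five captures, the columns that are
guaranteed to be present land at the same index on every line, which is
what split() on \s+ could not give you.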

HTH
-Chris