problem with splitting on "words"

From: Charlotte Hee (chee_at_slac.stanford.edu)
Date: 07/30/04


Date: Fri, 30 Jul 2004 09:59:17 -0700 (PDT)
To: beginners@perl.org


Hello All,

I am having trouble splitting the titles in a list of research papers
into words. I thought I could split each title into words like so:

  #!/usr/local/bin/perl
  use strict;
  use warnings;
  use locale;

  my %forums = ( 1 => 'B0->K+K-Ks',
                 2 => 'B+->K+KsKs Decays',
                 3 => 'Measurement of the Total Width',
                 4 => 'Asymmetries in B0->K0s pi0 Decays'
  );

  foreach my $forum ( sort keys %forums ) {
     my $title = $forums{$forum};
     # split on runs of characters that are neither \w nor '-'
     foreach my $w (split /[^\w-]+/, $title) {
        next unless ($w =~ /^[A-Za-z]/);   # keep only words that start with a letter
        $title =~ /\b\Q$w\E\b/;            # note: the result of this match is not used
        print "Journal $forum indexed word = " . ucfirst($w) . "\n";
     }
  }

  exit;

But the results show that I'm losing some characters:

Journal 1 indexed word = B0- # this should be B0->
Journal 1 indexed word = K # what happened to the '+'?
Journal 1 indexed word = K-Ks

Journal 2 indexed word = B # '+->' missing
Journal 2 indexed word = K # '+' missing
Journal 2 indexed word = KsKs
Journal 2 indexed word = Decays

Journal 3 indexed word = Measurement
Journal 3 indexed word = Of
Journal 3 indexed word = The
Journal 3 indexed word = Total
Journal 3 indexed word = Width

Journal 4 indexed word = Asymmetries
Journal 4 indexed word = In
Journal 4 indexed word = B0- # should be 'B0->'
Journal 4 indexed word = K0s
Journal 4 indexed word = Pi0
Journal 4 indexed word = Decays

These are only example titles, but the other titles contain similar
characters as part of a "word". I tried adding '-' and '>' to my character
class, but that did not work. What am I doing wrong here?

thanks, Chee
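
A note on the behaviour above: split throws away whatever its pattern matches, and /[^\w-]+/ matches '+' and '>' (neither is a \w character nor a hyphen), so those characters are treated as separators and never reach the output. One possible direction is to match the words instead of splitting on the gaps. The following is only a minimal sketch, assuming that a trailing run of '+', '-' or '>' should stay attached to the word before it; the helper name title_words is hypothetical and does not come from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the "words" of a
# title, keeping any run of '+', '-' or '>' attached to the word before it.
sub title_words {
    my ($title) = @_;
    # [\w-]+  : one or more word characters or hyphens
    # [-+>]*  : optional trailing run of '-', '+' or '>'
    my @words = $title =~ /([\w-]+[-+>]*)/g;
    # Keep only tokens that start with an ASCII letter, as the original loop does.
    return grep { /^[A-Za-z]/ } @words;
}

for my $title ('B0->K+K-Ks', 'B+->K+KsKs Decays') {
    print join(' | ', title_words($title)), "\n";
}
# prints:
# B0-> | K+ | K-Ks
# B+-> | K+ | KsKs | Decays
```

Whether 'B0->K+K-Ks' should be one word or three depends on how the index is meant to be searched; if a whole decay chain should stay together, splitting on whitespace alone (split ' ', $title) may be the simpler choice.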