spell checking...capitalization of proper names
- From: JackL <jbl02NO@xxxxxxxxxxxxxx>
- Date: Sat, 28 Jul 2007 06:54:48 -0500
I am looking for an efficient way to check the capitalization of
proper names. I currently use a Perl script with quite a lot of
regexes, and it works fine, but rather than reinvent the wheel, I
thought there was probably something already out there, or maybe just
a better method of doing it. I run the script every day and find that
I am adding 15-30 new words to the regex list each time.
Currently the script has 1500 regexes looking for individual words or
2-3 word phrases (New York City, for example):
$data =~ s/[ ][Nn]ew([ ]*)(\[\d+:\d+:\d+][ ]*)*[Yy]ork([ ]*)(\[\d+:\d+:\d+][ ]*)*[Cc]ity/ New$1$2York$3$4City/msg;
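A minimal sketch of how a phrase rule like the one above might be generated from the phrase itself rather than written by hand. The helper name fix_phrase is hypothetical, and the gap pattern is an assumption: it allows any whitespace (including a line break) plus optional [hh:mm:ss] timestamps between words, and it uses lookarounds instead of requiring a leading space, so it is not an exact copy of the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: capitalize one multi-word phrase wherever it occurs,
# even when timestamps or a line break sit between its words.
sub fix_phrase {
    my ($text, $phrase) = @_;
    my @words = split ' ', $phrase;                  # e.g. ('Walter', 'Reed')
    my $gap   = qr/\s+(?:\[\d+:\d+:\d+\]\s*)*/;      # whitespace, then optional timestamps
    my $body  = join $gap, map { quotemeta } @words; # word, gap, word, ...
    $text =~ s{(?<![A-Za-z])($body)(?![A-Za-z])}{
        my $match = $1;
        my @gaps  = $match =~ /($gap)/g;             # keep the original gap text as-is
        join '', map { $words[$_] . ($gaps[$_] // '') } 0 .. $#words;
    }gie;
    return $text;
}

my $sample = "[17:16:35] treatment at walter\n[17:16:36] reed army medical\n";
print fix_phrase($sample, 'Walter Reed');
```

With something like this, the phrases could live in a plain list and be applied in a loop, so adding a phrase would not mean writing a new regex.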
The text is a television newscast closed-captioning script formatted
like the lines below. It is about 70% local Kansas City, MO news and
30% national US news.
<snip>
[17:16:32] Since the first gulf war --
[17:16:33] Cliff Standby has been
[17:16:35] receiving knee treatment at Walter
[17:16:36] Reed Army Medical
[17:16:38] Center. And within the last
<snip>
The goofy-looking first regex below looks for 'walter reed' either
on one line or split across two lines. The rest simply check for a
certain word and capitalize it.
$data is the text of the entire file. The files average 500-800 lines
of text.
<snip>
$data =~ s/[ ][Ww]alter([ ]*)(\[\d+:\d+:\d+][ ]*)*[Rr]eed/ Walter$1$2Reed/msg;
$data =~ s/[ ]ward([ .!?:,;\'-])/ Ward$1/msg;
$data =~ s/[ ]warner([ .!?:,;\'-])/ Warner$1/msg;
$data =~ s/[ ]warren([ .!?:,;\'-])/ Warren$1/msg;
<snip>
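One possible way to collapse the single-word rules into a single pass is a hash lookup driven by a word list. This is only a sketch under assumptions: the names @names, %proper and fix_words are hypothetical, the list is hard-coded here although it could be read from a file, and lookarounds replace the captured trailing character used in the rules above.

```perl
use strict;
use warnings;

# Hypothetical word list; each entry is the correctly capitalized form.
my @names  = qw(Ward Warner Warren);
my %proper = map { lc($_) => $_ } @names;    # 'ward' => 'Ward', ...

# One alternation built from the list replaces one regex per word.
my $alt = join '|', map { quotemeta } keys %proper;

sub fix_words {
    my ($text) = @_;
    # Match a lowercase name preceded by a space and followed by a space or
    # one of the punctuation marks used in the original rules.
    $text =~ s/(?<=[ ])($alt)(?=[ .!?:,;'-])/$proper{$1}/g;
    return $text;
}

print fix_words("[17:16:33] the ward, and warren met warner. the warden left\n");
```

Adding a new name would then mean adding one entry to the list rather than another substitution line.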
I have just begun to look at
http://search.cpan.org/~hank/Text-Aspell/Aspell.pm
It appears to take a string one word at a time, check whether the word
is correct or incorrect, and then suggest replacements. Maybe I am not
grasping its capabilities.
I would appreciate any suggestions.
jbl