Re: text processing problem
- From: "Paul McGuire" <ptmcg@xxxxxxxxxxxxx>
- Date: 7 Apr 2005 19:32:50 -0700
Maurice -
Here is a pyparsing treatment of your problem. It is certainly more
verbose, but hopefully easier to follow and later maintain (modifying
valid word characters, for instance). pyparsing implicitly ignores
whitespace, so tabs and newlines within the expression are easily
skipped, without cluttering up the expression definition. The example
also shows how to *not* match "<X> (<X>)" if inside a quoted string (in
case this becomes a requirement).
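As a small illustration of the whitespace point, here is a minimal sketch using a cut-down grammar. The names pair and samples are made up for this example and are not part of the solution in this post. All three inputs should parse to the same four tokens, whatever spaces, tabs or newlines sit between them.

```python
from pyparsing import Literal, Word, alphas

# Hypothetical mini-grammar for illustration only: word ( word )
pair = Word(alphas) + Literal("(") + Word(alphas) + Literal(")")

# The same tokens with no whitespace, a single space, and tabs/newlines
samples = ["is(is)", "is (is)", "is\t\t(\n  is\n)"]

# pyparsing skips leading whitespace before each element of the expression
results = [pair.parseString(s).asList() for s in samples]

for s, r in zip(samples, results):
    print(repr(s), "->", r)
```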
Download pyparsing at http://pyparsing.sourceforge.net.
-- Paul
from pyparsing import *
LPAR = Literal("(")
RPAR = Literal(")")
# define a word as beginning with an alphabetic character followed by
# zero or more alphanumerics, -, _, ., or $ characters
word = Word(alphas, alphanums+"-_$.")
targetExpr = word.setResultsName("first") + \
             LPAR + word.setResultsName("second") + RPAR
# this will match any 'word ( word )' arrangement, but we want to
# reject matches if the two words aren't the same
def matchWords(s,l,tokens):
    if tokens.first != tokens.second:
        raise ParseException(s,l,"")
    # keep only the first word, dropping the parenthesized duplicate
    return tokens[0]
targetExpr.setParseAction( matchWords )
testdata = """
This is (is) a match.
This is (isn't) a match.
I.B.M.\t\t\t(I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring( Spring ).
"""
print(testdata)
print(targetExpr.transformString(testdata))
print("\nNow don't process ()'s inside quoted strings...")
# skip over quoted strings so that "the(the)" is left untouched
targetExpr.ignore(quotedString)
print(targetExpr.transformString(testdata))
Prints out:
This is (is) a match.
This is (isn't) a match.
I.B.M. (I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring( Spring ).
This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the" Spring.
Now don't process ()'s inside quoted strings...
This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the(the)" Spring.
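For comparison, the basic "word (word)" collapse can also be approximated with a backreference regular expression. This is only a sketch under my own assumptions (the pattern and the names WORD, DUP and collapse are hypothetical, and it is not the code posted earlier in the thread). Unlike the pyparsing version with ignore(quotedString), it has no way to skip matches inside quoted strings.

```python
import re

# Assumed word definition: a letter, then letters, digits, -, _, ., or $
WORD = r"[A-Za-z][A-Za-z0-9_$.-]*"

# The lookbehind keeps a match from starting in the middle of a word;
# \1 requires the parenthesized word to equal the first one.
DUP = re.compile(r"(?<![A-Za-z0-9_$.-])(" + WORD + r")\s*\(\s*\1\s*\)")

def collapse(text):
    """Replace every 'word ( word )' occurrence with just 'word'."""
    return DUP.sub(r"\1", text)

print(collapse("This is (is) a match."))
print(collapse("This is (isn't) a match."))
print(collapse('Paris in "the(the)" Spring( Spring ).'))
```

The regular expression is shorter, but the word definition appears twice and any extra rule (such as leaving quoted strings alone) has to be worked into the pattern by hand.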