Re: Regular Expression - old regex module vs. re module



In article <1151707747.611123.65800@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
Steve <stever@xxxxxxxxxx> wrote:
Hi All!

Thanks for your suggestions and comments! I was able to use some of
your code and suggestions and have come up with this new version of
Report.py.

Here's the updated code :

exponentPattern = re.compile('\(^\|[^\\#]\)|#+\.#+\*\*\*\*')
floatPattern = re.compile('\(^\|[^\\#]\)|#+\.#+')
integerPattern = re.compile("\(^\|[^\\#]\)|\##+")
leftJustifiedStringPattern = re.compile('\(^\|[^\\<]\)|\<<+')
rightJustifiedStringPattern = re.compile('\(^\|[^\\>]\)|\>>+')


Some comments and suggestions

If you want to include backslashes in a string, either
use raw strings, or use double backslashes (raw strings are much
easier to read). Otherwise, you have an accident waiting to happen -
'\(' _does_ make a two character string as you desired, but '\v' makes
a one character string, which is unlikely to be what you wanted.

That's a stylistic point. But more serious is the leading part of all
your regexes - '\(^\|[^\\#]\)|'. I'm not sure what you're trying to
accomplish - presumably to skip over escaped format characters, but
that's not what they do.

\( says to look for an open parenthesis (you've escaped it, so it's
not a grouping character. ^ says look at the start of the line. This
means the first character can never match an open parens, so this
entire term, up to the alternate expression (after the non-escaped
pipe symbol) never matches anything. If you want to ignore escaped
formating characters before a format, then you should use a negative
lookbehind assertation (see the library reference, 4.2.1, Regular
Expression Syntax:

'(?<!\\)'

This says that the match to the format can't start
immediately after a backslash. You need to make a final
pass over your format to remove the extra backslashes, otherwise they
will appear in the output, which you do, but you replace them with
spaces, which may not be the right thing to do - how could you output
a line of like '################' as part of a template?

Other odds and ends comments interleaved here

###################################################################
# _make_printf #
###################################################################

def _make_printf(s):
"""Convert perl style format symbology to printf tokens.

Take a string and substitute computed printf tokens for perl style
format symbology.

For example:

###.## yields %6.2f
######## yields %8d
<<<<< yields %-5s
"""
# print("Original String = %s\n\n") % (s)


while 1: # process all sci notation fields
if exponentPattern.search(s) < 0: break
i1 , i2 = exponentPattern.search(s).span()
width_total = i2 - i1
field = s[i1:i2-4]
width_mantissa = len( field[string.find(field,'.')+1:] )
f = '%'+`width_total`+'.'+`width_mantissa`+'e'
s = exponentPattern.sub(f, s, 1)

There are better ways to examine a match than with span() - consider
using grouping in your regex to get the mantissa width and the total
width, use a regex with grouping like this:

If:

exponentPattern = re.compile(r'(?<!\\)(#+\.(#+)\*\*\*\*'))

then you could do this:
m = re.match(exponentPattern, s)
if m:
s = exponentPattern.sub("%%%d.%de" % (len(m.groups()[0],
len(m.groups()[1]), s, 1)

m.groups()[0] will be the entire '#+\.#\*\*\'*\*') match, in other
words the field width to be printed

m.groups()[1] will be the string after the decimal point, not
inclouding the '*'s

In my opinion, building the string by using the sprintf like '%'
formatting operator, rather than adding together a bunch of substrings
is easier to read and maintain.

Similar use of grouping can be done for the other format string types.


s = re.sub('\\\\', ' ', s)
return s

As I noted, should backslashes be converted to spaces? And again, it's
easier to type and read if it uses raw strings:

s = re.sub(r'\\', ' ', s)


###################################################################
# ReportTemplate #
###################################################################

class ReportTemplate:
"""Provide a print formatting object.
[Snip]
The total width of the symbol and it's decimal point position is
used to compute the appropriate printf token; see 'make_printf'
method. The symbol must have at least two adjacent characters for

A minor nit - make_printf is better described as a function, not a
method (it's not part of the ReportTemplate class or any other cleass)

[SNIP]

def __init__( self, template = ''):
self.body = []
self.vars = []

#read in and parse a format template
try:
tpl = open(template, 'r')
lines = string.split(tpl.read(), '\n')[:-1]

You'd be better off to use something like:

lines = []
for l in open(template, 'r').readlines():
lines.append(l.rstrip)

The use of rstrip discards any trailing whitespace and deals with
reading Windows generated files on a Unix box, where lines will end in
CRLF and you'd strip only the LF

except IOError:
lines = string.split(template, '\n')

I have my doubts about the advisability of assuming that you are
either passed the name of a file containing a template or a
template itself. A misspelled file name won't raise
an error, it will simply be processed as a fixed output. I would have
passed a flag to say if the template argument was file name or a
template and terminated with an error if it was a non-existant file.

[SNIP]

def _format( self, dataobj ):
"""Return the values of the given data object substituted into
the template format stored in this object.
"""
# value[] is a list of lists of values from the dataobj
# body[] is the list of strings with %tokens to print
# if value[i] == None just print the string without the %
argument
s = ''
value = []

for i in range(self.nrows):
value.append([])

for i in range(self.nrows):
for vname in self.vars[i]:
try:
if string.find(vname, '[') < 0:
# this is the nominal case and a simple get
will be faster
value[i].append(getattr(dataobj, vname))
else:
# I use eval so that I can support sequence
values
# although it's slow.
value[i].append(eval('dataobj.'+vname))
except AttributeError, SyntaxError:
value[i].append('')

There's another way to do this - use getattr to retrieve the sequence,
then use __getitem__ to index it

Something like this would work, again using a regex to separate out
the index (you might want the regex compiled once at __init__
time). The regex looks for a run of characters valid in a python
variable name followed by an optional integer (with or without sign)
index in square brackets. m.groups()[0] is the variable name portion,
m.grousp()[1] is the index if there's a subscripting term.

try:
m = re.match(r'([a-zA-Z_][a-zA-Z0-9._]*)\s*(?:\[\s*([+-]?\d+)\s*\])?\s*',
value[i])
if not m: raise SyntaxError
if m.groups()[1] is None:
value[i].append(getattr(dataobj, vname))
else:
value[i].append(getattr(m.groups()[0]).\
__getitem__(int(m.groups()[1])))
except AttributeError, SyntaxError, IndexError:
value[i].append('')

This is a bit ugly, but avoids eval with all the dangers it carries -
a deliberately hacked template file can be used to do a lot of damage,
a badly written one could be hard to debug

if value[i][0] != '':
try:
temp_vals = []
for item in value[i]:
# items on the list of values for this line
# can be either literals or lists
if type(item) == ListType:

Might you be better off asking if item has a __getitem__? It would
then work with tuples and Extending to dictionaries would then be easier

def isReportTemplate(obj):
"""Return 1 if obj is an instance of class ReportTemplate.
"""
if type(obj) == InstanceType and \
string.find(`obj.__class__` , ' ReportTemplate ') > -1:
return 1
else:
return 0

Why not just use isinstance(obj, classname)?

###################################################################
# ColumnReportTemplate #
###################################################################

class ColumnReportTemplate:
"""This class allows one to specify column oriented output formats.



--
Jim Segrave (jes@xxxxxxxxxxxxxx)

.



Relevant Pages

  • Proposal: String::Format::General
    ... It provides format string parsing and output assembly, you provide the code that implements the individual conversion characters. ... Format syntax is kind of a cross between sprintf and strftime, but how close it is to each of these depends on the semantics implemented by the user. ... Note that the following is pre-alpha documentation; the interface to the output conversion code has changed since yesterday, ... conversion character, and contain a number of optional fields which may ...
    (comp.lang.perl.modules)
  • Re: Regex not working as expected when doing multiple checks and updates
    ... > the reports I use for the emails from one system and place on my test node. ... If you want to translate one character to another character ... Your regular expression is anchored at the beginning of the string but it will ... the format string will not work properly if there are format meta-characters ...
    (perl.beginners)
  • Re: Tilde-slash directive with no arguments?
    ... The simplest solution is to just put a tab into the string. ... other languages, Lisp strings can contain any legal character, including ... So just put a tab character directly into the format string: ...
    (comp.lang.lisp)
  • Re: converting a string to a list
    ... format of the input data, ... languages is that you can treat the compiler as a black box, ... I input a string, the ... text character by character, or line by line. ...
    (comp.lang.lisp)
  • Re: How do I stop FrontPage Auto Format my codes?
    ... To find the best Newsgroup for FrontPage support see: ... | i have a string that looks like this ... ... | of text in string format, ... |> FrontPage does sometimes change character representations from one form to ...
    (microsoft.public.frontpage.client)