Re: how to get all repeated group with regular expression
- From: MRAB <google@xxxxxxxxxxxxxxxxxxx>
- Date: Sat, 22 Nov 2008 15:22:28 +0000
scsoce wrote:
> MRAB wrote:
>> Steve Holden wrote:
>>> Please keep this on the list.
>>>
>>> scsoce wrote:
>>>> Steve Holden wrote:
>>>>> scsoce wrote:
>>>>>> say, when I try to search and match every char from variable length
>>>>>> string, such as string '123456', i tried re.findall( r'(\d)*, '12346' )
>>>>>> , but only get '6' and Python doc indeed say: "If a group is contained
>>>>>> in a part of the pattern that matched multiple times, the last match is
>>>>>> returned."
>>>>>
>>>>> I think you will find you missed a quote out there. Always better to
>>>>> copy and paste ...
>>>>>
>>>>> So use
>>>>>
>>>>> r'(\d*)'
>>>>>
>>>>> instead and then the group includes all the digits you match.
>>>>>
>>>>>> cause the regx engine cannot remember all the past history then ? is it
>>>>>> nature to all regx engine or only to Python ?
>>>>>
>>>>> Different regex engines have different capabilities, so I can't speak to
>>>>> them all. If you wanted *all* the matches of *all* groups, how would you
>>>>> have them returned? As a list? That would make the case where there was
>>>>> only one match much trickier to handle. And what would you do with
>>>>>
>>>>> r'((\w)*\d)*)'
>>>>>
>>>>> Also, what about named groups? I can see enough potential implementation
>>>>> issues that I can perfectly understand why Python works the way it does,
>>>>> so I'd be interested to know why it doesn't make sense to you, and what
>>>>> you would prefer it to do.
>>>>>
>>>>> regards
>>>>> Steve
>>>>
>>>> maybe my expression was not clear. I want to capture every matched part
>>>> in a repeated pattern, not only the last, say, for string '123456', I
>>>> want to back reference any one char, not only the '6'. and i know the
>>>> example is very simple, so we can get the whole string using regx and
>>>> get every char using other python statements, but if the pattern in
>>>> group is complex?
>>>> and I test in VIM, it can do the 'back reference':
>>>> == your text in vim:
>>>> 123456
>>>> == pattern:
>>>> :%s/\(\d\)*/$2
>>>> text will turn to be:
>>>> 2
>>>
>>> 'Fraid the Python re implementers just decided not to do it that way.
>>> Nor Perl.
>>
>> Probably what you want is re.findall(r"(\d)", "123456"), which returns a list of what it captured.
>
> Yes, you are right, but this way findall() captures only the 'top' group. What I really need to do is to capture nested and repeated patterns, say, a <table> tag in html contains many <tr>, each <tr> contains many <td>, and the data in <td> is what I need, so I write the regx like this:
>
> regx ='''
> <table.*\n
> (
> (\s*<tr.*\n
> (\s*<td.*</td>\n|\n)*
> \s*</tr>\n
> |\n)*
> )
> \s*</table>
> '''
>
> Steve Holden wrote:
>> I can see enough potential implementation
>> issues that I can perfectly understand why Python works the way it does,
>> so I'd be interested to know why it doesn't make sense to you, and what
>> you would prefer it to do.
>
> As Steve said, if re really cannot do this kind of work, I have to split the one-line regx up: capture <table> first, then loop to capture <tr>, then <td>, and so on. I do not like this way compared with the one clean regx above.

Why not capture just the "<td>" entries?
If you want to know when it's starting a new table or row then how about:
re.compile(r'(<table\b|<tr\b|<td[^<]*)')
and re.findall() or re.finditer()?
If what was captured starts with "<table" then it's the start of a new table; if what was captured starts with "<tr" then it's the start of a new row; if what was captured starts with "<td" then it's an entry.
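
A minimal sketch of how that single pattern might be used with re.finditer() to rebuild the table/row/entry nesting. It assumes simple, well-formed markup in which no attribute value contains '<' or '>'. The helper name parse_tables and the sample string are made up for illustration; only the pattern itself comes from the suggestion above.

```python
import re

# Pattern from the post: a table start, a row start, or an entry plus its text.
TOKEN = re.compile(r'(<table\b|<tr\b|<td[^<]*)')

def parse_tables(html):
    """Group entry text as tables -> rows -> entries (hypothetical helper)."""
    tables = []
    for match in TOKEN.finditer(html):
        token = match.group(1)
        if token.startswith('<table'):
            tables.append([])              # start of a new table
        elif token.startswith('<tr'):
            if tables:
                tables[-1].append([])      # start of a new row
        elif tables and tables[-1]:
            # token looks like '<td ...>text'; keep what follows the first '>'
            tables[-1][-1].append(token.partition('>')[2].strip())
    return tables

# Made-up sample input.
sample = """<table border="1">
<tr><td>a</td><td>b</td></tr>
<tr><td class="x">c</td></tr>
</table>"""

print(parse_tables(sample))  # prints [[['a', 'b'], ['c']]]
```

Each match is handled as it arrives, so the nesting is tracked by the loop rather than by nested groups in the regex, which sidesteps the "last match only" behaviour of repeated groups. For real-world HTML a proper parser is likely to be more robust than any regex.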