Regex matching non-contiguous shreds of text
From: DM (elektrophyte-yahoo)
Date: 10/20/04
- Next message: Sherm Pendley: "Re: perl to english"
- Previous message: wana: "Re: perl to english"
- Next in thread: Jon Ericson: "Re: Regex matching non-contiguous sheds of text"
- Reply: Jon Ericson: "Re: Regex matching non-contiguous sheds of text"
- Reply: Paul Lalli: "Re: Regex matching non-contiguous sheds of text"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 20 Oct 2004 10:26:17 -0700
I'm trying to design a regular expression to match the href attribute of <a>
tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.
Here's the command I'm using:
# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
(On my console, the above is all one line. The URL part,
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at runtime in
the actual Perl script.)
It almost works as expected. I set the color and -o options in order to clearly
show the highlighted match. In most cases it *does* match exactly what I want it to.
However, in a few cases what is matched is totally unexpected.
Here is some sample output:
================================================================================
# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
/home/mtc_website/whats_happening/legislative_update/tea21_04-04.htm:43:href="TEA-21_Side-by-Side.pdf">
/home/mtc_website/whats_happening/legislative_update/tea21_06-04.htm:42:href="TEA-21_Side-by-Side.pdf">
<li> <a href="TEA-21_Side-by-Side.pdf">rong>
<ul>ng="5">.ca.gov</a> s tober LATIVE UPDATE" width="340" height="14" border="0" />
================================================================================
In the file "tea21_06-04.htm" it's going beyond what I intend to match and
scooping up a bunch more stuff. But it isn't even clear to me what it's matching
because the output shows discontinuous shreds of text from within the file.
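One possible explanation for the overshoot, offered as a hedged sketch rather than a confirmed diagnosis: `.*` is greedy, so whenever a single line (as grep sees it) holds more than one `href=`, the match starts at the first one and runs all the way to the PDF link. The sample line and the tighter `[^"]*` pattern below are made up for illustration, and the sketch assumes GNU grep built with `-P` support.

```shell
#!/bin/sh
# Hypothetical one-line sample with two links; not taken from the real site.
line='<a href="index.htm">Home</a> <a href="TEA-21_Side-by-Side.pdf">Comparison</a>'

# Original pattern: the greedy ".*" lets the match begin at the FIRST href=
# on the line and extend through everything up to the PDF file name.
greedy=$(printf '%s\n' "$line" | grep -oP 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>')

# Tighter pattern (an assumption about the intent): [^"]* cannot cross the
# closing quote, so the match has to stay inside one attribute value.
tight=$(printf '%s\n' "$line" | grep -oP 'href="[^"]*TEA-21_Side-by-Side\.pdf"[^>]*>')

printf 'greedy: %s\n' "$greedy"
printf 'tight:  %s\n' "$tight"
```

The tighter pattern assumes the attribute value is always double-quoted; unquoted or single-quoted values would need a different character class.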
Here is a sample of that file containing the unexpected match:
================================================================================
<td bgcolor="#CCFFFF"><strong>DOWNLOAD:</strong> <ul>
<li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
Provisions in Surface Transportation Reauthorization Bills</a>
(PDF)
<p> </p>
</li>
<li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
Projects</a> (Excel)<br />
</li>
</ul></td>
</tr>
</table>
<p><br />
<strong>TEA 21 Reauthorization Conference Committee Comes Closer to
Agreement on Bottom Line Number</strong><br />
================================================================================
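The shredded look of the output might come from carriage returns: if the file uses bare CR line endings, grep would treat many visual lines as one, and the terminal would overprint the pieces. That is only an assumption about the file, not something the listing above proves. A minimal sketch for checking it converts CRs to newlines before matching; the sample text is invented, and GNU grep with `-P` is assumed.

```shell
#!/bin/sh
# Hypothetical sample: two visual lines separated by a bare carriage return (\r).
sample=$(printf '<a href="index.htm">Home</a>\r<li> <a href="TEA-21_Side-by-Side.pdf">Comparison')

# Turn every CR into a newline so grep sees one visual line per input line.
converted=$(printf '%s\n' "$sample" | tr '\r' '\n')

# Count the lines grep now sees, then run the original pattern unchanged.
lines=$(printf '%s\n' "$converted" | grep -c '')
match=$(printf '%s\n' "$converted" | grep -oP 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>')

printf 'lines: %s\n' "$lines"
printf 'match: %s\n' "$match"
```

On a real file the same idea would be `tr '\r' '\n' < file | grep -oP ...`, at the cost of losing the file names and line numbers that `-rHn` reports.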
Any help would be greatly appreciated.
Thanks,
dm