Regex matching non-contiguous shreds of text

From: DM (elektrophyte-yahoo)
Date: 10/20/04


Date: Wed, 20 Oct 2004 10:26:17 -0700

I'm trying to design a regular expression to match the href attribute of <a>
tags. I'm testing it on the command line (on Redhat Linux Enterprise Server)
using grep with the Perl regex option.

Here's the command I'm using:

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/

(On my console, the above is all one line. The URL part,
"TEA-21_Side-by-Side\.pdf" in this example, would be determined at runtime in
the actual Perl script.)
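
Since the URL part is filled in at runtime, any regex metacharacters in it (such as the dot in ".pdf") need escaping before it goes into the pattern. Below is a minimal shell sketch of that step. The variable names `url`, `escaped`, and `pattern` are made up for illustration, and the sed-based escaping is only one possible approach. In the Perl script itself, `quotemeta` or `\Q...\E` would do the same job.

```shell
#!/bin/sh
# Hypothetical file name; in the real script this value arrives at runtime.
url='TEA-21_Side-by-Side.pdf'

# Put a backslash in front of each regex metacharacter so that it is
# matched literally (for example "." becomes "\.").
escaped=$(printf '%s' "$url" | sed 's/[][\.*^$+?(){}|]/\\&/g')

# Assemble the same pattern shape used in the grep command above.
pattern="href=.*${escaped}[^>]*>"
printf '%s\n' "$pattern"
```

The resulting string could then be passed to grep in place of the hard-coded pattern, for example `grep -rHInPo "$pattern" /home/mtc_website/`.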

It almost works as expected. I set the color and -o options in order to clearly
show the highlighted match. In most cases it *does* match exactly what I want it to.

However, in a few cases what is matched is totally unexpected.

Here is some sample output:

================================================================================

# grep -rHInPo --color=auto 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>'
/home/mtc_website/
/home/mtc_website/whats_happening/legislative_update/tea21_04-04.htm:43:href="TEA-21_Side-by-Side.pdf">
/home/mtc_website/whats_happening/legislative_update/tea21_06-04.htm:42:href="TEA-21_Side-by-Side.pdf">
                     <li> <a href="TEA-21_Side-by-Side.pdf">rong>
<ul>ng="5">.ca.gov</a> s tober LATIVE UPDATE" width="340" height="14" border="0" />

================================================================================

In the file "tea21_06-04.htm" it's going beyond what I intend to match and
scooping up a bunch more stuff. But it isn't even clear to me what it's matching
because the output shows discontinuous shreds of text from within the file.
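
One way to see exactly which bytes were matched is to make control characters visible instead of letting the terminal act on them. The sketch below assumes, without confirmation from the file itself, that the shreds come from bare carriage returns inside the match. The sample content (including "index.htm") and the names `sample` and `visible` are hypothetical. The pattern uses no Perl-only syntax, so the sketch runs it with plain `grep -o`.

```shell
#!/bin/sh
# Hypothetical sample: two "lines" separated by bare carriage returns (\r)
# rather than newlines, so grep sees them as one long line.
sample=$(mktemp)
printf '<a href="index.htm">Home</a>\r<li> <a href="TEA-21_Side-by-Side.pdf">Comparison\r' > "$sample"

# grep -o prints only the matched bytes; cat -v renders control
# characters visibly, so a carriage return shows up as ^M.
visible=$(grep -o 'href=.*TEA-21_Side-by-Side\.pdf[^>]*>' "$sample" | cat -v)
printf '%s\n' "$visible"

rm -f "$sample"
```

If `^M` shows up inside the match on the real file, the greedy `.*` is running from the first `href=` on a very long line, and the terminal is overprinting the pieces.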

Here is a sample of that file containing the unexpected match:

================================================================================

               <td bgcolor="#CCFFFF"><strong>DOWNLOAD:</strong> <ul>
                   <li> <a href="TEA-21_Side-by-Side.pdf">Comparison of Highway
                     Provisions in Surface Transportation Reauthorization Bills</a>
                     (PDF)
                     <p> </p>
                   </li>
                   <li><a href="HR3550-High-Priority_Proj.xls">H.R. 3550
High-Priority
                     Projects</a> (Excel)<br />
                   </li>
                 </ul></td>
             </tr>
           </table>
           <p><br />
             <strong>TEA 21 Reauthorization Conference Committee Comes Closer to
             Agreement on Bottom Line Number</strong><br />

================================================================================

Any help would be greatly appreciated.

Thanks,

dm


