Re: beginners Digest 13 Mar 2006 05:54:58 -0000 Issue 2791



On 3/13/06, Martin <martin@xxxxxxxxxxxxxx> wrote:
Hi,

I am trying to get an attribute value from the given text. Kindly help me
out in this regard.

Input text:

The Bill for this Act of the Scottish Parliament was passed by the Parliament on
15th December 2005 and received Royal Assent on 20th January 2006

Output needed:

<assent date="20060120">The Bill for this Act of the Scottish Parliament was
passed by the Parliament on 15th December 2005 and received Royal Assent
on 20th January 2006</assent>

Note: The date attribute should be recovered from the highlighted text.

Regards,
Martin.
snip

You should avoid trying to send formatted text to a mailing list
(there is no highlighted text). I assume you mean to get the date
from the text "received Royal Assent on 20th January 2006". This is
easy enough to do with a regular expression and a couple of hashes to
map text like "January" to a formatted number like 01. Your largest
problem (if this is a real world application and not a puzzle or
homework assignment) is that normal text is never this clean.

Here is how I would go about writing the regex needed to pull out the information.

First we need to identify the parts of the string:
1. "received Royal Assent on "
2. "20"
3. "th"
4. " "
5. "January"
6. " "
7. "2006"

Part 1 seems to be constant.
Part 2 seems to be a one or two digit number representing the day of the month.
Part 3 seems to be irrelevant; I think we just want to get rid of it.
Part 4 seems to be constant.
Part 5 seems to be the month spelled out.
Part 6 seems to be constant.
Part 7 seems to be a four digit year.

Next we need to identify the parts we want to capture. We want the
day, month, and year so that would be parts 2, 5, and 7. These will
need to be capture groups in the eventual regex.

Now that we know what the parts are, let's start writing some regexes to match them.

Part 1 should just match the whole string: /received\sRoyal\sAssent\son\s/
Part 2 needs to match one or two digits (I am going to assume that
these will fall into the right range 1 - 31, but the regex could take
this into account as well): /\d{1,2}/
Part 3 seems to be two characters long and we don't care about it. It
should be /../ or possibly /.*?/ if it isn't always two characters
long.
Part 4 is just a space /\s/
Part 5 is either a set of constants or a word, depending on how much
validation you want to do:
/January|February|March|April|May|June|July|August|September|October|November|December/
or /\w+/ respectively.
Part 6 is another space: /\s/
Part 7 is a four digit year: /\d{4}/

Now we just need to combine the individual regexes and capture the
parts we want:

/received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/
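
As a quick check, the combined regex can be run against the sample
sentence from the original question. This is only a sketch; the
variable names ($text, $day, $mon, $year) are mine and are not part of
the solution further down.

```perl
use strict;
use warnings;

# Sample sentence from the original question.
my $text = 'The Bill for this Act of the Scottish Parliament was passed by '
         . 'the Parliament on 15th December 2005 and received Royal Assent '
         . 'on 20th January 2006';

# In list context a match returns the capture groups: day, month, year.
my ($day, $mon, $year) =
    $text =~ /received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/;

print "day=$day month=$mon year=$year\n";   # day=20 month=January year=2006
```

Note that "15th December 2005" is not picked up, because the constant
text in part 1 anchors the match to the Royal Assent date.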

Now that we know what the regex is we can add the tags:

# Map month names to two digit month numbers.
my %month = (
    January   => '01',
    February  => '02',
    March     => '03',
    April     => '04',
    May       => '05',
    June      => '06',
    July      => '07',
    August    => '08',
    September => '09',
    October   => '10',
    November  => '11',
    December  => '12'
);

# Map day numbers 1 .. 31 to zero padded two digit strings.
my %day = map { $_ => sprintf "%2.2d", $_ } 1 .. 31;

# $str holds the input text.
if ($str =~ /received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/) {
    $str = qq(<assent date="$3$month{$2}$day{$1}">$str</assent>);
} else {
    print STDERR "could not find date this bill received Royal Assent: $str\n";
}
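
If the input is less tidy than the sample, one possible variation is
sketched below. It is an assumption on my part about what messier text
might look like, not something tested against real Acts: the ordinal
suffix is matched explicitly and may be missing, runs of whitespace are
allowed, and only known month names are accepted. The sub name
assent_tag is hypothetical.

```perl
use strict;
use warnings;

my %month = (
    January   => '01', February => '02', March    => '03', April    => '04',
    May       => '05', June     => '06', July     => '07', August   => '08',
    September => '09', October  => '10', November => '11', December => '12',
);

# Build the month alternation from the hash keys so the two cannot drift apart.
my $month_re = join '|', sort keys %month;

# assent_tag is a hypothetical helper: it returns the wrapped string,
# or undef when no Royal Assent date is found.
sub assent_tag {
    my ($str) = @_;
    return undef
        unless $str =~ /received\s+Royal\s+Assent\s+on\s+
                        (\d{1,2})(?:st|nd|rd|th)?\s+   # day, optional suffix
                        ($month_re)\s+                 # known month name only
                        (\d{4})                        # four digit year
                       /x;
    my $date = sprintf '%04d%02d%02d', $3, $month{$2}, $1;
    return qq(<assent date="$date">$str</assent>);
}

print assent_tag('received Royal Assent on 3rd March 2006'), "\n";
```

Restricting the month to the hash keys means an unexpected word such as
"Janury" fails the match outright instead of producing a date attribute
with a missing month.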