Re: Help me understand use of regular expressions to validate data



Ted <r.ted.byers@xxxxxxxxxx> wrote:

The context here is I need to create a script that validates data in


The common idiom for validating data is:

anchor the start
anchor the end
write a pattern in between that accounts for everything that
you want to allow

Then if the pattern matches the string, valid data, else invalid data.
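
As a minimal sketch of that idiom (the sub name is_valid_code and the "1 to 8 uppercase letters" rule are made up for illustration, they are not from your spec):

```perl
use strict;
use warnings;

# Hypothetical rule: the whole string must be 1 to 8 uppercase letters.
sub is_valid_code {
    my ($value) = @_;
    # ^ anchors the start, $ anchors the end, and the part in between
    # lists everything that is allowed.
    return $value =~ /^[A-Z]{1,8}$/ ? 1 : 0;
}

for my $candidate ('ABC', 'ABC123', '') {
    printf "'%s' is %s\n", $candidate,
        is_valid_code($candidate) ? 'valid' : 'invalid';
}
```

Without the anchors, 'ABC123' would pass, because the pattern would only need to match somewhere inside the string.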


fields in plain text files where fields may be surrounded by double
quotes and may be separated by commas or tabs. In fact, one supplier
of a data feed we use has been known to switch between comma separated
values and tab delimited values, often without warning.


In that case, I would attempt to detect what separator is being
used, then normalize it before proceeding to splitting out the
fields for individual validation.


In one of the FAQs, I found the following regular expressions, but I
have some questions.

if (/\D/) { print "has nondigits\n" }
if (/^\d+$/) { print "is a whole number\n" }
if (/^-?\d+$/) { print "is an integer\n" }
if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
{ print "a C float\n" }

The first question is "What string is the regular expression applied
to?"


You should check Perl's std docs *before* posting to the Perl newsgroup.

The description of the m// operator in perlop.pod says what string
will be searched by default, and how to make it look somewhere
besides that default place if you wish to.

If no string is specified via the =~ or !~ operator,
the $_ string is searched.
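
A short sketch of the difference, using two of the FAQ patterns (the variable names and sample values are only for illustration):

```perl
use strict;
use warnings;

my @results;

# With no binding operator, the match is applied to $_,
# which the for loop sets to each list element in turn.
for ('42', 'abc') {
    push @results, /^\d+$/ ? "$_ is a whole number"
                           : "$_ is not a whole number";
}

# With =~ the match is applied to the named variable instead.
my $field = '-17';
push @results, $field =~ /^-?\d+$/ ? "$field is an integer"
                                   : "$field is not an integer";

# With !~ the sense of the match is negated.
push @results, "$field has no letters" if $field !~ /[a-zA-Z]/;

print "$_\n" for @results;
```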


I can recognize '\d+' as representing an arbitrary number of digits,


It does not match zero digits, so not quite an "arbitrary number".


but what are '^' and '$' for ?


Once again, going to the docs is faster, more authoritative, and
helps you to avoid wearing out your welcome before you get to
questions that cannot be answered by a cursory search of the
documentation.

perldoc perlre

^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)


(my code below ignores that parenthetical, \z might be better
than $ for your application...)
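
To see what that parenthetical means in practice, here is a small sketch (the sample string is made up) comparing $ with \z on a value that still has its trailing newline:

```perl
use strict;
use warnings;

my $with_newline = "123\n";

# $ matches at the very end of the string OR just before a final
# newline, so this pattern accepts the un-chomped value.
my $dollar_ok = $with_newline =~ /^\d+$/ ? 1 : 0;

# \z matches only at the very end of the string, so this one rejects it.
my $z_ok = $with_newline =~ /^\d+\z/ ? 1 : 0;

print "with \$: $dollar_ok, with \\z: $z_ok\n";
```

If you chomp() every line before validating, the two behave the same for this kind of data.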


From what I have read, I expect I can use '\w' to test whether or not a
variable contains a string consisting only of alpha numeric characters.
^^^^^^^^                   ^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^
^^^^^^^^                   ^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^
Is that right?


No.

First it does not match only alphanumerics, just as perlre.pod says:

\w Match a "word" character (alphanumeric plus "_")

Secondly, \w can be used to test if the string (which may not be
in a "variable") _contains_ an alphanumeric or "_" character.

To get to "consisting only of", you need to apply the idiom:

/^\w+$/

or

/^[a-zA-Z0-9_]+$/

or, if you really want only alphanumerics

/^[a-zA-Z0-9]+$/
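
A sketch that runs the unanchored and anchored forms side by side (the sample strings and hash key names are my own, for illustration only):

```perl
use strict;
use warnings;

my %report;
for my $sample ('abc123', 'abc 123', 'abc_123', '!!!') {
    $report{$sample} = {
        # unanchored: true if ANY character is a word character
        contains_word_char => ($sample =~ /\w/             ? 1 : 0),
        # anchored: true only if EVERY character is a word character
        only_word_chars    => ($sample =~ /^\w+$/          ? 1 : 0),
        # anchored, and "_" is no longer allowed
        only_alphanumerics => ($sample =~ /^[a-zA-Z0-9]+$/ ? 1 : 0),
    };
}

for my $sample (sort keys %report) {
    my $r = $report{$sample};
    printf "%-8s contains=%d word_only=%d alnum_only=%d\n", $sample,
        $r->{contains_word_char}, $r->{only_word_chars},
        $r->{only_alphanumerics};
}
```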


What would I use to test, using a regular expression,
whether a given string contains only alphanumeric characters, and that
the total number of characters is less than or equal to 8?


/^\w{0,8}$/

but your spec is probably incomplete, so I think you probably want:

/^\w{1,8}$/

instead.
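
Since \w also lets "_" through, a strictly alphanumeric version would swap in the character class. A sketch, assuming ASCII letters and digits are what you mean by "alphanumeric" (the sub name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: 1 to 8 ASCII letters or digits, nothing else.
sub is_short_alnum {
    my ($value) = @_;
    return $value =~ /^[a-zA-Z0-9]{1,8}$/ ? 1 : 0;
}

for my $candidate ('abc123', 'abcdefgh', 'abcdefghi', 'abc_1', '') {
    printf "'%s' is %s\n", $candidate,
        is_short_alnum($candidate) ? 'valid' : 'invalid';
}
```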


What about
testing for a string containing precisely 4 letters and 3 digits?


One part regex, two parts NOT a regex:

/^[a-zA-Z0-9]{7}$/ and tr/a-zA-Z// == 4 and tr/0-9// == 3
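
The one-liner above works on $_. Wrapped in a sub that takes the string as an argument, it might look like this sketch (the sub name is made up):

```perl
use strict;
use warnings;

# Hypothetical helper: exactly 4 letters and 3 digits, in any order.
sub has_4_letters_3_digits {
    my ($value) = @_;

    # The regex part: exactly 7 characters, all letters or digits.
    return 0 unless $value =~ /^[a-zA-Z0-9]{7}$/;

    # The non-regex part: tr/// with an empty replacement list
    # counts the matching characters without changing the string.
    my $letters = ($value =~ tr/a-zA-Z//);
    my $digits  = ($value =~ tr/0-9//);

    return ($letters == 4 && $digits == 3) ? 1 : 0;
}

for my $candidate ('ab12cd3', 'abcde12', 'abcd1234') {
    printf "'%s' is %s\n", $candidate,
        has_4_letters_3_digits($candidate) ? 'valid' : 'invalid';
}
```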


I will also need to be able to check to see whether or not a given
string represents a valid date or timestamp.


You are going to need to give more precise criteria for "valid" here.

In most of _my_ applications I usually use:

/^\d\d\d\d-\d\d-\d\d$/

and call it good enough.

If you want 2006-02-30 or 2006-13-01 to be invalid, or if you want
\d\d\d\d-02-29 to be valid for some years and invalid for other
years, then I'd start looking for a module on CPAN...
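
If a module is not an option, the calendar check is small enough to sketch by hand. This is only an illustration, assuming YYYY-MM-DD input and Gregorian leap-year rules for every year (the sub name is made up); a well-tested CPAN module is still the safer choice:

```perl
use strict;
use warnings;

# Hypothetical helper: shape check with a regex, then a calendar check.
sub is_valid_date {
    my ($text) = @_;

    return 0 unless $text =~ /^(\d\d\d\d)-(\d\d)-(\d\d)$/;
    my ($year, $month, $day) = ($1, $2, $3);

    return 0 if $month < 1 || $month > 12;

    my @days_in_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);

    # Assumption: Gregorian rule applied to all years.
    my $is_leap = ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0;

    my $max_day = $days_in_month[$month - 1];
    $max_day = 29 if $month == 2 && $is_leap;

    return ($day >= 1 && $day <= $max_day) ? 1 : 0;
}

for my $candidate ('2006-02-28', '2006-02-30', '2006-13-01', '2004-02-29') {
    printf "%s is %s\n", $candidate,
        is_valid_date($candidate) ? 'valid' : 'invalid';
}
```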


I still haven't decided how to handle the
fact that one of our suppliers sometimes switches between commas and
tabs, sometimes without warning. Suggestions are welcome, though.


Insufficient information.

When commas are used, can you have commas in fields?

When tabs are used, can you have tabs in fields?

If the format allows separators in quoted fields, then how are
quotes represented in quoted fields?

Is there a fixed and expected number of fields in a record?

If not, then can you at least expect the _same_ number of fields
in any particular file?



You can perhaps "guess".

Read the first 10 or 20 records and calculate the tabs/commas ratio
for each, then see if most of the ratios are greater or less
than one.

Certainly not robust or fool-proof, but would probably work on most data...
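
A sketch of that guess (the sub name and the sample records are made up). It compares the two counts per record instead of dividing them, which gives the same "greater or less than one" answer without a divide-by-zero when a record has no commas:

```perl
use strict;
use warnings;

# Hypothetical helper: each record votes for whichever separator it
# contains more of; returns "\t", ",", or undef when there is no winner.
sub guess_separator {
    my @records = @_;
    my ($tab_votes, $comma_votes) = (0, 0);

    for my $record (@records) {
        # tr/// with an empty replacement list just counts characters.
        my $tabs   = ($record =~ tr/\t//);
        my $commas = ($record =~ tr/,//);

        if    ($tabs > $commas) { $tab_votes++ }
        elsif ($commas > $tabs) { $comma_votes++ }
    }

    return "\t" if $tab_votes > $comma_votes;
    return ','  if $comma_votes > $tab_votes;
    return undef;
}

my @sample = ("id\tname\tnote", "1\tSmith, J\tok", "2\tJones\tok");
my $sep = guess_separator(@sample);
print defined $sep ? ($sep eq "\t" ? "tab\n" : "comma\n") : "unknown\n";
```

In real use you would feed it the first 10 or 20 lines read from the file. It counts separators inside quoted fields too, so it is exactly as fallible as described above.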


Sorry if this seems basic,


"basic" is nothing to apologize for. There is no "minimum complexity"
expected for posting here.

Asking things that can be answered straightaway by a cursory search
of Perl's standard documentation however is another matter.

Have you seen the Posting Guidelines that are posted here frequently?


but it has been eons since I last looked at
regular expressions, and I have not found sufficient detail in the
documentation I have found.


If you tell us what documentation you have found, then we might be
able to tell you about some that you have not found...

Have you found "perlop.pod" and "perlre.pod" for instance?


See also:

perldoc perlrequick

perldoc perlretut

--
Tad McClellan SGML consulting
tadmc@xxxxxxxxxxxxxx Perl programming
Fort Worth, Texas