Help me understand use of regular expressions to validate data

Ted

The context here is I need to create a script that validates data in
fields in plain text files where fields may be surrounded by double
quotes and may be separated by commas or tabs. In fact, one supplier
of a data feed we use has been known to switch between comma separated
values and tab delimited values, often without warning.

In one of the FAQs, I found the following regular expressions, but I
have some questions.

if (/\D/) { print "has nondigits\n" }
if (/^\d+$/) { print "is a whole number\n" }
if (/^-?\d+$/) { print "is an integer\n" }
if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
{ print "a C float\n" }

The first question is "What string is the regular expression applied
to?"

I can recognize '\d+' as representing an arbitrary number of digits,
but what are '^' and '$' for ?

I don't care about distinctions between float, decimal and real numbers.
However, I may need to distinguish between float and double
precision numbers. If that need materializes, how might I modify one
of the regular expressions above to determine whether the value
in a given variable is necessarily a double? (Assume that any single
precision number can be treated as a double precision number
for the purpose of converting strings from a text file into an
appropriate number.)

From what I have read, I expect I can use '\w' to test whether or not a
variable contains a string consisting only of alpha numeric characters.
Is that right? What would I use to test, using a regular expression,
whether a given string contains only alphanumeric characters, and that
the total number of characters is less than or equal to 8? What about
testing for a string containing precisely 4 letters and 3 digits?

I will also need to be able to check to see whether or not a given
string represents a valid date or timestamp.

To put this back into my context, I'd be reading in the text file,
splitting each record into its fields. I'd also read in, from a
different file, information regarding the number of fields and the type
of each field. I'd then verify that there is the correct number of
fields and that each field has a valid string that contains the right
kind of data for that field. I still haven't decided how to handle the
fact that one of our suppliers sometimes switches between commas and
tabs, sometimes without warning. Suggestions are welcome, though.

Sorry if this seems basic, but it has been eons since I last looked at
regular expressions, and I have not found sufficient detail in the
documentation I have found.

Thanks,

Ted
 
Juha Laiho

Ted said:
In one of the FAQs, I found the following regular expressions, but I
have some questions.

if (/\D/) { print "has nondigits\n" }
if (/^\d+$/) { print "is a whole number\n" }
if (/^-?\d+$/) { print "is an integer\n" }
if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
{ print "a C float\n" }

The first question is "What string is the regular expression applied
to?"

$_ - the "default" variable. This is set in various contexts; see
description in "perldoc perlvar".

I can recognize '\d+' as representing an arbitrary number of digits,
but what are '^' and '$' for ?

Start and end of the string. So, plain /\d+/ would match a23b as well as
23, whereas /^\d+$/ requires that the string contains only digits.
Described in "perldoc perlre", by the way.
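
To make the difference concrete, here is a minimal sketch that runs both patterns over the same inputs. The sample values are just illustrations, not anything from your data feed.

```perl
use strict;
use warnings;

# Compare an unanchored and an anchored pattern on the same inputs.
my @results;
for my $value ('23', 'a23b', '') {
    my $unanchored = $value =~ /\d+/   ? 1 : 0;  # a run of digits anywhere in the string
    my $anchored   = $value =~ /^\d+$/ ? 1 : 0;  # digits and nothing else
    push @results, "'$value': unanchored=$unanchored anchored=$anchored";
}
print "$_\n" for @results;
```

Only '23' passes both tests; 'a23b' passes the unanchored one alone, and the empty string passes neither because \d+ needs at least one digit.
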
I don't care about distinctions between float decimal and real numbers.
However, I may have a need to distinguish between float and double
precision numbers.

For that I don't have an answer.

From what I have read, I expect I can use '\w' to test whether or not a
variable contains a string consisting only of alpha numeric characters.

No, /\w/ would be true whenever the variable contains at least one
"word character". To ensure that you only have word characters, you
could use /^\w+$/ . If you also allow empty strings, then /^\w*$/ would
be the correct one.

Is that right? What would I use to test, using a regular expression,
whether a given string contains only alphanumeric characters, and that
the total number of characters is less than or equal to 8?

/^\w{0,8}$/ or /^\w{1,8}$/, if you don't want empty strings. \w includes
the underscore character, so you'll have to tune if you want to disallow
it.

What about testing for a string containing precisely 4 letters and 3 digits?

/^[[:alpha:]]{4}\d{3}$/ (this matches 4 letters followed by 3 digits, in that order)

I will also need to be able to check to see whether or not a given
string represents a valid date or timestamp.

Please start by defining all possible things you'd like to consider as
valid dates/timestamps. Then, if you also want to parse the actual
timestamps (i.e. know what the time/date is, in addition to just storing
the data), check that no two allowed formats can be confused with
each other.

Sorry if this seems basic, but it has been eons since I last looked at
regular expressions, and I have not found sufficient detail in the
documentation I have found.

"perldoc perlre", distributed with your perl interpreter, and online at
http://www.perl.com/doc/manual/html/pod/perlre.html . All my answers
above are from the data in that one document (except what was related to
$_).
 
Alan J. Flavell

No, /\w/ would be true whenever the variable contains at least one
"word character". To ensure that you only have word characters, you
could use /^\w+$/ . If you also allow empty strings, then /^\w*$/ would
be the correct one.

Pretty much, but, as the documentation (perldoc perlre) says, \w
includes also the underscore (OK, you said that later on); also, if
"use locale" is in effect, it includes whatever characters the locale
defines to be alphabetic.

regards
 
Tad McClellan

Ted said:
The context here is I need to create a script that validates data in


The common idiom for validating data is:

anchor the start
anchor the end
write a pattern in between that accounts for everything that
you want to allow

Then if the pattern matches the string, valid data, else invalid data.
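
Since you mentioned reading the field types from a separate file, the idiom could be sketched as a lookup table of anchored patterns. The type names (whole, integer, decimal) and the is_valid helper are hypothetical names chosen for illustration; the patterns themselves are the FAQ ones you quoted.

```perl
use strict;
use warnings;

# Hypothetical field-type names; the patterns are the FAQ ones quoted in this thread.
my %pattern_for = (
    whole   => qr/^\d+$/,
    integer => qr/^[+-]?\d+$/,
    decimal => qr/^-?(?:\d+(?:\.\d*)?|\.\d+)$/,
);

# Returns 1 if $value is valid for $type, 0 if not; dies on an unknown type.
sub is_valid {
    my ($type, $value) = @_;
    my $re = $pattern_for{$type}
        or die "unknown field type '$type'\n";
    return $value =~ $re ? 1 : 0;
}

for my $case ([integer => '-42'], [decimal => '3.14'], [whole => '12a']) {
    my ($type, $value) = @$case;
    printf "%-7s %-5s %s\n", $type, $value,
        is_valid($type, $value) ? 'valid' : 'invalid';
}
```

Adding a new field type then means adding one entry to the table rather than another if-statement.
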

fields in plain text files where fields may be surrounded by double
quotes and may be separated by commas or tabs. In fact, one supplier
of a data feed we use has been known to switch between comma separated
values and tab delimited values, often without warning.


In that case, I would attempt to detect what separator is being
used, then normalize it before proceeding to splitting out the
fields for individual validation.

In one of the FAQs, I found the following regular expressions, but I
have some questions.

if (/\D/) { print "has nondigits\n" }
if (/^\d+$/) { print "is a whole number\n" }
if (/^-?\d+$/) { print "is an integer\n" }
if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
{ print "a C float\n" }

The first question is "What string is the regular expression applied
to?"


You should check Perl's std docs *before* posting to the Perl newsgroup.

The description of the m// operator in perlop.pod says what string
will be searched by default, and how to make it look somewhere
besides that default place if you wish to.

If no string is specified via the =~ or !~ operator,
the $_ string is searched.
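
A short sketch of both forms, assuming a field value held in a variable called $field (a name made up for this example):

```perl
use strict;
use warnings;

my $field = '2006';

# Explicit target: =~ binds the match to $field instead of $_.
my $field_is_whole = $field =~ /^\d+$/ ? 1 : 0;

# !~ is the negated form: true when the pattern does NOT match.
my $field_is_clean = $field !~ /\D/ ? 1 : 0;

# Implicit target: the loop aliases each element to $_, which a bare /.../ searches.
my @whole_numbers;
for ('12', 'x9', '007') {
    push @whole_numbers, $_ if /^\d+$/;
}

print "field_is_whole=$field_is_whole field_is_clean=$field_is_clean\n";
print "whole numbers: @whole_numbers\n";
```
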

I can recognize '\d+' as representing an arbitrary number of digits,


It does not match zero digits, so not quite an "arbitrary number".

but what are '^' and '$' for ?


Once again, going to the docs is faster, more authoritative, and
helps you to avoid wearing out your welcome before you get to
questions that cannot be answered by a cursory search of the
documentation.

perldoc perlre

^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)


(my code below ignores that parenthetical, \z might be better
than $ for your application...)
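
To illustrate that parenthetical, here is a small sketch assuming a line that was read from a file and not chomp-ed:

```perl
use strict;
use warnings;

my $with_newline = "123\n";   # e.g. a line read from a file but not chomp-ed

# $ may match just before a final newline, so this succeeds.
my $dollar_ok = $with_newline =~ /^\d+$/  ? 1 : 0;

# \z matches only at the very end of the string, so this fails.
my $z_ok      = $with_newline =~ /^\d+\z/ ? 1 : 0;

print "dollar=$dollar_ok z=$z_ok\n";
```

If the fields come out of a split on already chomp-ed records, the two anchors behave the same.
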

From what I have read, I expect I can use '\w' to test whether or not a
variable contains a string consisting only of alpha numeric characters.
^^^^^^^^                   ^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^
Is that right?


No.

First it does not match only alphanumerics, just as perlre.pod says:

\w Match a "word" character (alphanumeric plus "_")

Secondly, \w can be used to test if the string (which may not be
in a "variable") _contains_ an alphanumeric or "_" character.

To get to "consisting only of", you need to apply the idiom:

/^\w+$/

or

/^[a-zA-Z0-9_]+$/

or, if you really want only alphanumerics

/^[a-zA-Z0-9]+$/

What would I use to test, using a regular expression,
whether a given string contains only alphanumeric characters, and that
the total number of characters is less than or equal to 8?


/^\w{0,8}$/

but your spec is probably incomplete, so I think you probably want:

/^\w{1,8}$/

instead.
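
If the underscore matters, a sketch comparing \w with the explicit alphanumeric class might look like this. The two helper names are hypothetical.

```perl
use strict;
use warnings;

# \w accepts the underscore; the explicit class accepts ASCII letters and digits only.
sub is_word_1_to_8  { return $_[0] =~ /^\w{1,8}$/          ? 1 : 0 }
sub is_alnum_1_to_8 { return $_[0] =~ /^[a-zA-Z0-9]{1,8}$/ ? 1 : 0 }

for my $value ('abc123', 'ab_12', 'toolong99', '') {
    printf "%-12s word=%d alnum=%d\n", "'$value'",
        is_word_1_to_8($value), is_alnum_1_to_8($value);
}
```

Only 'ab_12' is treated differently by the two: \w{1,8} accepts it and the explicit class rejects it.
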

What about
testing for a string containing precisely 4 letters and 3 digits?


One part regex, two parts NOT a regex:

/^[a-zA-Z0-9]{7}$/ and tr/a-zA-Z// == 4 and tr/0-9// == 3
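
Wrapped in a sub that takes the string as an argument instead of using $_, that could be sketched as follows. The name has_4_letters_3_digits is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str is exactly 4 letters and 3 digits, in any order.
sub has_4_letters_3_digits {
    my ($str) = @_;
    return 0 unless $str =~ /^[a-zA-Z0-9]{7}$/;   # 7 alphanumerics, nothing else
    my $letters = ($str =~ tr/a-zA-Z//);          # tr with an empty replacement only counts
    my $digits  = ($str =~ tr/0-9//);
    return ($letters == 4 && $digits == 3) ? 1 : 0;
}

for my $value ('ab12cd3', 'abcd123', 'abc1234', 'abcd12') {
    printf "%-8s %s\n", $value, has_4_letters_3_digits($value) ? 'ok' : 'rejected';
}
```

Here tr/// never modifies the string; with an empty replacement list and no modifiers it just returns the number of matching characters.
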

I will also need to be able to check to see whether or not a given
string represents a valid date or timestamp.


You are going to need to give more precise criteria for "valid" here.

In most of _my_ applications I usually use:

/^\d\d\d\d-\d\d-\d\d$/

and call it good enough.

If you want 2006-02-30 or 2006-13-01 to be invalid, or if you want
\d\d\d\d-02-29 to be valid for some years and invalid for other
years, then I'd start looking for a module on CPAN...
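
If a module is not an option, a hand-rolled check is possible for this one format. The following is only a sketch under stated assumptions: dates are YYYY-MM-DD, Gregorian leap-year rules apply to every year, and is_valid_ymd is a name invented here. It is not a substitute for a tested CPAN module.

```perl
use strict;
use warnings;

# Hypothetical helper: checks the shape with the regex, then the calendar by hand.
sub is_valid_ymd {
    my ($date) = @_;
    my ($y, $m, $d) = $date =~ /^(\d\d\d\d)-(\d\d)-(\d\d)$/
        or return 0;                                  # wrong shape
    return 0 if $m < 1 || $m > 12;

    my @days_in_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $is_leap = ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;
    my $max_day = $days_in_month[$m - 1];
    $max_day = 29 if $m == 2 && $is_leap;             # February in a leap year

    return ($d >= 1 && $d <= $max_day) ? 1 : 0;
}

for my $date ('2006-02-28', '2006-02-30', '2006-13-01', '2004-02-29') {
    printf "%s %s\n", $date, is_valid_ymd($date) ? 'valid' : 'invalid';
}
```
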

I still haven't decided how to handle the
fact that one of our suppliers sometimes switches between commas and
tabs, sometimes without warning. Suggestions are welcome, though.


Insufficient information.

When commas are used, can you have commas in fields?

When tabs are used, can you have tabs in fields?

If the format allows separators in quoted fields, then how are
quotes represented in quoted fields?

Is there a fixed and expected number of fields in a record?

If not, then can you at least expect the _same_ number of fields
in any particular file?



You can perhaps "guess".

Read the first 10 or 20 records and calculate the tabs/commas ratio
for each, then see if most of the ratios are greater or less
than one.

Certainly not robust or fool-proof, but would probably work on most data...
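
A rough sketch of that guess, with the caveats already given. The name guess_separator is hypothetical, the counts are compared directly rather than divided (which avoids dividing by zero when a record has no commas), and a tie falls back to comma, which is an arbitrary choice made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: takes an array ref of sample records, returns "\t" or ",".
sub guess_separator {
    my ($lines) = @_;
    my ($tab_votes, $comma_votes) = (0, 0);
    for my $line (@$lines) {
        my $tabs   = ($line =~ tr/\t//);   # count tabs in this record
        my $commas = ($line =~ tr/,//);    # count commas in this record
        if    ($tabs > $commas) { $tab_votes++ }     # ratio greater than one
        elsif ($commas > $tabs) { $comma_votes++ }   # ratio less than one
    }
    return $tab_votes > $comma_votes ? "\t" : ",";
}

my @sample = ("a\tb\tc", "1\t2,5\t3", "x\ty\tz");
my $sep = guess_separator(\@sample);
print $sep eq "\t" ? "tab-separated\n" : "comma-separated\n";
```

In a real script the sample would be the first 10 or 20 records of the file, and this simple count ignores whether a separator sits inside a quoted field.
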

Sorry if this seems basic,


"basic" is nothing to apologize for. There is no "minimum complexity"
expected for posting here.

Asking things that can be answered straightaway by a cursory search
of Perl's standard documentation however is another matter.

Have you seen the Posting Guidelines that are posted here frequently?

but it has been eons since I last looked at
regular expressions, and I have not found sufficient detail in the
documentation I have found.


If you tell us what documentation you have found, then we might be
able to tell you about some that you have not found...

Have you found "perlop.pod" and "perlre.pod" for instance?


See also:

perldoc perlrequick

perldoc perlretut
 
Henry Law

Ted said:
The context here is I need to create a script that validates data in
fields in plain text files where fields may be surrounded by double
quotes and may be separated by commas or tabs. <snip>

Lots of good advice in other posts, Ted. Here's what I'd add:

1. Like the good programmer I hope you are, you need to be very
precise with the specifications of what is and is not "valid".
For example in your question about distinguishing between
single and double-precision numbers, can you state a rule
that would allow you to distinguish reliably between them?
If so then you, or a combination of you and the group, can
code it.

2. Don't forget CPAN. Go to http://search.cpan.org and look for
some modules that may help. I did so on your behalf and thought
Data::Validate and also maybe Data::FormValidator::Constraints::Dates
looked relevant. Have a look for yourself and see what I've missed.

3. In a complex situation like this I don't think you should expect
to do all your validation simply and elegantly. Particularly if
your data has some real funnies, like switching between delimiters
in mid-stream! Also your date stamps may be quite idiosyncratic,
such that the standard modules don't understand them. Be prepared
to do some parsing on the data fields, and perhaps some substitution,
before using some standard module or snippet of code.

4. If what you're doing is really Extract/Transform/Load then bear in
mind that the manufacturers of ETL tools make a good living out of
the fact that it's astoundingly difficult to code up rules that will
squash real-world data into the Procrustean bed of a fixed
format! Good luck.
 
Eric Bohlman

2. Don't forget CPAN. Go to http://search.cpan.org and look for
some modules that may help. I did so on your behalf and thought
Data::Validate and also maybe Data::FormValidator::Constraints::Dates
looked relevant. Have a look for yourself and see what I've missed.

He should also take a look at Regexp::Common which provides an impressive
collection of already-written and exhaustively-tested (no kidding!) regular
expressions for matching common data formats.
 
