Phone number regular expression...

joemono

Hello everyone!

First, I apologize if this posting isn't proper "netiquette" for this
group.

I've been working with perl for almost 2 years now. However, my regular
expression knowledge is pretty limited. I wrote the following expression to
take (hopefully) any _reasonable_ phone number input, and format it as
(999) 999-9999 x 9999.

Here's what I've come up with. I would like your comments, if you've got the
time. I'm really interested in regular expressions, and I want to know if
what I'm doing is inefficient, slow, etc...

# area code
\({0,1}\s*(\d{3}){0,1}\s*\){0,1}      # optional parentheses, 3 digits, optional parentheses
(?=[-| ]*(\d{3}){1}[-| ]*(\d{4}){1})  # match only if the first match is followed by
                                      # what looks like a phone number; this is the same
                                      # match as the standard 7 digit phone number below

# main phone number
[-| ]*
(\d{3}){1} # first 3 digits
[-| ]*
(\d{4}){0,1} # second 4 digits

# extension
[-| |x|X]*
(\d{3,4}){0,1} # extension

For example, here's a question I have. Is there a way to use the look-ahead
match in the area code section _again_ for matching the main number, since
they are the same? I also know that I could use ? instead of {0,1}
(correct?), but I always get confused between that and non-greedy
quantifier. Does that make sense?
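
For reference on the ? versus {0,1} point: in Perl, a ? placed after an atom is the same quantifier as {0,1}, while a ? placed after another quantifier (as in +?, *? or {0,1}?) makes that quantifier non-greedy. A small sketch with made-up sample strings, not taken from the test script mentioned in this post:

```perl
#!perl
use strict;
use warnings;

# Sample input chosen for illustration only.
my $input = "(310) 123 4567";

# \(? and \({0,1} are the same quantifier: both match an optional "(".
my $short = $input =~ /^\(?(\d{3})/     ? $1 : '';
my $long  = $input =~ /^\({0,1}(\d{3})/ ? $1 : '';
print "short form: $short, long form: $long\n";

# A ? placed after another quantifier makes it non-greedy.
my ($greedy) = "1234567" =~ /^(\d+)/;     # takes as many digits as it can
my ($lazy)   = "1234567" =~ /^(\d+?)/;    # takes as few digits as it can
print "greedy: $greedy, lazy: $lazy\n";
```

Both optional forms capture the same area code; only the +? variant changes how much is consumed.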

I wrote a script to test it (it generates many different possible phone
number inputs, and then applies the regular expression), and it _seems_ to
work. But like I said, I kinda don't know what I'm doing. I've been using
http://www.perldoc.com/perl5.6/pod/perlre.html heavily. It's pretty useful.

Here's another question, do people ever have extensions less than 3, or
greater than 4 numbers?

Thanks for your help!

Joe
 
Purl Gurl

joemono wrote:

(snipped)
I wrote the following expression to take (hopefully) any _reasonable_
phone number input, and format it as
(999) 999-9999 x 9999.

Parameter is "reasonable" American style phone numbers.

what I'm doing is inefficient, slow, etc...

(snipped a lot of regex matching)

Yes, very slow, very inefficient. Do not invoke a
regex engine unless you have no choice, or a regex
actually "proves" to be the most efficient method
found within a collection of tested methods.

Is there a way to use the look-ahead match

Never use look-ahead unless you have no choice.
Using any style of look-ahead will almost always
be slow and inefficient compared to other methods.

Note my "almost always" does not mean "always" as some
might ignorantly claim. In some cases, a look-ahead
could be your only choice, or most efficient choice.

do people ever have extensions less than 3, or greater than 4 numbers?

Extensions cannot be predicted. Length of an extension is
directly controlled by an internal PBX system. An extension
length can literally be any length.

What is the length of those extensions you hear during a
recorded menu selection? Is there more than one extension?
These types of numbers could be a problem.

1-800-tru-idiots
if you are stupid, press 1 now
*next menu*
if you are stupid and gullible, press 2 now
*next menu*
if you are stupid, gullible and tired of this, press 3 now
*next menu*
Thank you for calling America Onlame! You are an idiot! Goodbye!
*dial tone*

I count three extensions each with a length of one.

Your methodology allows parentheses, hyphens and such, then
tries to match for all possible combinations. This is quite
inefficient and prone to error.

Remove all characters except numbers, then work with your data.
You are interested in phone numbers, are you not? So work with
numbers, nothing else.
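
A minimal sketch of that idea, producing the (999) 999-9999 x 9999 layout asked for in the original post. The sub name format_phone is hypothetical, and treating every digit past the tenth as the extension is an assumption, not something the input can confirm.

```perl
#!perl
use strict;
use warnings;

# format_phone is a hypothetical helper name, not from the thread.
# Assumption: 7 digits is a local number, 10 digits adds an area code,
# and anything past the tenth digit is the extension.
sub format_phone {
    my ($raw) = @_;
    (my $digits = $raw) =~ tr/0-9//cd;    # keep digits only

    my $len = length $digits;
    return if $len != 7 && $len < 10;     # too few digits to format

    if ($len == 7) {
        return substr($digits, 0, 3) . '-' . substr($digits, 3);
    }

    my $out = sprintf '(%s) %s-%s',
        substr($digits, 0, 3), substr($digits, 3, 3), substr($digits, 6, 4);
    $out .= ' x ' . substr($digits, 10) if $len > 10;
    return $out;
}

for my $raw ('123-4567', '(310) 123 4567', '310-123-4567 ext 890', '310 123 FUBAR') {
    my $formatted = format_phone($raw);
    print defined $formatted
        ? "$formatted\n"
        : "Phone Number Appears Invalid: $raw\n";
}
```

The original string is left untouched, so an invalid entry can still be printed in full.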

Keep in mind, regardless of what methodology you employ, there
is a good chance there will be false positives and false negatives.
Parsing phone numbers is similar to parsing email addresses; it
is difficult and unpredictable.

Look over my method below. This method eliminates all characters
except numbers, then generates a very uniform output appropriate
for a data file. Output is also easy on the human eye.


Ever wonder why people use "spelled" phone numbers, like

1-800-bite-me

When someone tries to give me a spelled number, I say,

"Don't bother. I will not call you."


Purl Gurl
--
Rock Midis! Science Fiction! Amazing Androids!
http://www.purlgurl.net/~callgirl

My $test_it is used to exemplify a non-destructive
method, needed for a print of invalid numbers. You
could easily use $_ throughout as well, but this
defeats "full" printing of an invalid phone number.

#!perl

while (<DATA>)
{
    my $test_it = $_;
    $test_it =~ s/[^\d+]//g;

    if ($test_it =~ tr/0-9// == 7)
    {
        substr ($test_it, 3, 0, " ");
        print "$test_it\n";
    }
    elsif ($test_it =~ tr/0-9// == 10)
    {
        substr ($test_it, 3, 0, " ");
        substr ($test_it, 7, 0, " ");
        print "$test_it\n";
    }
    elsif ($test_it =~ tr/0-9// > 10)
    {
        substr ($test_it, 3, 0, " ");
        substr ($test_it, 7, 0, " ");
        substr ($test_it, 12, 0, " ");
        print "$test_it\n";
    }
    else
    { print "Phone Number Appears Invalid: $_\n"; }
}


__DATA__
123-4567
123 4567
(310) 123 4567
310-123-4567
310-123-4567 ext 890
310 123 4567 890
123-4567FUBAR
310 123 FUBAR



PRINTED RESULTS:
________________

123 4567
123 4567
310 123 4567
310 123 4567
310 123 4567 890
310 123 4567 890
123 4567
Phone Number Appears Invalid: 310 123 FUBAR
 
Roy Johnson

I thought that you made a few odd (either esoteric or not Lazy enough)
implementation decisions.

Purl Gurl said:
[...]You could easily use $_ throughout as well, but this
defeats "full" printing of an invalid phone number.

Instead of preserving $_ and working on $test_it, you could have saved
a copy and then worked on $_ itself.

You used s/[^\d+]//g instead of tr/0-9//dc to remove all non-digits.

You used tr/0-9// instead of length.
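
A short sketch of the two points above, using a made-up sample string. Note that [^\d+] is a character class that spares both digits and a literal plus sign, so the s/// version leaves any + in place, while tr/0-9//cd keeps digits only. Once only digits remain, length and a counting tr/0-9// agree.

```perl
#!perl
use strict;
use warnings;

my $input = "+1 (310) 123-4567";    # made-up sample

(my $by_subst = $input) =~ s/[^\d+]//g;    # keeps digits and "+"
(my $by_tr    = $input) =~ tr/0-9//cd;     # keeps digits only

print "s///: $by_subst\n";
print "tr:   $by_tr\n";

# On a digits-only string, length and a counting tr agree.
my $count = ($by_tr =~ tr/0-9//);
print "length: ", length($by_tr), ", tr count: $count\n";
```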

The use of the 4-argument version of substr() was neat, but a
judicious pattern match instead of length-checking makes for tighter
code:

while (<DATA>) {
    my $save = $_;
    tr/0-9//dc;
    if (/(...)?(...)(....)/) {
        printf "%3s %s %s %s\n", $1, $2, $3, $';
    }
    else {
        print "Invalid phone number: $save\n";
    }
}

Now let's go back to the issue of stripping all non-numerics. If you
do that, you can't distinguish 123-4567 x890 from (123) 456 7890.
Granted, when you dial, the phone doesn't know the difference, but
there may be some difference in how the person doing the dialing has
to behave.
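
To make the ambiguity concrete, here is a small sketch using the two inputs mentioned above. Once the non-digits are gone, the two strings are identical:

```perl
#!perl
use strict;
use warnings;

my @inputs = ('123-4567 x890', '(123) 456 7890');

# Strip everything except digits from a copy of each input.
my @stripped = map { (my $d = $_) =~ tr/0-9//cd; $d } @inputs;

print "$inputs[$_] -> $stripped[$_]\n" for 0 .. $#inputs;
print "indistinguishable after stripping\n" if $stripped[0] eq $stripped[1];
```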

If, instead of stripping the non-digits, you just look for groups of
digits (optional 3, then mandatory 3 and 4, then optional however
many) amongst the non-digits, you can address that:

#!perl
while (<DATA>) {
    my $save = $_;
    if (/^\D*(?:(\d{3})\D+)?(\d{3})\D+(\d{4})(?:\D+(\d+))?/) {
        printf "%3s %s %s %s\n", $1, $2, $3, $4;
    }
    else {
        print "Invalid phone number: $save\n";
    }
}

__DATA__
123-4567
123 4567
123 4567 x890 <-- note
(310) 123 4567
310-123-4567
310-123-4567 ext 890
310 123 4567 890
123-4567FUBAR
310 123 FUBAR


Output is:
123 4567
123 4567
123 4567 890
310 123 4567
310 123 4567
310 123 4567 890
310 123 4567 890
123 4567
Invalid phone number: 310 123 FUBAR
 
Gunnar Hjalmarsson

joemono said:
I wrote the following expression to take (hopefully) any
_reasonable_ phone number input, and format it as (999) 999-9999 x
9999.

Hi Joe,

I don't know the likelihood in your case that people outside the US
are asked to enter their phone numbers. The reason why I mention it is
that I have tried to enter my non-US number at quite a few US based
web sites, resulting in error messages...

So, from that experience, I'd say that strict phone number
checking is sometimes a really bad idea. ;-)

Gunnar
(Sweden)
 
Purl Gurl

Roy said:
Purl Gurl wrote in message
I thought that you made a few odd (either esoteric or not Lazy enough)
implementation decisions.

I have no interest in reading Code Cop Crap.

It is annoying to open an article only to discover
this type of troll mule manure you write.

Respond to the originating author as you should.

You are wasting your time and the time of readers.


Purl Gurl
 
Roy Johnson

Purl Gurl said:
I have no interest in reading Code Cop Crap.

Interesting. I have no interest in your critiques of my posts that
have nothing to do with Perl.

It's not "trolling" to point out that you're doing bizarre things when
straightforward methods are available. My code was much more clear
than yours, as well as being shorter.

delete $shoulder->{'chip'}
 
