regex matching question

  • Thread starter bullockbefriending bard

bullockbefriending bard

first, regex part:

I am new to regexes and have come up with the following expression:
((1[0-4]|[1-9]),(1[0-4]|[1-9])/){5}(1[0-4]|[1-9]),(1[0-4]|[1-9])

to exactly match strings which look like this:

1,2/3,4/5,6/7,8/9,10/11,12

i.e. 6 comma-delimited pairs of integer numbers separated by the
backslash character + constraint that numbers must be in range 1-14.

i should add that i am only interested in finding exact matches (doing
some kind of command line validation).

this seems to work fine, although i would welcome any advice about how
to shorten the above. it seems to me that there should exist some
shorthand for (1[0-4]|[1-9]) once i have defined it once?

also (and this is where my total beginner status brings me here
looking for help :)) i would like to add one more constraint to the
above regex. i want to match strings *iff* each pair of numbers are
different. e.g: 1,1/3,4/5,6/7,8/9,10/11,12 or
1,2/3,4/5,6/7,8/9,10/12,12 should fail to be matched by my final
regex whereas 1,2/3,4/5,6/7,8/9,10/11,12 should match OK.

any tips would be much appreciated - especially regarding preceding
paragraph!

and now for the python part:

results = "1,2/3,4/5,6/7,8/9,10/11,12"
match = re.match("((1[0-4]|[1-9]),(1[0-4]|[1-9])/){5}(1[0-4]|[1-9]),"
                 "(1[0-4]|[1-9])", results)
if match == None or match.group(0) != results:
    raise FormatError("Error in format of input string: %s" %
                      (results))
results = [leg.split(',') for leg in results.split('/')]
# => [['1', '2'], ['3', '4'], ['5', '6'], ['7', '8'], ['9', '10'],
#     ['11', '12']]
..
..
..
the idea in the above code being that i want to use the regex match as
a test of whether or not the input string (results) is correctly
formatted. if the string results is not exactly matched by the regex,
i want my program to barf an exception and bail out. apart from
whether or not the regex is good idiom, is my approach suitably
pythonic?

TIA for any help here.
 

Marc 'BlackJack' Rintsch

first, regex part:

I am new to regexes and have come up with the following expression:
((1[0-4]|[1-9]),(1[0-4]|[1-9])/){5}(1[0-4]|[1-9]),(1[0-4]|[1-9])

to exactly match strings which look like this:

1,2/3,4/5,6/7,8/9,10/11,12

i.e. 6 comma-delimited pairs of integer numbers separated by the
backslash character + constraint that numbers must be in range 1-14.

i should add that i am only interested in finding exact matches (doing
some kind of command line validation).

[…]

the idea in the above code being that i want to use the regex match as
a test of whether or not the input string (results) is correctly
formatted. if the string results is not exactly matched by the regex,
i want my program to barf an exception and bail out. apart from
whether or not the regex is good idiom, is my approach suitably
pythonic?

I would use a simple regular expression to extract "candidates" and a
Python function to split the candidate and check for the extra
constraints. Especially the "all pairs different" constraint is something
I would not even attempt to put in a regex. For searching candidates this
should be good enough::

r'(\d+,\d+/){5}\d+,\d+'
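A minimal sketch of that split between format and value checks, assuming Python 3. The helper name parse_results and the use of ValueError are illustrative and not from the thread, and a trailing \Z has been added to the candidate pattern so the whole string must match. Note that, unlike the original regex, this sketch accepts leading zeros such as "01".

```python
import re

# Format only: six "digits,digits" pairs separated by forward slashes.
CANDIDATE = re.compile(r'(\d+,\d+/){5}\d+,\d+\Z')

def parse_results(text):
    """Return six (a, b) int tuples, or raise ValueError (hypothetical helper)."""
    if not CANDIDATE.match(text):
        raise ValueError("Invalid format: %r" % text)
    # The shape is known to be good, so splitting and int() cannot fail here.
    pairs = [tuple(int(n) for n in leg.split(',')) for leg in text.split('/')]
    for a, b in pairs:
        # Value check done in plain code rather than in the regex.
        if not (1 <= a <= 14 and 1 <= b <= 14):
            raise ValueError("Number out of range 1-14: %r" % text)
    return pairs
```

Keeping the range check in code makes it easy to give a specific error message for each kind of failure.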

Ciao,
Marc 'BlackJack' Rintsch
 

bullockbefriending bard

thanks for your suggestion. i had already implemented the all pairs
different constraint in python code. even though i don't really need
to give very explicit error messages about what might be wrong with my
data (obviously easier to do if i do all constraint validation in code
rather than one regex), there is something to be said for your
suggestion to simplify my regex further - it might be sensible from a
maintainability/readability perspective to use regex for *format*
validation and then validate all *values* in code.

from my cursory skimming of friedl, i get the feeling that the all
pairs different constraint would give rise to some kind of fairly
baroque expression, perhaps likely to bring to mind the following
quotation from samuel johnson:

"Sir, a woman's preaching is like a dog's walking on his hind legs.
It is not done well; but you are surprised to find it done at all."

however, being human, sometimes some things should be done, just
because they can :)... so if anyone knows how to do it, i'm still
interested, even if just out of idle curiosity!
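For the curious, here is one way it could be done in a single pattern. This is only a sketch, assuming Python's re module: the names NUM, pair and PATTERN are made up for illustration. Building the pattern from a string variable also serves as the "shorthand" for the repeated 1-14 fragment. Each pair starts with a negative lookahead that captures the first number and rejects the pair if the same digits follow the comma.

```python
import re

# Define the 1-14 fragment once and reuse it by name.
NUM = r'(?:1[0-4]|[1-9])'

def pair(n):
    # Negative lookahead: fail if the first number (captured as group n)
    # appears again, in full, immediately after the comma.
    return r'(?!(\d+),\%d(?!\d))' % n + NUM + ',' + NUM

# Six pairs joined by forward slashes; \Z anchors the end of the string.
PATTERN = re.compile('/'.join(pair(n) for n in range(1, 7)) + r'\Z')

print(bool(PATTERN.match("1,2/3,4/5,6/7,8/9,10/11,12")))   # True
print(bool(PATTERN.match("1,1/3,4/5,6/7,8/9,10/11,12")))   # False
print(bool(PATTERN.match("1,2/3,4/5,6/7,8/9,10/12,12")))   # False
```

It works, but the result is much harder to read than a short loop over the split pairs, which supports the advice to keep value checks out of the regex.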

 

John Machin

first, regex part:

I am new to regexes and have come up with the following expression:
((1[0-4]|[1-9]),(1[0-4]|[1-9])/){5}(1[0-4]|[1-9]),(1[0-4]|[1-9])

to exactly match strings which look like this:

1,2/3,4/5,6/7,8/9,10/11,12

i.e. 6 comma-delimited pairs of integer numbers separated by the
backslash character + constraint that numbers must be in range 1-14.

Backslash? Your example uses a [forward] slash.

Are you sure you don't want to allow for some spaces in the data, for
the benefit of the humans, e.g.
1,2 / 3,4 / 5,6 / 7,8 / 9,10 / 11,12
?
i should add that i am only interested in finding exact matches (doing
some kind of command line validation).

this seems to work fine, although i would welcome any advice about how
to shorten the above. it seems to me that there should exist some
shorthand for (1[0-4]|[1-9]) once i have defined it once?

also (and this is where my total beginner status brings me here
looking for help :)) i would like to add one more constraint to the
above regex. i want to match strings *iff* each pair of numbers are
different. e.g: 1,1/3,4/5,6/7,8/9,10/11,12 or
1,2/3,4/5,6/7,8/9,10/12,12 should fail to be matched by my final
regex whereas 1,2/3,4/5,6/7,8/9,10/11,12 should match OK.

any tips would be much appreciated - especially regarding preceding
paragraph!

and now for the python part:

results = "1,2/3,4/5,6/7,8/9,10/11,12"
match = re.match("((1[0-4]|[1-9]),(1[0-4]|[1-9])/){5}(1[0-4]|[1-9]),"
                 "(1[0-4]|[1-9])", results)

Always use "raw" strings for patterns, even if you don't have
backslashes in them -- and this one needs a backslash; see below.

For clarity, consider using "mobj" or even "m" instead of "match" to
name the result of re.match.

if match == None or match.group(0) != results:

Instead of
if mobj == None ....
use
if mobj is None ...
or
if not mobj ...

Instead of the "or match.group(0) != results" caper, put \Z (*not* $) at
the end of your pattern:
mobj = re.match(r"pattern\Z", results)
if not mobj:
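
A small sketch of the difference, assuming Python 3 and using the simplified candidate pattern from earlier in the thread (the variable names are illustrative). Without re.MULTILINE, $ matches at the very end of the string and also just before a trailing newline, whereas \Z matches only at the very end.

```python
import re

pattern_dollar = re.compile(r'(\d+,\d+/){5}\d+,\d+$')
pattern_z = re.compile(r'(\d+,\d+/){5}\d+,\d+\Z')

good = "1,2/3,4/5,6/7,8/9,10/11,12"
sneaky = good + "\n"   # same data with a trailing newline

print(bool(pattern_dollar.match(good)))    # True
print(bool(pattern_z.match(good)))         # True
print(bool(pattern_dollar.match(sneaky)))  # True: $ tolerates the final newline
print(bool(pattern_z.match(sneaky)))       # False: \Z does not
```

On Python 3.4 and later, re.fullmatch is another way to insist that the whole string matches.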


HTH,
John
 

Gabriel Genellina

On Sat, 19 May 2007 19:40:39 -0300, bullockbefriending bard wrote:
from my cursory skimming of friedl, i get the feeling that the all
pairs different constraint would give rise to some kind of fairly
baroque expression, perhaps likely to bring to mind the following
quotation from samuel johnson:

"Sir, a woman's preaching is like a dog's walking on his hind legs.
It is not done well; but you are surprised to find it done at all."

Try this, it's not as hard, just using match and split (with the regular
expression proposed by MR):

import re
regex = re.compile(r'(\d+,\d+/){5}\d+,\d+')

def checkline(line):
    if not regex.match(line):
        raise ValueError("Invalid format: "+line)
    for pair in line.split("/"):
        a, b = pair.split(",")
        if a==b:
            raise ValueError("Duplicate number: "+line)

Here "all pairs different" means "for each pair, both numbers must be
different", but they may appear in another pair. That is, it won't flag
"1,2/3,4/3,5/2,6/8,3/1,2" as invalid, but this wasn't clear from your
original post.
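
To make the two readings concrete, here is a minimal sketch, assuming the line has already passed the format check. The function names are hypothetical and not from the thread.

```python
def pairs_of(line):
    # Assumes the format was validated first, so every leg has exactly one comma.
    return [tuple(leg.split(",")) for leg in line.split("/")]

def has_equal_pair(line):
    # Reading 1: some pair has the same number on both sides of the comma.
    return any(a == b for a, b in pairs_of(line))

def has_repeated_number(line):
    # Reading 2 (stricter): some number occurs more than once anywhere in the line.
    numbers = [n for pair in pairs_of(line) for n in pair]
    return len(set(numbers)) != len(numbers)

line = "1,2/3,4/3,5/2,6/8,3/1,2"
print(has_equal_pair(line))       # False
print(has_repeated_number(line))  # True
```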
 

bullockbefriending bard

Backslash? Your example uses a [forward] slash.

correct.. my mistake. i use forward slashes.
Are you sure you don't want to allow for some spaces in the data, for
the benefit of the humans, e.g.
1,2 / 3,4 / 5,6 / 7,8 / 9,10 / 11,12

you are correct. however, i am using string as a command line option
and can get away without quoting it if there are no optional spaces.
Always use "raw" strings for patterns, even if you don't have
backslashes in them -- and this one needs a backslash; see below.

knew this, but had not done so in my code because i wanted to use '\' as
a line continuation character to keep everything within 80 columns.
have adopted your advice regarding \Z below and now am using raw
string.
For clarity, consider using "mobj" or even "m" instead of "match" to
name the result of re.match.

good point.
Instead of
if mobj == None ....
use
if mobj is None ...
or
if not mobj ...

Instead of the "or match.group(0) != results" caper, put \Z (*not* $) at
the end of your pattern:
mobj = re.match(r"pattern\Z", results)
if not mobj:

HTH,
John

very helpful advice. thanks!
 

bullockbefriending bard

Instead of the "or match.group(0) != results" caper, put \Z (*not* $) at
the end of your pattern:
mobj = re.match(r"pattern\Z", results)
if not mobj:

as the string i am matching against is coming from a command line
argument to a script, is there any reason why i cannot get away with
just $ given that this means that there is no way a newline could find
its way into my string? certainly passes all my unit tests as well as
\Z. or am i missing the point of \Z ?
 

bullockbefriending bard

Here "all pairs different" means "for each pair, both numbers must be
different", but they may appear in another pair. That is, won't flag
"1,2/3,4/3,5/2,6/8,3/1,2" as invalid, but this wasn't clear from your
original post.

thanks! you are correct that the 'all pairs different' nomenclature is
ambiguous. i require that each pair have different values, but it is OK
for different pairs to be identical... so exactly as per your code
snippet.
 

John Machin

as the string i am matching against is coming from a command line
argument to a script, is there any reason why i cannot get away with
just $ given that this means that there is no way a newline could find
its way into my string?

No way? Famous last words :)

C:\junk>type showargs.py
import sys; print sys.argv

C:\junk>\python25\python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
['showargs.py', 'teehee\n']
0
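
A rough Python 3 sketch of the same point, not the original session: the child program here is an inline -c snippet standing in for showargs.py, and the variable names are illustrative. Any program that launches yours can put a newline inside an argument.

```python
import subprocess
import sys

# Inline stand-in for a script that just shows its arguments.
child_code = "import sys; print(repr(sys.argv[1:]))"

# Launch a child interpreter and pass an argument that ends in a newline.
result = subprocess.run(
    [sys.executable, "-c", child_code, "teehee\n"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```

The child reports the argument with the newline intact, so a pattern ending in $ would accept it.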

 

Steve Holden

John said:
as the string i am matching against is coming from a command line
argument to a script, is there any reason why i cannot get away with
just $ given that this means that there is no way a newline could find
its way into my string?

No way? Famous last words :)

C:\junk>type showargs.py
import sys; print sys.argv

C:\junk>\python25\python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
['showargs.py', 'teehee\n']
0

certainly passes all my unit tests as well as
\Z. or am i missing the point of \Z ?
The simple shell command

python prog.py "argument containing
a newline"

would suffice to reject the "no newlines" hypothesis in Unix-like systems.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
 

bullockbefriending bard

No way? Famous last words :)

C:\junk>type showargs.py
import sys; print sys.argv

C:\junk>\python25\python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
['showargs.py', 'teehee\n']

can't argue with that :) back to \Z
 
