How to write simple code to match strings?

beginner · Dec 30, 2009

Hi All,

I have run into a problem. I have a string s that can be one of a
number of possible things. I use regular-expression code like the one
below to match and parse it, but it looks very ugly. Also, each string
is literally matched twice, once for matching and once for extraction,
which seems very slow. Is there a better way to handle this?

import re
import dateutil.parser

def convert_data_item(s):
    if re.match('^\$?([-+]?[0-9,]*\.?[0-9,]+)$',s):
        g=re.match('^\$?([-+]?[0-9,]*\.?[0-9,]+)$',s)
        v=float(g.group(1).replace(',',''))
    elif re.match('^\(\$?([-+]?[0-9,]*\.?[0-9,]+)\)$',s):
        g=re.match('^\(\$?([-+]?[0-9,]*\.?[0-9,]+)\)$',s)
        v=-float(g.group(1).replace(',',''))
    elif re.match('^\d{1,2}-\w+-\d{1,2}$',s):
        v=dateutil.parser.parse(s, dayfirst=True)
    elif s=='-':
        v=None
    else:
        print("Unrecognized format %s" % s)
        v=s
    return v

Thanks,
Geoffrey

Steven D'Aprano · Dec 30, 2009

The most important thing you should do is to put the regular expressions
into named variables, rather than typing them out twice. The names
should, preferably, describe what they represent.

Oh, and you should use raw strings for regexes. In this particular
example, I don't think it makes a difference, but if you ever modify the
strings, it will!
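
To see where raw strings start to matter, here is a small sketch (the
sample text is made up for illustration). In an ordinary string literal,
'\b' is a single backspace character; in a raw string the backslash
survives and the regex engine reads it as a word boundary.

```python
import re

text = "a cat sat"  # made-up sample text

# Ordinary literal: '\b' is one character (backspace), so the pattern
# looks for backspace + "cat" + backspace and finds nothing.
plain = re.search('\bcat\b', text)

# Raw literal: the backslash and the 'b' both reach the regex engine,
# which treats the pair as a word boundary.
raw = re.search(r'\bcat\b', text)

print(len('\b'), len(r'\b'))
print(plain, raw.group(0))
```

Escapes such as \d, \w and \$ have no special meaning in ordinary string
literals, which is why the original patterns still work without the r
prefix.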

You should get rid of the unnecessary double calls to match. That's just
wasteful. Also, since re.match tests the start of the string, you don't
need the leading ^ regex (but you do need the $ to match the end of the
string).
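
A quick sketch of the anchoring point, using a made-up digits-only
pattern (NUMBER_RE is a hypothetical name, not one of the patterns from
this thread):

```python
import re

NUMBER_RE = r'\d+'  # hypothetical pattern, for illustration only

# re.match only anchors at the start, so trailing junk is accepted.
loose = re.match(NUMBER_RE, '123abc')

# Adding $ rejects the trailing junk.
strict = re.match(NUMBER_RE + '$', '123abc')

# A leading ^ is redundant with re.match: both calls agree.
with_caret = re.match('^' + NUMBER_RE + '$', '123')
without_caret = re.match(NUMBER_RE + '$', '123')

# $ also matches just before a trailing newline; \Z does not.
dollar_newline = re.match(NUMBER_RE + '$', '123\n')
z_newline = re.match(NUMBER_RE + r'\Z', '123\n')
```

If a trailing newline must be rejected as well, \Z is the stricter
choice.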

You should also fix the syntax error, where you have "elif s=='-'"
instead of "elif s='-'".

You should consider putting the cheapest test(s) first, or even moving
the expensive tests into a separate function.

And don't be so stingy with spaces in your source code, it helps
readability by reducing the density of characters.

So, here's my version:

def _re_match_items(s):
    # Setup some regular expressions.
    COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
    FLOAT_RE = COMMON_RE + '$'
    BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
    DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
    mo = re.match(FLOAT_RE, s) # "mo" short for "match object"
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Otherwise mo will be None and we go on to the next test.
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    if re.match(DATE_RE, s):
        return dateutil.parser.parse(s, dayfirst=True)
    raise ValueError("bad string can't be matched")

def convert_data_item(s):
    if s = '-':
        return None
    else:
        try:
            return _re_match_items(s)
        except ValueError:
            print("Unrecognized format %s" % s)
            return s

Hope this helps.

beginner · Dec 30, 2009

Hi Steve,

This definitely helps.

I don't know if it should be s=='-' or s='-'. I thought == means equal
and = means assignment?

Thanks again,
G

Steven D'Aprano · Dec 30, 2009

Er, you're absolutely right.

Sorry for that, that's an embarrassing brain-fart. I don't know what I
was thinking.
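
For the record, a minimal sketch of the difference. The helper name
is_placeholder is made up for this example. Python refuses a plain
assignment as the condition of an if statement, so the "=" version never
gets as far as running:

```python
# "=" is assignment and is not allowed as an if condition, so this
# snippet fails at compile time.
try:
    compile("if s = '-':\n    pass\n", "<example>", "exec")
    assignment_compiles = True
except SyntaxError:
    assignment_compiles = False

# "==" is comparison, which is what the test needs.
def is_placeholder(s):  # hypothetical helper name
    return s == '-'

print(assignment_compiles, is_placeholder('-'), is_placeholder('1.5'))
```

So the original "elif s=='-'" was correct, and the first line of
convert_data_item in my version should read "if s == '-':".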

Stefan Behnel · Dec 30, 2009

Steven D'Aprano, 30.12.2009 07:01:

def _re_match_items(s):
    # Setup some regular expressions.
    COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
    FLOAT_RE = COMMON_RE + '$'
    BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
    DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
    mo = re.match(FLOAT_RE, s) # "mo" short for "match object"
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Otherwise mo will be None and we go on to the next test.
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    if re.match(DATE_RE, s):
        return dateutil.parser.parse(s, dayfirst=True)
    raise ValueError("bad string can't be matched")

Given that this is meant for converting single data items, which may happen
quite frequently in a program (depending on the size of the input), you
might want to use pre-compiled regexps here.
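
A minimal sketch of the pre-compiled variant, restricted to the two
numeric formats so that it runs without dateutil. It assumes the
bracketed form is a number wrapped in parentheses; parse_number and the
*_MATCH names are hypothetical.

```python
import re

# Compiled once at import time instead of on every call.
COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_MATCH = re.compile(COMMON_RE + '$').match
BRACKETED_FLOAT_MATCH = re.compile(r'\(' + COMMON_RE + r'\)$').match

def parse_number(s):  # hypothetical helper, numbers only
    mo = FLOAT_MATCH(s)
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Assumption: "(1,000)" means a negative amount.
    mo = BRACKETED_FLOAT_MATCH(s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    raise ValueError("not a recognised number: %r" % s)

print(parse_number('$1,234.50'), parse_number('(12.5)'))
```

The re module keeps its own cache of recently compiled patterns, so the
gain may be modest; binding the compiled match method to a name mainly
saves the cache lookup on each call.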

Also, you can convert the above into a single regexp with multiple
alternative groups and then just run the matcher once, e.g. (untested):

COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_RE = COMMON_RE + '$'
BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
DATE_RE = r'(\d{1,2}-\w+-\d{1,2})$' # note the surrounding () I added

match_data_items = re.compile('|'.join(
    [BRACKETED_FLOAT_RE, FLOAT_RE, DATE_RE])).match

def convert_data_item(s):
    # ...
    match = match_data_items(s)
    if match:
        bfloat_value, float_value, date_value = match.groups()
        if bfloat_value:
            return -float(bfloat_value.replace(',', ''))
        if float_value:
            return float(float_value.replace(',', ''))
        if date_value:
            return dateutil.parser.parse(date_value, dayfirst=True)
    raise ...
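
A runnable sketch of the same single-regexp idea, this time with named
groups so that the branch can be read off match.lastgroup. The group
names and the classify helper are hypothetical, the bracketed form is
assumed to be a number in parentheses, and date parsing is left to the
caller so the example needs nothing outside the standard library.

```python
import re

# %s is filled in with a group name for each alternative.
NUMBER = r'\$?(?P<%s>[-+]?[0-9,]*\.?[0-9,]+)'

match_data_item = re.compile('|'.join([
    r'\(' + NUMBER % 'bracketed' + r'\)$',
    NUMBER % 'plain' + '$',
    r'(?P<date>\d{1,2}-\w+-\d{1,2})$',
])).match

def classify(s):  # hypothetical helper
    """Return (kind, value) for a recognised item, else raise ValueError."""
    mo = match_data_item(s)
    if mo is None:
        raise ValueError("unrecognised format: %r" % s)
    kind = mo.lastgroup  # name of the one group that took part
    text = mo.group(kind)
    if kind == 'date':
        return (kind, text)  # date parsing left to the caller
    value = float(text.replace(',', ''))
    if kind == 'bracketed':
        value = -value
    return (kind, value)

print(classify('(1,000)'), classify('$3.5'), classify('30-Dec-09'))
```

Only one alternative can match, so exactly one named group takes part
and lastgroup identifies it without testing each group in turn.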

Stefan
