How to write simple code to match strings?

beginner

Hi All,

I've run into a problem. I have a string s that can be one of several
possible things. I use regular-expression code like the below to match
and parse it, but it looks very ugly. Also, each string is literally
matched twice -- once for matching and once for extraction -- which
seems very slow. Is there a better way to handle this?


def convert_data_item(s):
    if re.match('^\$?([-+]?[0-9,]*\.?[0-9,]+)$',s):
        g=re.match('^\$?([-+]?[0-9,]*\.?[0-9,]+)$',s)
        v=float(g.group(1).replace(',',''))
    elif re.match('^\(\$?([-+]?[0-9,]*\.?[0-9,]+)\)$',s):
        g=re.match('^\(\$?([-+]?[0-9,]*\.?[0-9,]+)\)$',s)
        v=-float(g.group(1).replace(',',''))
    elif re.match('^\d{1,2}-\w+-\d{1,2}$',s):
        v=dateutil.parser.parse(s, dayfirst=True)
    elif s=='-':
        v=None
    else:
        print "Unrecognized format %s" % s
        v=s
    return v

Thanks,
Geoffrey
 
Steven D'Aprano

The most important thing you should do is to put the regular expressions
into named variables, rather than typing them out twice. The names
should, preferably, describe what they represent.

Oh, and you should use raw strings for regexes. In this particular
example, I don't think it makes a difference, but if you ever modify the
strings, it will!
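
To make the raw-string point concrete, here is a small sketch (not from the original posts; the variable names are made up for illustration). It shows one escape where the difference matters: in an ordinary string `\b` is a backspace character, while in a raw string the regex engine gets to read it as a word boundary.

```python
import re

# In a normal string literal, '\b' is the backspace character (ASCII 8),
# so the regex engine never sees a word-boundary escape.
plain_pattern = '\bcat\b'

# In a raw string the backslashes survive untouched, and the regex
# engine interprets \b as "word boundary".
raw_pattern = r'\bcat\b'

text = 'the cat sat'
plain_hit = re.search(plain_pattern, text)  # looks for backspace + cat + backspace
raw_hit = re.search(raw_pattern, text)      # looks for the whole word "cat"
```

Here `plain_hit` is None and `raw_hit` matches "cat", which is why the habit of always writing regexes as raw strings is worth picking up even when a particular pattern happens to work without it.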

You should get rid of the unnecessary double calls to match. That's just
wasteful. Also, since re.match tests the start of the string, you don't
need the leading ^ regex (but you do need the $ to match the end of the
string).
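
As a quick illustration of the anchoring behaviour described above, here is a minimal sketch (the pattern and variable names are hypothetical, chosen only for the example):

```python
import re

NUMBER_RE = r'[0-9]+'

# re.match anchors only at the start, so trailing junk is accepted...
loose = re.match(NUMBER_RE, '123abc')

# ...unless the pattern ends with $.
strict = re.match(NUMBER_RE + '$', '123abc')

# A leading ^ is redundant with re.match: both calls behave the same.
with_caret = re.match('^' + NUMBER_RE + '$', '123')
without_caret = re.match(NUMBER_RE + '$', '123')

# Caveat: $ also matches just before a trailing newline.
newline_ok = re.match(NUMBER_RE + '$', '123\n')
```

`loose` matches "123", `strict` is None, and the two caret variants give the same result. If the trailing-newline behaviour of `$` is a concern, `re.fullmatch` (available from Python 3.4) is one alternative.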

You should also fix the syntax error, where you have "elif s=='-'"
instead of "elif s='-'".

You should consider putting the cheapest test(s) first, or even moving
the expensive tests into a separate function.

And don't be so stingy with spaces in your source code; they help
readability by reducing the density of characters.

So, here's my version:

def _re_match_items(s):
    # Setup some regular expressions.
    COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
    FLOAT_RE = COMMON_RE + '$'
    BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
    DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
    mo = re.match(FLOAT_RE, s)  # "mo" short for "match object"
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Otherwise mo will be None and we go on to the next test.
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    if re.match(DATE_RE, s):
        return dateutil.parser.parse(s, dayfirst=True)
    raise ValueError("bad string can't be matched")


def convert_data_item(s):
    if s = '-':
        return None
    else:
        try:
            return _re_match_items(s)
        except ValueError:
            print "Unrecognized format %s" % s
            return s



Hope this helps.
 
beginner

Hi Steve,

This definitely helps.

I don't know if it should be s=='-' or s='-'. I thought == means equal
and = means assignment?

Thanks again,
G
 
Steven D'Aprano

def convert_data_item(s):
    if s = '-':
[...]
I don't know if it should be s=='-' or s='-'. I thought == means equal
and = means assignment?

Er, you're absolutely right.

Sorry for that, that's an embarrassing brain-fart. I don't know what I
was thinking.
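
For anyone reading along, the difference can be checked directly. This is a small sketch (the helper name `is_dash` is made up for illustration): `==` compares, while a bare `=` inside an `if` condition is rejected by Python as a syntax error before anything runs.

```python
def is_dash(s):
    # == compares two values and yields True or False.
    return s == '-'


# A bare = inside a condition is rejected at compile time, so the
# function containing it could never even be defined.
try:
    compile("if s = '-':\n    pass\n", "<example>", "exec")
    assignment_in_if_compiles = True
except SyntaxError:
    assignment_in_if_compiles = False
```

After this runs, `assignment_in_if_compiles` is False, so the original `elif s=='-'` was correct.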
 
Stefan Behnel

Steven D'Aprano, 30.12.2009 07:01:
def _re_match_items(s):
    # Setup some regular expressions.
    COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
    FLOAT_RE = COMMON_RE + '$'
    BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
    DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
    mo = re.match(FLOAT_RE, s)  # "mo" short for "match object"
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Otherwise mo will be None and we go on to the next test.
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    if re.match(DATE_RE, s):
        return dateutil.parser.parse(s, dayfirst=True)
    raise ValueError("bad string can't be matched")

Given that this is meant for converting single data items, which may happen
quite frequently in a program (depending on the size of the input), you
might want to use pre-compiled regexps here.
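
A minimal sketch of what pre-compiling could look like, reusing the patterns from the thread. The function name `parse_amount` is hypothetical, and the date branch is left out to keep the example free of third-party imports.

```python
import re

# Compile once at import time instead of on every call.
COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT = re.compile(COMMON_RE + '$')
BRACKETED_FLOAT = re.compile(r'\(' + COMMON_RE + r'\)$')


def parse_amount(s):
    """Return a float for '$1,234.50' or '(1,234.50)', else None."""
    mo = FLOAT.match(s)
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Bracketed amounts are treated as negative, as in the original code.
    mo = BRACKETED_FLOAT.match(s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    return None
```

Note that the `re` module keeps an internal cache of recently compiled patterns, so the saving is mostly the cache lookup per call; how much that matters depends on how often the function runs.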

Also, you can convert the above into a single regexp with multiple
alternative groups and then just run the matcher once, e.g. (untested):

COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_RE = COMMON_RE + '$'
BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
DATE_RE = r'(\d{1,2}-\w+-\d{1,2})$' # note the surrounding () I added

match_data_items = re.compile('|'.join(
    [BRACKETED_FLOAT_RE, FLOAT_RE, DATE_RE])).match

def convert_data_item(s):
    # ...
    match = match_data_items(s)
    if match:
        bfloat_value, float_value, date_value = match.groups()
        if bfloat_value:
            return -float(bfloat_value.replace(',', ''))
        if float_value:
            return float(float_value.replace(',', ''))
        if date_value:
            return dateutil.parser.parse(date_value, dayfirst=True)
    raise ...

Stefan
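
One possible variation on the single-regexp idea, sketched here with named groups so that no group positions need to be counted. This is not from the posts above: the names `DATA_ITEM` and `classify_data_item` are made up, and the date text is returned unparsed (it could be handed to dateutil.parser.parse as in the earlier code) so that the example needs only the standard library.

```python
import re

# One alternative per format; each alternative has exactly one named group.
DATA_ITEM = re.compile(
    r'\(\$?(?P<bracketed>[-+]?[0-9,]*\.?[0-9,]+)\)$'
    r'|\$?(?P<plain>[-+]?[0-9,]*\.?[0-9,]+)$'
    r'|(?P<date>\d{1,2}-\w+-\d{1,2})$'
)


def classify_data_item(s):
    """Return (kind, value), where kind names the alternative that matched."""
    mo = DATA_ITEM.match(s)
    if mo is None:
        raise ValueError("unrecognized format: %r" % s)
    kind = mo.lastgroup   # name of the one group that took part in the match
    text = mo.group(kind)
    if kind == 'bracketed':
        return kind, -float(text.replace(',', ''))
    if kind == 'plain':
        return kind, float(text.replace(',', ''))
    return kind, text     # date text, left for the caller to parse
```

For example, `classify_data_item('(1,000)')` gives `('bracketed', -1000.0)` and `classify_data_item('30-Dec-09')` gives `('date', '30-Dec-09')`. The lone '-' case from the original function would still need its own check before calling this.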