Suggestion for a new regular expression extension


Nicolas LEHUEN

Hi,

I'm currently writing various regular expressions designed to help me parse
some real-world French postal addresses. The task is not easy due to the
vast number of abbreviations, misspellings and variations in addresses. Just
to give you a taste of what the regular expression looks like (unoptimized
and perfectible, but for now it performs well enough):

re_adresse = re.compile(r'''
(?P<street_number>\d+(?:[ /\-]\d+)?)?
\s*
(?:(?P<street_number_extension>
A
| B(?:IS)?
| C
| E
| F
| T(?:ER|RE)?
| Q(?:UATER)?
)\b)?
\s*
(?P<street_type>(?:
(?:G(?:DE?|RDE?|RANDE?)\s+)?R(?:UE)?
....... (snip) ....
| B(?:D|LD|VD|OUL(?:EVARD)?)
....... (snip) ....
)\b)?
(?:\s*(?P<street_name>.+))?
$
''',re.X)
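
For instance, on a well-formed input (and assuming the snipped alternatives
do not get in the way for this particular example), the captured groups come
out like this:

>>> m = re_adresse.match('15 BIS BD HAUSSMANN')
>>> m.group('street_number'), m.group('street_number_extension')
('15', 'BIS')
>>> m.group('street_type'), m.group('street_name')
('BD', 'HAUSSMANN')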

Note for example the many abbreviations (correct or not) of "boulevard":
BD, BLD, BVD, BOUL, BOULEVARD. For normalisation purposes, I need to
transform all those forms into the only correct abbreviation, BD.

What would be really, really neat would be a regular expression extension
notation that makes the RE engine return an arbitrary string when a
substring is matched. The standard parenthesis operator returns the matched
text, whereas this extension would return a text of your choosing whenever
it matches.

In my particular case it would be very handy, allowing me to tell the RE
engine to return "BD" when matching B(?:D|LD|VD|OUL(?:EVARD)?). For now,
without the extension, I need a two-pass process: first I "tokenize" the
address using the big regular expression cited above, then for each token I
normalize it using a duplicate of the relevant part of that regular
expression. This forces me to maintain two separate regular expression sets
and requires maybe twice the processing power, whereas with an appropriate
RE extension all this could be done in a single pass.
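
To give an idea of the duplication, the second pass currently looks roughly
like this (a minimal sketch; the table is a made-up fragment, the real one
repeats every alternation of the big expression):

import re

# Second-pass table: each entry duplicates an alternation that already
# exists in the big tokenizing expression above.
STREET_TYPE_FORMS = [
    (re.compile(r'B(?:D|LD|VD|OUL(?:EVARD)?)$'), 'BD'),
    (re.compile(r'AV(?:E(?:NUE)?)?$'), 'AV'),   # illustrative entry
]

def normalize_street_type(token):
    for pattern, canonical in STREET_TYPE_FORMS:
        if pattern.match(token):
            return canonical
    return token    # unknown form: leave it untouched

# normalize_street_type('BOULEVARD') -> 'BD'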

This extension would also be quite interesting for building transliterators,
especially if the returned value could include references to other captured
strings.

Let's say the extension would be written (?PR<text to return when the
parenthesis matches>regular expression), with P meaning <P>ython extension.
With the boulevard alternative wrapped in such a group, matching
"15 BOULEVARD HAUSSMANN" would then directly yield something like
('15','BD','HAUSSMANN').

Perhaps the rewriting expression could even include references to other
matched parentheses (but see the caveat below), so that one group's returned
text could be built from another group's match, giving results like
('14','4').

Maybe forward references would be too difficult to handle. The difficulty
with this would be how to handle an expression like (?R<\2>.+)(\1): throw an
exception? The simplest thing to do would be to allow only back references,
or only a reference to the current match of the parenthesis, with a notation
like \m; for instance, (?PR<$\m.00>\d+) matched against "1540" would return
'$1540.00'.

But anyway, references to other groups in the rewriting expression would
only be a plus. The core suggestion is just the rewrite extension.

I also considered using sre.Scanner to do this, but does anyone know what
the status of this class is? I made a few tests and it seems to work, but it
is still marked as 'experimental'. Why? The last reference I saw to this
class is here:
http://aspn.activestate.com/ASPN/Mail/Message/python-dev/1614505... So, is
this class good enough for common usage? Anyway, it wouldn't suffice here,
because I would need a Scanner for the full address using different
sub-Scanners for each address part...

Best regards,
Nicolas
 

Skip Montanaro

Nicolas> re_adresse = re.compile(r'''
... [big, ugly re snipped] ...
Nicolas> ''',re.X)

Nicolas> Note for example the many abbreviations (correct or not) of
Nicolas> "boulevard": BD, BLD, BVD, BOUL, BOULEVARD. For normalisation
Nicolas> purposes, I need to transform all those forms into the only
Nicolas> correct abbreviation, BD.

Nicolas> What would be really, really neat would be a regular
Nicolas> expression extension notation that would make the RE engine
Nicolas> return an arbitrary string when a substring is matched.

Why not just use named groups, then pass the match's groupdict() result
through a normalization function? Here's a trivial example which
"normalizes" some matches by replacing them with the lengths of the matched
strings:

>>> import re
>>> d = re.match(r'(?P<a>a+) (?P<b>b+)', 'aaaaaaaa bbb').groupdict()
>>> d['a'] = len(d['a'])
>>> d['b'] = len(d['b'])
>>> d
{'a': 8, 'b': 3}
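
Applied to your address problem, the same idea might look roughly like this
(the table and the sample address are only illustrative):

STREET_TYPES = {'BLD': 'BD', 'BVD': 'BD', 'BOUL': 'BD', 'BOULEVARD': 'BD'}

def normalize(match):
    d = match.groupdict()
    if d.get('street_type'):
        d['street_type'] = STREET_TYPES.get(d['street_type'], d['street_type'])
    return d

# normalize(re_adresse.match('15 BOULEVARD HAUSSMANN'))['street_type'] -> 'BD'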

Skip
 

Nicolas LEHUEN

Hi Skip,

Well, that's what I am doing now, since I cannot hold my breath until my
suggestion gets implemented :). But in my case it forces me to duplicate
each alternative of the big regexp in my normalisation function, which makes
maintenance of the whole piece of code quite tedious. It would feel much
more natural just to tell the RE engine "if you match
B(?:D|LD|VD|OUL(?:EVARD)?) within this big ugly regexp, just return BD,
please".

Anyway, I think I'm going to try using sre.Scanner; we'll see if it's stable
enough for that. I'll build 3 scanners that I'll call in sequence (each one
reusing the part of the string that was not scanned, handily returned as the
second element of the tuple returned by the 'scan' method):

- one for the number (or numbers) within the street: "14", or numbers like
"14-16" or "14/16" or whatever separator the person entering the address
could imagine.

- one for the number extension: "B" or "BIS", "T" or "TER" or "TRE"
(misspelled, but that's the way some people write it...)

- one for the street/place type: most of the tricky regexps are there, and
most of the rewriting will be performed by actions defined in the Scanner's
lexicon

- and the rest of the string is the street/place name.

This way the address will be processed in one pass, without code
duplication; a rough sketch of the street-type scanner is below.
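
As a rough idea, the street-type scanner could look something like this
(only the boulevard entry of the lexicon is shown):

import sre   # the same undocumented class is re.Scanner in later versions

street_type_scanner = sre.Scanner([
    (r'B(?:D|LD|VD|OUL(?:EVARD)?)\b', lambda scanner, token: 'BD'),
    (r'\s+', None),   # skip whitespace between tokens
])

# street_type_scanner.scan('BOULEVARD HAUSSMANN') -> (['BD'], 'HAUSSMANN')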

But still, this (?PR<...>...) notation would be handy. I had a look at the
sre source code, hoping that I would be able to implement it myself, but
it's a bit too much for me to handle right now ;).

Regards,

Nicolas

Skip Montanaro said:
Nicolas> re_adresse = re.compile(r'''
... [big, ugly re snipped] ...
Nicolas> ''',re.X)

Nicolas> Note for example the many abbreviations (correct or not) of
Nicolas> "boulevard": BD, BLD, BVD, BOUL, BOULEVARD. For normalisation
Nicolas> purposes, I need to transform all those forms into the only
Nicolas> correct abbreviation, BD.

Nicolas> What would be really, really neat would be a regular
Nicolas> expression extension notation that would make the RE engine
Nicolas> return an arbitrary string when a substring is matched.

Why not just use named groups, then pass the match's groupdict() result
through a normalization function? Here's a trivial example which
"normalizes" some matches by replacing them with the matched strings'
lengths.
>>> import re
>>> d = re.match(r'(?P<a>a+) (?P<b>b+)', 'aaaaaaaa bbb').groupdict()
>>> d['a'] = len(d['a'])
>>> d['b'] = len(d['b'])
>>> d
{'a': 8, 'b': 3}

Skip
 

Skip Montanaro

Nicolas> Anyway, I think I'm going to try using sre.Scanner, ...

How about re.finditer() or re.findall()?
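
Even with a toy token pattern, finditer() hands you one match object per
token, which you can then feed through a normalization function:

>>> import re
>>> [m.group(0) for m in re.finditer(r'\d+|[A-Z]+', '15 BOULEVARD HAUSSMANN')]
['15', 'BOULEVARD', 'HAUSSMANN']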

Skip
 

darrell

If you can make re.sub work for you...

import re

pat = "B(?:D|LD|VD|OUL(?:EVARD)?)"
input = "BD, BLD, BVD, BOUL, BOULEVARD.".split()

for i in input:
    print i, "::", re.sub(pat, "BD", i)

BD, :: BD,
BLD, :: BD,
BVD, :: BD,
BOUL, :: BD,
BOULEVARD. :: BD.


For more complex cases you can pass a function instead of the
replacement string.
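
For example (the replacement table here is only a sketch):

>>> import re
>>> abbrev = {'BLD': 'BD', 'BVD': 'BD', 'BOUL': 'BD', 'BOULEVARD': 'BD'}
>>> def canonical(m):
...     word = m.group(0)
...     return abbrev.get(word, word)
...
>>> re.sub(r'[A-Z]+', canonical, '15 BOULEVARD HAUSSMANN')
'15 BD HAUSSMANN'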


--Darrell
 
