Street address parsing in Python, again.

John Nagle · Jun 4, 2010

I'm still struggling with street address parsing in Python.
(Previous discussion:
http://www.velocityreviews.com/forums/t720759-usable-street-address-parser-in-python.html)

I need something good enough to reliably extract street name and number.
That gives me something I can match against databases.

There are several parsers available in Perl, and various online services that
have a street name database. The online parsers are good, but I need to encode
some big databases, and the online ones are either rate-limited or expensive.

The parser at PyParsing:

http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

seems to work on about 80% of addresses. Addresses with "pre-directionals"
and street types before the name seem to give the most trouble:

487 E. Middlefield Rd. -> streetnumber = 487, streetname = E. MIDDLEFIELD
487 East Middlefield Road -> streetnumber = 487, streetname = EAST MIDDLEFIELD
226 West Wayne Street -> streetnumber = 226, streetname = WEST WAYNE
(Those are all Verisign offices)

New Orchard Road -> streetnumber = , streetname = NEW
1 New Orchard Road -> streetnumber = 1, streetname = NEW
(IBM corporate HQ)

390 Park Avenue -> streetnumber = , streetname = 390
(Alcoa corporate HQ)

None of those addresses is exotic or a corner case, but they're
all mis-parsed.

There's a USPS standard on this which might be helpful.

http://pe.usps.com/text/pub28/28c2_003.html

That says "When parsing the Delivery Address Line into the individual
components, start from the right-most element of the address and work toward the
left. Place each element in the appropriate field until all address components
are isolated." PyParsing works left to right, and trying to do look-ahead to
achieve the effect of right-to-left isn't working. It may be necessary to split
the input, reverse the tokens, and write a parser that works in reverse.
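
To make the reversed-token idea concrete, here is a rough sketch in plain Python (no PyParsing). It is only an illustration under assumptions: the function name parse_street_line, the field names, and the tiny SUFFIXES and DIRECTIONALS tables are all made up for this example, and the real USPS tables are far larger. It consumes the street type from the right first, then takes the number and any pre-directional from the left, and treats whatever remains as the street name.

```python
import re

# Tiny illustrative lookup tables (assumption: USPS Publication 28 lists many more).
SUFFIXES = {"RD": "RD", "ROAD": "RD", "ST": "ST", "STREET": "ST",
            "AVE": "AVE", "AVENUE": "AVE"}
DIRECTIONALS = {"N": "N", "NORTH": "N", "S": "S", "SOUTH": "S",
                "E": "E", "EAST": "E", "W": "W", "WEST": "W"}


def parse_street_line(line):
    """Hypothetical right-to-left parse of a street address line."""
    tokens = [t.strip(".,").upper() for t in line.split()]
    tokens = [t for t in tokens if t]
    result = {"number": "", "predir": "", "name": "", "suffix": "", "postdir": ""}

    rev = tokens[::-1]  # right-most element first
    i = 0
    # A trailing directional counts as a post-directional only when a
    # street type sits immediately to its left.
    if len(rev) > 1 and rev[0] in DIRECTIONALS and rev[1] in SUFFIXES:
        result["postdir"] = DIRECTIONALS[rev[0]]
        i = 1
    # Consume at most one street type, working from the right.
    if i < len(rev) and rev[i] in SUFFIXES:
        result["suffix"] = SUFFIXES[rev[i]]
        i += 1

    rest = rev[i:]  # still reversed, so the left-most token is rest[-1]
    if rest and re.fullmatch(r"\d+[A-Z]?", rest[-1]):
        result["number"] = rest.pop()
    # Take a pre-directional only if something is left over for the name.
    if len(rest) > 1 and rest[-1] in DIRECTIONALS:
        result["predir"] = DIRECTIONALS[rest.pop()]

    result["name"] = " ".join(reversed(rest))
    return result


if __name__ == "__main__":
    for addr in ("487 E. Middlefield Rd.", "226 West Wayne Street",
                 "New Orchard Road", "1 New Orchard Road", "390 Park Avenue"):
        print(addr, "->", parse_street_line(addr))
```

Because the street type is taken from the right before anything else is examined, "Park" in "390 Park Avenue" and "New" in "New Orchard Road" stay in the street name. This does not handle unit numbers, fractional house numbers, or a street type before the name, so it is a starting point rather than a full parser.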

John Nagle

John Nagle · Jun 4, 2010

John said:
The parser at PyParsing:

http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

..Bad cases...
487 E. Middlefield Rd. -> streetnumber = 487, streetname = E. MIDDLEFIELD
487 East Middlefield Road -> streetnumber = 487, streetname = EAST MIDDLEFIELD
226 West Wayne Street -> streetnumber = 226, streetname = WEST WAYNE
New Orchard Road -> streetnumber = , streetname = NEW
1 New Orchard Road -> streetnumber = 1, streetname = NEW
390 Park Avenue -> streetnumber = , streetname = 390

Here's a system that gets all the above cases right: the USC Deterministic
Address Parser.

https://webgis.usc.edu/Services/AddressNormalization/Interactive/DeterministicNormalization.aspx

This will parse a street address line alone, without a city, state, or ZIP code,
so it's not using a big database. There's a technical paper

http://gislab.usc.edu/i/publications/gislabtr11.pdf

but it doesn't have that much detail. However, now we know a solution
exists. I've asked USC if they'll make the code available.

John Nagle
