John Nagle
I'm working on street address parsing again, and I'm trying to deal
with some of the harder cases.
Here's a subparser, intended to take in things like "N MAIN" and
"SOUTH", and break out the "directional" from the street name.
    from pyparsing import (CaselessKeyword, Combine, MatchFirst, OneOrMore,
                           Optional, Word, alphanums)

    directionals = ['southeast', 'northeast', 'north', 'northwest',
        'west', 'east', 'south', 'southwest', 'SE', 'NE', 'N', 'NW',
        'W', 'E', 'S', 'SW']
    direction = Combine(MatchFirst(map(CaselessKeyword, directionals)) +
        Optional(".").suppress())
    # Parenthesized so the expression can continue across lines.
    streetNameParser = (Optional(direction.setResultsName("predirectional"))
        + Combine(OneOrMore(Word(alphanums)),
            adjacent=False, joinString=" ").setResultsName("streetname"))
This parses something like "N WEBB" fine; "N" is the "predirectional",
and "WEBB" is the street name.
"SOUTH" (which, when not followed by another word, is a streetname,
not a predirectional), raises a parsing exception:
    Street address line parse failed for SOUTH : Expected W:(abcd...)
    (at char 5), (line:1, col:6)
The problem is that "direction" matched SOUTH, and even though
"direction" is within an "Optional" and is followed by another
expression, the parser didn't back up and retry without the optional
part when it hit the end of the input without satisfying the
OneOrMore clause.
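For anyone who wants to reproduce this, here is a minimal, self-contained sketch. It assumes pyparsing is installed. The names street_words, street_name_parser and the helper try_parse are made up for this example, and the directional list is cut down to a few entries.

```python
from pyparsing import (CaselessKeyword, Combine, MatchFirst, OneOrMore,
                       Optional, ParseException, Word, alphanums)

# Shortened directional list, just for the demonstration.
directionals = ['north', 'south', 'east', 'west', 'N', 'S', 'E', 'W']

direction = Combine(MatchFirst([CaselessKeyword(d) for d in directionals])
                    + Optional(".").suppress())
# One or more words, joined back together with single spaces.
street_words = Combine(OneOrMore(Word(alphanums)), " ", adjacent=False)

# Same shape as the grammar above: optional directional, then the name.
street_name_parser = (Optional(direction("predirectional"))
                      + street_words("streetname"))


def try_parse(text):
    """Return the named results as a dict, or None if parsing fails."""
    try:
        return street_name_parser.parseString(text).asDict()
    except ParseException:
        return None


print(try_parse("N WEBB"))   # parses: predirectional "N", streetname "WEBB"
print(try_parse("SOUTH"))    # fails, so try_parse returns None
```

The second call fails because the Optional has already consumed "SOUTH" as a directional, leaving nothing for OneOrMore to match.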
Pyparsing does some backup, but I'm not clear on how much,
or how to force it to happen. There's some discussion at
"http://www.mail-archive.com/[email protected]/msg169559.html".
Apparently the "Or" operator will force some backup, but it's not
clear how much lookahead and backtracking is supported.
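Going by that, one thing to try is spelling out both cases as explicit alternatives, so that the second one is attempted from the same starting position when the first one fails. The following is only a sketch under that assumption. It uses the "|" operator (MatchFirst) rather than "^" (Or); the names with_prefix, street_name_parser and split_street are hypothetical; and the directional list is shortened. It has only been thought through for the two inputs shown, not for the harder address cases.

```python
from pyparsing import (CaselessKeyword, Combine, MatchFirst, OneOrMore,
                       Optional, Word, alphanums)

# Shortened directional list, just for the demonstration.
directionals = ['north', 'south', 'east', 'west', 'N', 'S', 'E', 'W']

direction = Combine(MatchFirst([CaselessKeyword(d) for d in directionals])
                    + Optional(".").suppress())
# One or more words, joined back together with single spaces.
street_words = Combine(OneOrMore(Word(alphanums)), " ", adjacent=False)

# Alternative 1: a directional that is followed by at least one more word.
with_prefix = direction("predirectional") + street_words("streetname")
# Alternative 2: no predirectional; everything is the street name.
street_name_parser = with_prefix | street_words("streetname")


def split_street(text):
    """Return (predirectional or None, streetname) for one input string."""
    result = street_name_parser.parseString(text, parseAll=True)
    return result.get("predirectional"), result["streetname"]


print(split_street("N WEBB"))   # ('N', 'WEBB')
print(split_street("SOUTH"))    # (None, 'SOUTH')
```

With this arrangement, the failure of the first alternative on "SOUTH" is not fatal: the alternation restarts at the beginning of the input with the second alternative. This does not show how far pyparsing backs up in general, only that an explicit alternation gets a retry.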
John Nagle