Gilles Ganault said:
Hello
Some of the adresses are missing a space between the streetname and
the ZIP code, eg. "123 Main Street01159 Someville"
This problem appears very similar to the one you had in a previous episode,
where you were deleting <br /> in address contexts where it obviously should
have been treated as importantly as a comma or even (would you believe) a line
break.
The example botched output was "... St Johns WoodLondon ..." IIRC.
Prevention is better than cure; try to find out if your earlier code is causing
this problem.
The following regex doesn't seem to work:
Regexes do work. If the outcome is not what you expected, it is your
eexpectation-to-regex translator that is not working.
What does it do? Does it match zero addresses, all addresses, many addresses
that contain a 5-digit number /followed/ by a space, something else? Could you
use the answer to that question to narrow in on the problem with your regex?
#Check for any non-space before a five-digit number
re_bad_address = re.compile('([^\s].)(\d{5}) ',re.I | re.S | re.M)
The comment is quite incorrect. After removing the fog of useless parentheses,
the regex says:
[^\s] -- one non-whitespace character (better written as \S)
.. -- any character (more or less, see later) (why?)
\d{5} -- 5 digits
-- a space (why?)
Then there's a hail of flags:
re.I (ignore case) -- irrelevant
re.S (DOTALL) -- makes your pointless . match any character (instead of any
character except newline) Do you have any newlines in your addresses?
re.M (MULTILINE) -- I'm 99% sure you don't need this either.
I also tried ([^ ].), to no avail.
If not-whitespace doesn't match, changing it to not-space doesn't help.
What is the right way to tell the Python re module to check for any
non-space character?
r'[^ ]' -- but that's NOT the question you should be asking.
HTH,
John