[regex] How to check for non-space character?

Gilles Ganault

Hello

Some of the addresses are missing a space between the street name and
the ZIP code, e.g. "123 Main Street01159 Someville"

The following regex doesn't seem to work:

#Check for any non-space before a five-digit number
re_bad_address = re.compile('([^\s].)(\d{5}) ',re.I | re.S | re.M)

I also tried ([^ ].), to no avail.

What is the right way to tell the Python re module to check for any
non-space character?

Thank you.
 
Tim Chase

Gilles said:
Hello

Some of the addresses are missing a space between the street name and
the ZIP code, e.g. "123 Main Street01159 Someville"

The following regex doesn't seem to work:

#Check for any non-space before a five-digit number
re_bad_address = re.compile('([^\s].)(\d{5}) ',re.I | re.S | re.M)
-----------------------------------^


I also tried ([^ ].), to no avail.
------------------^

What is the right way to tell the Python re module to check for any
non-space character?

It looks like it's these periods that are throwing you off. Just
remove them. For a 3rd syntax:

(\S)(\d{5})

the \S (capital, instead of "\s") is "any NON-white-space character"
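
As a minimal sketch of that suggestion, the snippet below uses the (\S)(\d{5}) pattern both to spot a glued-on ZIP code and to put the space back with re.sub. The helper name fix_address is made up for illustration, and the sample strings are the one from the question plus a corrected copy of it.

```python
import re

# \S is one non-whitespace character; \d{5} is the five-digit ZIP code.
re_bad_address = re.compile(r'(\S)(\d{5})')

def fix_address(address):
    """Hypothetical helper: insert the missing space before a glued-on ZIP code."""
    # \1 is the character before the ZIP code, \2 is the ZIP code itself.
    return re_bad_address.sub(r'\1 \2', address)

print(fix_address("123 Main Street01159 Someville"))
# 123 Main Street 01159 Someville

print(fix_address("123 Main Street 01159 Someville"))
# already fine, so unchanged: 123 Main Street 01159 Someville
```

One caveat: a digit counts as a non-whitespace character, so a run of six or more digits (say, a house number like "123456") would also match and be split. If the data contains such numbers, the pattern needs tightening.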

-tkc
 
John Machin

Gilles Ganault said:
Hello

Some of the addresses are missing a space between the street name and
the ZIP code, e.g. "123 Main Street01159 Someville"

This problem appears very similar to the one you had in a previous episode,
where you were deleting <br /> in address contexts where it obviously should
have been treated as importantly as a comma or even (would you believe) a line
break.

The example botched output was "... St Johns WoodLondon ..." IIRC.

Prevention is better than cure; try to find out if your earlier code is causing
this problem.

The following regex doesn't seem to work:

Regexes do work. If the outcome is not what you expected, it is your
expectation-to-regex translator that is not working.

What does it do? Does it match zero addresses, all addresses, many addresses
that contain a 5-digit number /followed/ by a space, something else? Could you
use the answer to that question to narrow in on the problem with your regex?

#Check for any non-space before a five-digit number
re_bad_address = re.compile('([^\s].)(\d{5}) ',re.I | re.S | re.M)

The comment is quite incorrect. After removing the fog of useless parentheses,
the regex says:
[^\s] -- one non-whitespace character (better written as \S)
. -- any character (more or less, see later) (why?)
\d{5} -- 5 digits
' ' -- a space (why?)
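
Read piece by piece like that, the pattern wants a non-whitespace character, then any character at all, then five digits, then a space. A quick way to see what that means in practice is to run it against one good and one bad address. This is only a sketch: the "good" string is the question's example with the space put back, and the name simplified is made up here.

```python
import re

# The pattern exactly as posted, flags included.
original = re.compile(r'([^\s].)(\d{5}) ', re.I | re.S | re.M)
# The stripped-down version: one non-whitespace character, then five digits.
simplified = re.compile(r'\S\d{5}')

good = "123 Main Street 01159 Someville"
bad = "123 Main Street01159 Someville"

for label, text in (("good", good), ("bad", bad)):
    print(label, bool(original.search(text)), bool(simplified.search(text)))
# good True False
# bad True True
```

The posted pattern matches both strings, because the extra "." happily matches the space in the good address, so it cannot tell the two apart. The simplified pattern matches only the bad one.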

Then there's a hail of flags:
re.I (ignore case) -- irrelevant
re.S (DOTALL) -- makes your pointless . match any character (instead of any
character except newline). Do you have any newlines in your addresses?
re.M (MULTILINE) -- I'm 99% sure you don't need this either.

I also tried ([^ ].), to no avail.

If not-whitespace doesn't match, changing it to not-space doesn't help.

What is the right way to tell the Python re module to check for any
non-space character?

r'[^ ]' -- but that's NOT the question you should be asking.
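
For what it's worth, [^ ] and \S are not interchangeable: [^ ] excludes only the space character, while \S excludes all whitespace. The sketch below shows the difference on a made-up address where a tab separates the street name from the ZIP code; the variable names are illustrative only.

```python
import re

not_space = re.compile(r'[^ ]\d{5}')      # anything except a literal space
not_whitespace = re.compile(r'\S\d{5}')   # anything except whitespace

# Hypothetical input: a tab, not a space, before the ZIP code.
tab_separated = "123 Main Street\t01159 Someville"

print(bool(not_space.search(tab_separated)))       # True: a tab is "not a space"
print(bool(not_whitespace.search(tab_separated)))  # False: a tab is whitespace
```

So if tabs or newlines can appear in the data, [^ ] would flag addresses that already have a separator.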

HTH,
John
 
Gilles Ganault

Tim Chase said:
It looks like it's these periods that are throwing you off. Just
remove them. For a 3rd syntax:

(\S)(\d{5})

the \S (capital, instead of "\s") is "any NON-white-space character"

Thanks guys for the tips.
 
