Regex driving me crazy...

J · Apr 7, 2010

Can someone make me un-crazy?

I have a bit of code that right now, looks like this:

status = getoutput('smartctl -l selftest /dev/sda').splitlines()[6]
status = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":",status)
print status

Basically, it pulls the first actual line of data from the return you
get when you use smartctl to look at a hard disk's selftest log.

The raw data looks like this:

# 1  Short offline       Completed without error       00%       679         -

Unfortunately, all that whitespace is arbitrary single space
characters. And I am interested in the string that appears in the
third column, which changes as the test runs and then completes. So
in the example, "Completed without error"

The regex I have up there doesn't quite work, as it seems to be
subbing EVERY space (or at least in instances of more than one space)
to a ':' like this:

# 1: Short offline:::::: Completed without error:::::: 00%:::::: 679:::::::: -
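
A minimal sketch of why the substitution behaves this way, written for Python 3 rather than the Python 2 used in the posts. It assumes the sample row contains runs of 2, 7, 7, 7 and 9 spaces, which is inferred from the colon output above. The first lookahead matches every single space that is followed by another space, so a run of n spaces becomes n-1 colons plus one leftover space. The second lookahead only excludes spaces inside double quotes, and this row has none. Matching each whole run at once gives one delimiter per column gap instead.

```python
import re

# Sample row; the run lengths (2, 7, 7, 7, 9 spaces) are an assumption
# inferred from the colon-separated output shown in the post.
line = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
        "Completed without error" + " " * 7 + "00%" + " " * 7 +
        "679" + " " * 9 + "-")

# Original pattern: replaces ONE space at a time, whenever another space follows.
per_space = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":", line)

# Alternative: treat each whole run of two or more spaces as one delimiter.
per_run = re.sub(" {2,}", ":", line)

print(per_space)
print(per_run)
print(per_run.split(":")[2])  # third column
```

The `:` delimiter is only safe while no column contains a colon; splitting directly with `re.split(" {2,}", line)` avoids choosing a delimiter at all.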

Ultimately, what I'm trying to do is either replace any run of more than
one space with a delimiter, then split the result into a list and
get the third item.

OR, if there's a smarter, shorter, or better way of doing it, I'd love to know.

The end result should pull the whole string in the middle of that
output line, and then I can use that to compare to a list of possible
output strings to determine if the test is still running, has
completed successfully, or failed.
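
One possible shape for that comparison step, as a Python 3 sketch. The status strings and the `classify` helper are illustrative assumptions, not a complete or authoritative list; the real strings should be taken from the smartctl version in use.

```python
# Hypothetical status lists for illustration only; fill these in from the
# actual smartctl output or source for the version being used.
PASSED = {"Completed without error"}
RUNNING_PREFIXES = ("Self-test routine in progress",)

def classify(status):
    """Map a status column string to 'passed', 'running', or 'failed'."""
    status = status.strip()
    if status in PASSED:
        return "passed"
    if status.startswith(RUNNING_PREFIXES):
        return "running"
    # Anything unrecognised is reported as a failure so it gets looked at.
    return "failed"

print(classify("Completed without error"))
```

Treating unknown strings as failures is a deliberate choice here: a new or misspelled status then shows up as a problem instead of being silently counted as a pass.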

Unfortunately, my google-fu fails right now, and my Regex powers were
always rather weak anyway...

So any ideas on what the best way to proceed with this would be?

Grant Edwards · Apr 7, 2010

Can someone make me un-crazy?

Definitely. Regex is driving you crazy, so don't use a regex.

inputString = "# 1 Short offline Completed without error 00% 679 -"

print ' '.join(inputString.split()[4:-3])

So any ideas on what the best way to proceed with this would be?

Anytime you have a problem with a regex, the first thing you should
ask yourself: "do I really, _really_ need a regex?"

Hint: the answer is usually "no".

Patrick Maupin · Apr 8, 2010

You mean like this?

>>> import re
>>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

Regards,
Pat

Patrick Maupin · Apr 8, 2010

Anytime you have a problem with a regex, the first thing you should
ask yourself: "do I really, _really_ need a regex?"

Hint: the answer is usually "no".

OK, fine. Post a better solution to this problem than:

>>> import re
>>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

Regards,
Pat

Patrick Maupin · Apr 8, 2010

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible, so I
will present it as well, for your viewing pleasure:

>>> [x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']

Regards,
Pat

James Stroud · Apr 8, 2010

Patrick said:
BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

Thank you,
James

Patrick Maupin · Apr 8, 2010

James said:
Patrick said:

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

I most certainly will not consider when that isn't warranted!

OTOH, in general I am more interested in admonishing the authors of
the pseudo-answers than I am the authors of the questions, despite the
fact that I find this hilarious:

http://despair.com/cluelessness.html

Regards,
Pat

Grant Edwards · Apr 8, 2010

Can someone make me un-crazy?

Definitely. Regex is driving you crazy, so don't use a regex.

inputString = "# 1  Short offline       Completed without error       00%       679         -"

print ' '.join(inputString.split()[4:-3])

[...]

OK, fine. Post a better solution to this problem than:

>>> import re
>>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

OK, I'll bite: what's wrong with the solution I already posted?

Grant Edwards · Apr 8, 2010

James said:
Patrick said:

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

I will certainly try to avoid not answering in a manner not unlike that.

Patrick Maupin · Apr 8, 2010

Can someone make me un-crazy?

Definitely. Regex is driving you crazy, so don't use a regex.

inputString = "# 1  Short offline       Completed without error       00%       679         -"

print ' '.join(inputString.split()[4:-3])

[...]

OK, fine. Post a better solution to this problem than:

>>> import re
>>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

OK, I'll bite: what's wrong with the solution I already posted?

Sorry, my eyes completely missed your one-liner, so my criticism about
not posting a solution was unwarranted. I don't think you and I read
the problem the same way (which is probably why I didn't notice your
solution -- because it wasn't solving the problem I thought I saw).

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

I guess my base assumption was that anything with words in it could
change. I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

(And I got testy because of seeing other IMO unwarranted denigration
of re on the list lately.)

Regards,
Pat

Steven D'Aprano · Apr 8, 2010

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do,

Grant did give a perfectly good solution.

and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible, so I
will present it as well, for your viewing pleasure:

>>> [x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']

This is one of the reasons we're so often suspicious of re solutions:

>>> import re
>>> from timeit import Timer
>>> s = '# 1  Short offline       Completed without error       00%'
>>> tre = Timer("re.split(' {2,}', s)",
...     "import re; from __main__ import s")
>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
...     "from __main__ import s")
>>>
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>>
>>> min(tre.repeat(repeat=5))
6.1224789619445801
>>> min(tsplit.repeat(repeat=5))
1.8338048458099365

Even when they are correct and not unreadable line-noise, regexes tend to
be slow. And they get worse as the size of the input increases:

0.41538596153259277
4.6444299221038818

And this isn't even one of the pathological O(N**2) or O(2**N) regexes.

Don't get me wrong -- regexes are a useful tool. But if your first
instinct is to write a regex, you're doing it wrong.

[quote]
A related problem is Perl's over-reliance on regular expressions
that is exaggerated by advocating regex-based solution in almost
all O'Reilly books. The latter until recently were the most
authoritative source of published information about Perl.

While simple regular expression is a beautiful thing and can
simplify operations with string considerably, overcomplexity in
regular expressions is extremly dangerous: it cannot serve a basis
for serious, professional programming, it is fraught with pitfalls,
a big semantic mess as a result of outgrowing its primary purpose.
Diagnostic for errors in regular expressions is even weaker then
for the language itself and here many things are just go unnoticed.
[end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml

Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html

J · Apr 8, 2010

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

I guess my base assumption was that anything with words in it could
change. I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

I apologize for the confusion, Pat...

I could have worded that better, but at that point I was A:
Frustrated, B: starving, and C: had my wife nagging me to stop working
to come get something to eat ;-)

What I meant was, in that output string, the phrase in the middle
could change in length...
After looking at the source code for smartctl (part of the
smartmontools package for you linux people) I found the switch that
creates those status messages.... they vary in character length, some
with non-text characters like ( and ) and /, and have either 3 or 4
words...

The spaces between each column, instead of being a fixed number of
spaces each, were seemingly arbitrarily created... there may be 4
spaces between two columns or there may be 9, or 7 or who knows what,
and since they were all being treated as individual spaces instead of
tabs or something, I was having trouble splitting the output into
something that was easy to parse (at least in my mind it seemed that
way).

Anyway, that's that... and I do apologize if my original post was
confusing at all...

Cheers
Jeff

Patrick Maupin · Apr 8, 2010

Grant did give a perfectly good solution.

Yeah, I noticed later and apologized for that. What he gave will work
perfectly if the only data that changes the number of words is the
data the OP is looking for. This may or may not be true. I don't
know anything about the program generating the data, but I did notice
that the OP's attempt at an answer indicated that the OP felt (rightly
or wrongly) he needed to split on two or more spaces.

Bravo!!! Good data, quotes, references, all good stuff!

I absolutely agree that regex shouldn't always be the first thing you
reach for, but I was reading way too much unsubstantiated "this is
bad. Don't do it." on the subject recently. In particular, when
people say "Don't use regex. Use PyParsing!" It may be good advice
in the right context, but it's a bit disingenuous not to mention that
PyParsing will use regex under the covers...

Regards,
Pat

Grant Edwards · Apr 8, 2010

Sorry, my eyes completely missed your one-liner, so my criticism about
not posting a solution was unwarranted. I don't think you and I read
the problem the same way (which is probably why I didn't notice your
solution -- because it wasn't solving the problem I thought I saw).

No worries.

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

If that's the case, my solution won't work right.

I guess my base assumption was that anything with words in it could
change. I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

If the requirement is indeed two or more spaces as a delimiter with
spaces allowed in any field, then a regular expression split is
probably the best solution.
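
The difference between the two readings of the problem can be sketched as follows (Python 3). The `three_word` row is a made-up example whose description column has three words; it is not real smartctl output, and the helper names are illustrative.

```python
import re

def by_words(line):
    # Whitespace split, then drop the first four words and the last three.
    # This assumes the description column is exactly two words long.
    return " ".join(line.split()[4:-3])

def by_columns(line):
    # Split on runs of two or more spaces; the status is the third field.
    return re.split(r" {2,}", line.strip())[2]

two_word = "# 1  Short offline       Completed without error       00%       679         -"
# Hypothetical row: the description has three words instead of two.
three_word = "# 2  Some longer test    Completed without error       00%       679         -"

for row in (two_word, three_word):
    print(repr(by_words(row)), repr(by_columns(row)))
```

The word-based slice picks up a stray word as soon as another column changes its word count, while the column-based split keeps working as long as every gap between columns is at least two spaces wide.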

Patrick Maupin · Apr 8, 2010

This is one of the reasons we're so often suspicious of re solutions:

>>> tre = Timer("re.split(' {2,}', s)",
...     "import re; from __main__ import s")
>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
...     "from __main__ import s")
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>> min(tre.repeat(repeat=5))
6.1224789619445801
>>> min(tsplit.repeat(repeat=5))
1.8338048458099365

I will confess that, in my zeal to defend re, I gave a simple one-liner,
rather than the more optimized version:

>>> tre = Timer("splitter(s)",
...     "import re; from __main__ import s; splitter = re.compile(' {2,}').split")
>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
...     "from __main__ import s")
>>> min(tre.repeat(repeat=5))
1.893190860748291
>>> min(tsplit.repeat(repeat=5))
2.0661051273345947

You're right that if you have an 800K byte string, re doesn't perform
as well as split, but the delta is only a few percent.
14.596404075622559
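
For anyone who wants to rerun the comparison, here is a rough Python 3 sketch using the `timeit` module. The string sizes and repetition counts are arbitrary assumptions (the long string is far smaller than 800K), the comprehension includes an extra `strip()` so both functions return the same fields, and the numbers will differ by machine and Python version, so no timings are quoted.

```python
import re
import timeit

s = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
     "Completed without error" + " " * 7 + "00%")

# Compile once and reuse the bound split method.
splitter = re.compile(" {2,}").split

def regex_split(text):
    return splitter(text)

def comprehension_split(text):
    # Strips each piece so the result matches the regex split.
    return [x.strip() for x in text.split("  ") if x.strip()]

def best_time(func, text, number):
    # Best of three runs, in seconds per `number` calls.
    return min(timeit.repeat(lambda: func(text), repeat=3, number=number))

if __name__ == "__main__":
    for label, text, number in [("short", s, 10000), ("long", s * 1000, 10)]:
        print(label,
              best_time(regex_split, text, number),
              best_time(comprehension_split, text, number))
```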

Regards,
Pat

Patrick Maupin · Apr 8, 2010

On Apr 7, 9:51 pm, Steven D'Aprano

BTW, I don't know how you got 'True' here.

>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

You must not have s set up to be the string given by the OP. I just
realized there was an error in my non-regexp example, that actually
manifests itself with the test data:

>>> import re
>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s)
['# 1', 'Short offline', 'Completed without error', '00%']
>>> [x for x in s.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

To fix it requires something like:

[x.strip() for x in s.split('  ') if x.strip()]

or:

[x for x in [x.strip() for x in s.split('  ')] if x]

I haven't timed either one of these, but given that the broken
original one was slower than the simpler:

splitter = re.compile(' {2,}').split
splitter(s)

on strings of "normal" length, and given that nobody noticed this bug
right away (even though it was in the printout on my first message,
heh), I think that this shows that (here, let me qualify this
carefully), at least in some cases, the first regexp that comes to my
mind can be prettier, shorter, faster, less bug-prone, etc. than the
first non-regexp that comes to my mind...
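
The bug and both fixes can be checked side by side with a short Python 3 sketch. It assumes the test string has gaps of 2, 7 and 7 spaces; the odd-length gaps are what leave a stray leading space after splitting on two-space pairs.

```python
import re

s = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
     "Completed without error" + " " * 7 + "00%")

splitter = re.compile(" {2,}").split
expected = splitter(s)

# Original comprehension: odd-length gaps leave a leading space behind.
broken = [x for x in s.split("  ") if x.strip()]

# The two corrected variants from the post.
fixed_a = [x.strip() for x in s.split("  ") if x.strip()]
fixed_b = [x for x in [x.strip() for x in s.split("  ")] if x]

print(expected)
print(broken)
print(fixed_a == expected, fixed_b == expected)
```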

Regards,
Pat

Lie Ryan · Apr 8, 2010

(And I got testy because of seeing other IMO unwarranted denigration
of re on the list lately.)

Why am I seeing a lot of this pattern lately:

OP: Got problem with string
+- A: Suggested a regex-based solution
+- B: Quoted "Some people ... regex ... two problems."

or

OP: Writes some regex, found problem
+- A: Quoted "Some people ... regex ... two problems."
+- B: Supplied regex-based solution, clean one
+- A: Suggested PyParsing (or similar)

Steven D'Aprano · Apr 8, 2010

On Apr 7, 9:51 pm, Steven D'Aprano

BTW, I don't know how you got 'True' here.

>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

It was a copy and paste from the interactive interpreter. Here it is, in
a fresh session:

[steve@wow-wow ~]$ python
Python 2.5 (r25:51908, Nov 6 2007, 16:54:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import re
>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

Now I copy-and-paste from your latest post to do it again:

>>> s = '# 1  Short offline       Completed without error       00%'
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

Weird, huh?

And here's the answer: somewhere along the line, something changed the
whitespace in the string into non-spaces:
>>> s
'# 1 \xc2\xa0Short offline \xc2\xa0 \xc2\xa0 \xc2\xa0 Completed without error \xc2\xa0 \xc2\xa0 \xc2\xa0 00%'

I blame Google. I don't know how they did it, but I'm sure it was them!
*wink*
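
The effect can be reproduced with a small Python 3 sketch. Here the no-break space is the single character U+00A0 in a `str`, not the two bytes `\xc2\xa0` seen in the Python 2 repr above, and the `mangle` helper is a hypothetical stand-in for whatever altered the whitespace in transit.

```python
import re

clean = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
         "Completed without error" + " " * 7 + "00%")

def mangle(text):
    # Turn every second character of each run of spaces into U+00A0,
    # mirroring the alternating pattern in the repr shown in the post.
    def alternate(match):
        n = len(match.group())
        return "".join(" " if i % 2 == 0 else "\u00a0" for i in range(n))
    return re.sub(" {2,}", alternate, text)

mangled = mangle(clean)

# No run of two plain spaces is left, so neither approach splits anything
# and the two one-element results compare equal.
same_when_mangled = (re.split(" {2,}", mangled) ==
                     [x for x in mangled.split("  ") if x.strip()])

# Normalising the no-break spaces restores the expected behaviour.
normalized = mangled.replace("\u00a0", " ")
fields = re.split(" {2,}", normalized)

print(same_when_mangled)
print(fields)
```

In Python 3, `\s` in a `str` pattern also matches U+00A0, so `re.split(r"\s{2,}", mangled)` would split the mangled string without a separate normalisation step.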

By the way, let's not forget that the string could be fixed-width fields
padded with spaces, in which case the right solution almost certainly
will be:

>>> s = '# 1  Short offline       Completed without error       00%'
>>> result = s[25:55].rstrip()

Even in 2010, there are plenty of programs that export data using fixed
width fields.
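
A slightly more general version of the slicing idea, as a Python 3 sketch. The column boundaries are read off the sample row and are an assumption, not something taken from smartctl documentation, so they should be verified against real output first. `fixed_width_fields` is an illustrative helper name.

```python
# Assumed (start, end) offsets for each column, taken from the sample row.
# An end of None means "to the end of the line".
BOUNDS = [(0, 5), (5, 25), (25, 55), (55, None)]

def fixed_width_fields(line, bounds=BOUNDS):
    # Slice each column out by position and trim the padding.
    return [line[start:end].strip() for start, end in bounds]

row = ("# 1" + " " * 2 + "Short offline" + " " * 7 +
       "Completed without error" + " " * 7 + "00%")

fields = fixed_width_fields(row)
print(fields[2])
```

Unlike the split-based approaches, slicing still works when a field contains two consecutive spaces or when a gap shrinks to a single space, but it breaks if a value ever overflows its column.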

Grant Edwards · Apr 8, 2010

Even in 2010, there are plenty of programs that export data using fixed
width fields.

If you want the columns to line up as the data changes, that's pretty
much the only way to go.

Tim Chase · Apr 8, 2010

Lie said:
Why am I seeing a lot of this pattern lately:

OP: Got problem with string
+- A: Suggested a regex-based solution
+- B: Quoted "Some people ... regex ... two problems."

or

OP: Writes some regex, found problem
+- A: Quoted "Some people ... regex ... two problems."
+- B: Supplied regex-based solution, clean one
+- A: Suggested PyParsing (or similar)

There's a spectrum of parsing solutions:

- string.split() or string[slice] notations handle simple cases
and are built-in

- regexps handle more complex parsing tasks and are also built in

- pyparsing handles far more complex parsing tasks (nesting, etc)
but isn't built-in

The above dialog tends to appear when the task isn't in the
sweet-spot of regexps. Either it's sufficiently simple that
simple split/slice notation will do, or (at the other end of the
spectrum) the effort to get it working with a regexp is hairy and
convoluted, worthy of a more readable solution implemented with
pyparsing. The problem comes from people thinking that regexps
are the right solution to *every* problem...often demonstrated by
the OP writing "how do I write a regexp to solve this
<non-regexp-optimal> problem" assuming regexps are the right tool
for everything.

There are some problem-classes for which regexps are the *right*
solution, and I don't see as much of your example dialog in those
cases.

-tkc
