Regex driving me crazy...

J

Can someone make me un-crazy?

I have a bit of code that right now, looks like this:

status = getoutput('smartctl -l selftest /dev/sda').splitlines()[6]
status = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":",status)
print status

Basically, it pulls the first actual line of data from the return you
get when you use smartctl to look at a hard disk's selftest log.

The raw data looks like this:

# 1  Short offline       Completed without error       00%       679         -

Unfortunately, all that whitespace is arbitrary single space
characters. And I am interested in the string that appears in the
third column, which changes as the test runs and then completes. So
in the example, "Completed without error"

The regex I have up there doesn't quite work, as it seems to be
subbing EVERY space (or at least in instances of more than one space)
to a ':' like this:

# 1: Short offline:::::: Completed without error:::::: 00%:::::: 679:::::::: -

Ultimately, what I'm trying to do is replace any run of more than
one space with a delimiter, then split the result into a list and
get the third item.
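
For reference, a minimal Python 3 sketch of the delimiter idea described above. It assumes the line looks exactly like the sample in this post; the names `line`, `partial`, and `collapsed` are only illustrative. The posted pattern matches a space only when another space follows it, so every space in a run except the last one becomes ':'. Matching the whole run in one go avoids that.

```python
import re

line = "# 1  Short offline       Completed without error       00%       679         -"

# The posted pattern: a space followed by another space (the quote-counting
# lookahead is always satisfied here because the line contains no quotes).
partial = re.sub(r' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":", line)
print(partial)

# Replace each run of two or more spaces with a single delimiter instead.
collapsed = re.sub(r" {2,}", ":", line)
print(collapsed)

# The third field is the status text.
status = collapsed.split(":")[2]
print(status)
```

If a field could itself contain ':', this intermediate delimiter would break the split, so splitting directly on the run of spaces is probably the safer variant.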

OR, if there's a smarter, shorter, or better way of doing it, I'd love to know.

The end result should pull the whole string in the middle of that
output line, and then I can use that to compare to a list of possible
output strings to determine if the test is still running, has
completed successfully, or failed.

Unfortunately, my google-fu fails right now, and my Regex powers were
always rather weak anyway...

So any ideas on what the best way to proceed with this would be?
 
Grant Edwards

Can someone make me un-crazy?

Definitely. Regex is driving you crazy, so don't use a regex.

inputString = "# 1  Short offline       Completed without error     00%       679         -"

print ' '.join(inputString.split()[4:-3])

So any ideas on what the best way to proceed with this would be?

Anytime you have a problem with a regex, the first thing you should
ask yourself: "do I really, _really_ need a regex?"

Hint: the answer is usually "no".
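
A Python 3 rendering of the one-liner above, as a sketch. It assumes the first two columns always contribute exactly four words and the last three columns exactly three, which is only known to hold for the sample line in this thread.

```python
line = "# 1  Short offline       Completed without error       00%       679         -"

# split() with no argument splits on any run of whitespace.
words = line.split()

# Drop the four leading words ('#', '1', 'Short', 'offline') and the
# three trailing ones ('00%', '679', '-'); whatever is left is the status.
status = " ".join(words[4:-3])
print(status)
```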
 
Patrick Maupin

Can someone make me un-crazy?

So any ideas on what the best way to proceed with this would be?

You mean like this?
import re
re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']
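
Applied to the full sample line from the original post, the same split can be sketched in Python 3 like this. The variable names are illustrative, and it assumes columns are always separated by at least two spaces while fields contain only single spaces.

```python
import re

line = "# 1  Short offline       Completed without error       00%       679         -"

# Two or more consecutive spaces mark a column boundary.
fields = re.split(r" {2,}", line)
print(fields)

# Third column (index 2) is the self-test status.
status = fields[2]
print(status)
```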

Regards,
Pat
 
Patrick Maupin

Can someone make me un-crazy?

Definitely.  Regex is driving you crazy, so don't use a regex.

  inputString = "# 1  Short offline       Completed without error     00%       679         -"

  print ' '.join(inputString.split()[4:-3])
So any ideas on what the best way to proceed with this would be?

Anytime you have a problem with a regex, the first thing you should
ask yourself:  "do I really, _really_ need a regex?"

Hint: the answer is usually "no".

OK, fine. Post a better solution to this problem than:
import re
re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

Regards,
Pat
 
Patrick Maupin

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible, so I
will present it as well, for your viewing pleasure:
[x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']

Regards,
Pat
 
James Stroud

Patrick said:
BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

Thank you,
James
 
Patrick Maupin

Patrick said:
BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

I most certainly will not consider when that isn't warranted!

OTOH, in general I am more interested in admonishing the authors of
the pseudo-answers than I am the authors of the questions, despite the
fact that I find this hilarious:

http://despair.com/cluelessness.html

Regards,
Pat
 
Grant Edwards

Can someone make me un-crazy?

Definitely.  Regex is driving you crazy, so don't use a regex.

  inputString = "# 1  Short offline       Completed without error     00%       679         -"

  print ' '.join(inputString.split()[4:-3])
[...]

OK, fine. Post a better solution to this problem than:
import re
re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

OK, I'll bite: what's wrong with the solution I already posted?
 
Grant Edwards

Patrick said:
BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

I will certainly try to avoid not answering in a manner not unlike that.
 
Patrick Maupin

Can someone make me un-crazy?
Definitely.  Regex is driving you crazy, so don't use a regex.
  inputString = "# 1  Short offline       Completed without error     00%       679         -"
  print ' '.join(inputString.split()[4:-3])
[...]

OK, fine.  Post a better solution to this problem than:
import re
re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']

OK, I'll bite: what's wrong with the solution I already posted?

Sorry, my eyes completely missed your one-liner, so my criticism about
not posting a solution was unwarranted. I don't think you and I read
the problem the same way (which is probably why I didn't notice your
solution -- because it wasn't solving the problem I thought I saw).

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

I guess my base assumption was that anything with words in it could
change. I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

(And I got testy because of seeing other IMO unwarranted denigration
of re on the list lately.)

Regards,
Pat
 
Steven D'Aprano

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do,

Grant did give a perfectly good solution.

and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible, so I
will present it as well, for your viewing pleasure:
[x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']


This is one of the reasons we're so often suspicious of re solutions:

from timeit import Timer
s = '# 1  Short offline       Completed without error       00%'
tre = Timer("re.split(' {2,}', s)",
            "import re; from __main__ import s")
tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
               "from __main__ import s")

re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

min(tre.repeat(repeat=5))
6.1224789619445801
min(tsplit.repeat(repeat=5))
1.8338048458099365


Even when they are correct and not unreadable line-noise, regexes tend to
be slow. And they get worse as the size of the input increases:
4.6444299221038818


And this isn't even one of the pathological O(N**2) or O(2**N) regexes.

Don't get me wrong -- regexes are a useful tool. But if your first
instinct is to write a regex, you're doing it wrong.


A related problem is Perl's over-reliance on regular expressions
that is exaggerated by advocating regex-based solution in almost
all O'Reilly books. The latter until recently were the most
authoritative source of published information about Perl.

While simple regular expression is a beautiful thing and can
simplify operations with string considerably, overcomplexity in
regular expressions is extremly dangerous: it cannot serve a basis
for serious, professional programming, it is fraught with pitfalls,
a big semantic mess as a result of outgrowing its primary purpose.
Diagnostic for errors in regular expressions is even weaker then
for the language itself and here many things are just go unnoticed.
[end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml



Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html
 
J

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

I guess my base assumption was that anything with words in it could
change.  I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

I apologize for the confusion, Pat...

I could have worded that better, but at that point I was A:
Frustrated, B: starving, and C: had my wife nagging me to stop working
to come get something to eat ;-)

What I meant was, in that output string, the phrase in the middle
could change in length...
After looking at the source code for smartctl (part of the
smartmontools package for you linux people) I found the switch that
creates those status messages.... they vary in character length, some
with non-text characters like ( and ) and /, and have either 3 or 4
words...

The spaces between each column, instead of being a fixed number of
spaces each, were seemingly arbitrarily created... there may be 4
spaces between two columns or there may be 9, or 7 or who knows what,
and since they were all being treated as individual spaces instead of
tabs or something, I was having trouble splitting the output into
something that was easy to parse (at least in my mind it seemed that
way).
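
To illustrate the point about gap widths, here is a small Python 3 sketch. The helper name `third_column` and the second sample line are made up for the example; only the first line comes from this thread. Because the split looks for two or more spaces, it does not matter whether a gap is 4, 7 or 9 spaces wide, or whether the status contains characters such as parentheses.

```python
import re

def third_column(line):
    """Return the third column of a line whose columns are separated
    by runs of two or more spaces (fields may contain single spaces)."""
    return re.split(r" {2,}", line.strip())[2]

samples = [
    # Line quoted in this thread.
    "# 1  Short offline       Completed without error       00%       679         -",
    # Hypothetical line with different gap widths and punctuation in the status.
    "# 2  Short offline    Interrupted (host reset)  90%   680     -",
]

for sample in samples:
    print(third_column(sample))
```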

Anyway, that's that... and I do apologize if my original post was
confusing at all...

Cheers
Jeff
 
Patrick Maupin

Grant did give a perfectly good solution.

Yeah, I noticed later and apologized for that. What he gave will work
perfectly if the only data that changes the number of words is the
data the OP is looking for. This may or may not be true. I don't
know anything about the program generating the data, but I did notice
that the OP's attempt at an answer indicated that the OP felt (rightly
or wrongly) he needed to split on two or more spaces.

and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible, so I
will present it as well, for your viewing pleasure:
[x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']

This is one of the reasons we're so often suspicious of re solutions:

from timeit import Timer
s = '# 1  Short offline       Completed without error       00%'
tre = Timer("re.split(' {2,}', s)",
            "import re; from __main__ import s")
tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
               "from __main__ import s")

re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

min(tre.repeat(repeat=5))
6.1224789619445801
min(tsplit.repeat(repeat=5))
1.8338048458099365

Even when they are correct and not unreadable line-noise, regexes tend to
be slow. And they get worse as the size of the input increases:
0.41538596153259277

4.6444299221038818

And this isn't even one of the pathological O(N**2) or O(2**N) regexes.

Don't get me wrong -- regexes are a useful tool. But if your first
instinct is to write a regex, you're doing it wrong.

   
    A related problem is Perl's over-reliance on regular expressions
    that is exaggerated by advocating regex-based solution in almost
    all O'Reilly books. The latter until recently were the most
    authoritative source of published information about Perl.

    While simple regular expression is a beautiful thing and can
    simplify operations with string considerably, overcomplexity in
    regular expressions is extremly dangerous: it cannot serve a basis
    for serious, professional programming, it is fraught with pitfalls,
    a big semantic mess as a result of outgrowing its primary purpose..
    Diagnostic for errors in regular expressions is even weaker then
    for the language itself and here many things are just go unnoticed.
    [end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml

Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html

Bravo!!! Good data, quotes, references, all good stuff!

I absolutely agree that regex shouldn't always be the first thing you
reach for, but I was reading way too much unsubstantiated "this is
bad. Don't do it." on the subject recently. In particular, when
people say "Don't use regex. Use PyParsing!" It may be good advice
in the right context, but it's a bit disingenuous not to mention that
PyParsing will use regex under the covers...

Regards,
Pat
 
Grant Edwards

Sorry, my eyes completely missed your one-liner, so my criticism about
not posting a solution was unwarranted. I don't think you and I read
the problem the same way (which is probably why I didn't notice your
solution -- because it wasn't solving the problem I thought I saw).

No worries.

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

If that's the case, my solution won't work right.

I guess my base assumption was that anything with words in it could
change. I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

If the requirement is indeed two or more spaces as a delimiter with
spaces allowed in any field, then a regular expression split is
probably the best solution.
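
A small Python 3 sketch of the difference between the two readings of the problem. The sample line here is hypothetical: its second column has been shortened to one word purely to show what happens when a column other than the status changes its word count.

```python
import re

# Hypothetical line: the second column is one word instead of two.
line = "# 1  Extended       Completed without error       00%       679         -"

# Counting words from both ends assumes fixed word counts in the other columns.
by_words = " ".join(line.split()[4:-3])

# Splitting on runs of two or more spaces only assumes the column gaps.
by_gaps = re.split(r" {2,}", line)[2]

print(repr(by_words))
print(repr(by_gaps))
```

With this input the word-count version loses the first word of the status, while the gap-based version still returns the whole field.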
 
Patrick Maupin

This is one of the reasons we're so often suspicious of re solutions:

from timeit import Timer
s = '# 1  Short offline       Completed without error       00%'
tre = Timer("re.split(' {2,}', s)",
            "import re; from __main__ import s")
tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
               "from __main__ import s")

re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

min(tre.repeat(repeat=5))
6.1224789619445801
min(tsplit.repeat(repeat=5))
1.8338048458099365

I will confess that, in my zeal to defend re, I gave a simple
one-liner, rather than the more optimized version:
tre = Timer("splitter(s)",
            "import re; from __main__ import s; splitter = re.compile(' {2,}').split")
tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
               "from __main__ import s")
min(tre.repeat(repeat=5))
1.893190860748291
min(tsplit.repeat(repeat=5))
2.0661051273345947

You're right that if you have an 800K byte string, re doesn't perform
as well as split, but the delta is only a few percent.
14.596404075622559
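
For anyone who wants to repeat the comparison on a current interpreter, here is a rough Python 3 sketch. The function names and the iteration count are arbitrary choices for the example, the non-regex version includes the extra strip() so that both return the same list, and the timings will differ from the numbers quoted in this thread.

```python
import re
from timeit import timeit

s = '# 1  Short offline       Completed without error       00%'

# Precompile once and keep a reference to the bound split method.
splitter = re.compile(r" {2,}").split

def with_regex():
    return splitter(s)

def with_split():
    # Split on double spaces, then trim and drop the empty leftovers.
    return [x.strip() for x in s.split("  ") if x.strip()]

for fn in (with_regex, with_split):
    # Timings vary by machine and Python version, so none are shown here.
    print(fn.__name__, timeit(fn, number=10_000))
```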

Regards,
Pat
 
Patrick Maupin

On Apr 7, 9:51 pm, Steven D'Aprano

BTW, I don't know how you got 'True' here.
re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True

You must not have s set up to be the string given by the OP. I just
realized there was an error in my non-regexp example, that actually
manifests itself with the test data:
import re
s = '# 1  Short offline       Completed without error       00%'
re.split(' {2,}', s)
['# 1', 'Short offline', 'Completed without error', '00%']
[x for x in s.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']
re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

To fix it requires something like:

[x.strip() for x in s.split('  ') if x.strip()]

or:

[x for x in [x.strip() for x in s.split('  ')] if x]
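
A quick Python 3 check of the bug and both fixes, as a sketch. The variable names are illustrative, and the second fix uses a generator expression in place of the inner list, which should behave the same way.

```python
import re

s = '# 1  Short offline       Completed without error       00%'

expected = re.split(r" {2,}", s)

# Original non-regex version: an odd-length run of spaces leaves one
# leading space on the next field.
broken = [x for x in s.split("  ") if x.strip()]

# Both fixes strip each piece before keeping it.
fixed_a = [x.strip() for x in s.split("  ") if x.strip()]
fixed_b = [x for x in (y.strip() for y in s.split("  ")) if x]

print(broken == expected)
print(fixed_a == expected, fixed_b == expected)
```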

I haven't timed either one of these, but given that the broken
original one was slower than the simpler:

splitter = re.compile(' {2,}').split
splitter(s)

on strings of "normal" length, and given that nobody noticed this bug
right away (even though it was in the printout on my first message,
heh), I think that this shows that (here, let me qualify this
carefully), at least in some cases, the first regexp that comes to my
mind can be prettier, shorter, faster, less bug-prone, etc. than the
first non-regexp that comes to my mind...

Regards,
Pat
 
Lie Ryan

(And I got testy because of seeing other IMO unwarranted denigration
of re on the list lately.)


Why am I seeing a lot of this pattern lately:

OP: Got problem with string
+- A: Suggested a regex-based solution
+- B: Quoted "Some people ... regex ... two problems."

or

OP: Writes some regex, found problem
+- A: Quoted "Some people ... regex ... two problems."
+- B: Supplied regex-based solution, clean one
+- A: Suggested PyParsing (or similar)
 
Steven D'Aprano

On Apr 7, 9:51 pm, Steven D'Aprano

BTW, I don't know how you got 'True' here.
re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True


It was a copy and paste from the interactive interpreter. Here it is, in
a fresh session:

[steve@wow-wow ~]$ python
Python 2.5 (r25:51908, Nov 6 2007, 16:54:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import re
s = '# 1  Short offline       Completed without error       00%'
re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True


Now I copy-and-paste from your latest post to do it again:
s = '# 1  Short offline       Completed without error       00%'
re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False


Weird, huh?

And here's the answer: somewhere along the line, something changed the
whitespace in the string into non-spaces:
'# 1 \xc2\xa0Short offline \xc2\xa0 \xc2\xa0 \xc2\xa0 Completed without error \xc2\xa0 \xc2\xa0 \xc2\xa0 00%'


I blame Google. I don't know how they did it, but I'm sure it was them!
*wink*
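
The effect can be reproduced with a short Python 3 sketch. The string below is rebuilt by hand from the repr above (the bytes \xc2\xa0 are the UTF-8 encoding of U+00A0, the no-break space), so treat it as an illustration rather than the exact data from the earlier session.

```python
import re

# Every second "space" is U+00A0, so there is never a run of two real spaces.
mangled = ("# 1 \u00a0Short offline \u00a0 \u00a0 \u00a0 "
           "Completed without error \u00a0 \u00a0 \u00a0 00%")

# Neither approach finds a delimiter, so both return the whole string
# as a single item and the comparison comes out True.
print(re.split(r" {2,}", mangled) == [mangled])
print([x for x in mangled.split("  ") if x.strip()] == [mangled])

# Option 1: turn the no-break spaces back into ordinary spaces first.
repaired = mangled.replace("\u00a0", " ")
fields = re.split(r" {2,}", repaired)

# Option 2: in Python 3, \s in a str pattern also matches U+00A0.
fields_ws = re.split(r"\s{2,}", mangled)

print(fields)
print(fields_ws)
```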


By the way, let's not forget that the string could be fixed-width fields
padded with spaces, in which case the right solution almost certainly
will be:

s = '# 1  Short offline       Completed without error       00%'
result = s[25:55].rstrip()

Even in 2010, there are plenty of programs that export data using fixed
width fields.
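
A slightly more general Python 3 sketch of the fixed-width idea. The column offsets in `COLUMNS` are inferred from the single sample line in this thread and are an assumption, not a documented layout; the sample line is built with ljust() so the field widths are explicit.

```python
# Offsets inferred from the sample line only (assumption, not a spec).
COLUMNS = {
    "num": slice(0, 5),
    "test": slice(5, 25),
    "status": slice(25, 55),
    "remaining": slice(55, None),
}

# Build a fixed-width line by padding each field to its column width.
line = ("# 1".ljust(5)
        + "Short offline".ljust(20)
        + "Completed without error".ljust(30)
        + "00%")

# Slice each column out and trim the padding.
record = {name: line[cols].rstrip() for name, cols in COLUMNS.items()}
print(record["status"])
```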
 
Grant Edwards

Even in 2010, there are plenty of programs that export data using fixed
width fields.

If you want the columns to line up as the data changes, that's pretty
much the only way to go.
 
Tim Chase

Lie said:
Why am I seeing a lot of this pattern lately:

OP: Got problem with string
+- A: Suggested a regex-based solution
+- B: Quoted "Some people ... regex ... two problems."

or

OP: Writes some regex, found problem
+- A: Quoted "Some people ... regex ... two problems."
+- B: Supplied regex-based solution, clean one
+- A: Suggested PyParsing (or similar)

There's a spectrum of parsing solutions:

- string.split() or string[slice] notations handle simple cases
and are built-in

- regexps handle more complex parsing tasks and are also built in

- pyparsing handles far more complex parsing tasks (nesting, etc)
but isn't built-in


The above dialog tends to appear when the task isn't in the
sweet-spot of regexps. Either it's sufficiently simple that
simple split/slice notation will do, or (at the other end of the
spectrum) the effort to get it working with a regexp is hairy and
convoluted, worthy of a more readable solution implemented with
pyparsing. The problem comes from people thinking that regexps
are the right solution to *every* problem...often demonstrated by
the OP writing "how do I write a regexp to solve this
<non-regexp-optimal> problem" assuming regexps are the right tool
for everything.

There are some problem-classes for which regexps are the *right*
solution, and I don't see as much of your example dialog in those
cases.

-tkc
 
