What is wrong in my list comprehension?

Hussein B

Hey,
I have a log file that doesn't contain the word "Haskell" at all, I'm
just trying to do a little performance comparison:
++++++++++++++
from datetime import time, timedelta, datetime
start = datetime.now()
print start
lines = [line for line in file('/media/sda4/Servers/Apache/Tomcat-6.0.14/logs/catalina.out') if line.find('Haskell')]
print 'Number of lines contains "Haskell" = ' + str(len(lines))
end = datetime.now()
print end
++++++++++++++
Well, the script is returning the number of lines in the whole file!!
What is wrong in my logic?
Thanks.
 
Peter Otten

"""
find(...)
S.find(sub [,start [,end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within s[start:end]. Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
"""

a.find(b) returns -1 if b is not found. -1 evaluates to True in a boolean
context.

Use

[line for line in open(...) if line.find("Haskell") != -1]

or, better

[line for line in open(...) if "Haskell" in line]

to get the expected result.
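
A minimal sketch of the difference, using a small in-memory list as a stand-in for the log file. The sample lines are made up for illustration, and the syntax is Python 3 rather than the Python 2 of the original script:

```python
# Made-up sample lines standing in for the log file.
sample = [
    "Haskell at the start\n",
    "nothing relevant here\n",
    "mentions Haskell later\n",
]

# Original test: find() gives 0 for the first line (falsy) and -1 for the
# second (truthy), so the filter keeps the wrong lines.
buggy = [line for line in sample if line.find("Haskell")]

# Either corrected test keeps exactly the lines that contain the word.
with_find = [line for line in sample if line.find("Haskell") != -1]
with_in = [line for line in sample if "Haskell" in line]

print(buggy)    # the second and third sample lines
print(with_in)  # the first and third sample lines
```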

Peter
 
Chris Rebert

Hey,
I have a log file that doesn't contain the word "Haskell" at all, I'm
just trying to do a little performance comparison:
++++++++++++++
from datetime import time, timedelta, datetime
start = datetime.now()
print start
lines = [line for line in file('/media/sda4/Servers/Apache/Tomcat-6.0.14/logs/catalina.out') if line.find('Haskell')]

From the help() for str.find:
find(...)
<snip>
Return -1 on failure.

~ $ python
Python 2.6 (r26:66714, Nov 18 2008, 21:48:52)
[GCC 4.0.1 (Apple Inc. build 5484)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> bool(-1)
True

str.find() returns -1 on failure (i.e. if the substring is not in the
given string).
-1 is considered boolean true by Python.
Therefore, your list comprehension will contain all lines that don't
*start* with "Haskell" rather than all lines that *contain*
"Haskell".

Use `if "Haskell" in line` instead of `if line.find("Haskell")`. It's
even easier to read that way.
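
To make the three cases concrete, here is a small sketch (Python 3 syntax; the sample strings and labels are made up for illustration) that prints what find() returns, how that value behaves in a truth test, and what the `in` operator says:

```python
# Made-up strings covering the three cases find() can report.
cases = {
    "Haskell rocks": "match at index 0",
    "I like Haskell": "match further in",
    "Python only": "no match",
}

for text, label in cases.items():
    index = text.find("Haskell")
    # bool(index) is what an `if line.find(...)` test actually looks at.
    print(label, index, bool(index), "Haskell" in text)
```

Only the `in` column agrees with "does the string contain the word" in all three cases.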

Cheers,
Chris
 
J Kenneth King

Chris Rebert said:
Python 2.6 (r26:66714, Nov 18 2008, 21:48:52)
[GCC 4.0.1 (Apple Inc. build 5484)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> bool(-1)
True

str.find() returns -1 on failure (i.e. if the substring is not in the
given string).
-1 is considered boolean true by Python.

That's an odd little quirk... never noticed that before.

I just use regular expressions myself.

Wouldn't this be something worth cleaning up? It's a little confusing
for failure to evaluate to boolean true even if the relationship isn't
direct.
 
Peter Otten

J said:
That's an odd little quirk... never noticed that before.

I just use regular expressions myself.

Wouldn't this be something worth cleaning up? It's a little confusing
for failure to evaluate to boolean true even if the relationship isn't
direct.

Well, what is your suggested return value when the substring starts at
position 0?
0

By the way, there already is a method with a cleaner (I think) interface:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found
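
Both behaviours can be reproduced with a short sketch. The sample string is an assumption chosen for illustration, and the syntax is Python 3:

```python
text = "Haskell is a language"  # made-up sample string

# A match at the very start is reported as index 0, which is falsy.
print(text.find("Haskell"))  # 0

# find() signals "not found" with -1 ...
print(text.find("Python"))  # -1

# ... while index() raises ValueError instead.
try:
    text.index("Python")
except ValueError as exc:
    print("ValueError:", exc)  # ValueError: substring not found
```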

Peter
 
Stephen Hansen

J Kenneth King said:
str.find() returns -1 on failure (i.e. if the substring is not in the
given string).

That's an odd little quirk... never noticed that before.

I just use regular expressions myself.

Wouldn't this be something worth cleaning up? It's a little confusing
for failure to evaluate to boolean true even if the relationship isn't
direct.

But what would you clean it up to?

str.find can return 0 ... which is a *true* result as that means it
finds what you're looking for at position 0... but which evaluates to
boolean False. The fact that it can also return -1 which is the
*false* result which evaluates to boolean True is just another side of
that coin.

What are the options for cleaning it up? It could return None when it doesn't
match and you can then test str.find("a") is None... but while that
kinda works it also goes a bit against the point of having boolean
truth/falsehood not representing success/failure of the function. 0
(boolean false) still is a success.
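
A rough sketch of that None-returning variant shows the problem. `find_or_none` is a hypothetical helper written for this illustration, not part of the str API, and the syntax is Python 3:

```python
def find_or_none(text, sub):
    """Hypothetical wrapper: like str.find, but None instead of -1."""
    index = text.find(sub)
    return None if index == -1 else index


print(find_or_none("Python only", "Haskell"))    # None
print(find_or_none("Haskell rocks", "Haskell"))  # 0

# A plain truth test still misreads the match at index 0 ...
print(bool(find_or_none("Haskell rocks", "Haskell")))  # False

# ... so the caller has to compare against None explicitly.
print(find_or_none("Haskell rocks", "Haskell") is not None)  # True
```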

Raising an exception would be a bad idea in many cases, too. You can
use str.index if that's what you want.

So there's not really a great solution to "cleaning it up". I
remember there was some talk in py-dev of removing str.find entirely
because there was no really c, but I have absolutely no idea if they
ended up doing it or not.

--S
 
MRAB

J said:
That's an odd little quirk... never noticed that before.

I just use regular expressions myself.

Wouldn't this be something worth cleaning up? It's a little confusing
for failure to evaluate to boolean true even if the relationship isn't
direct.

str.find() returns the index (position) where the substring was found.
Because string indexes start at 0 the returned value is -1 if it's not
found.

In those languages where string indexes start at 1 the returned value is
0 if not found.
 
J Kenneth King

Stephen Hansen said:
But what would you clean it up to?

(Sorry all for the multiple post... my gnus fudged a bit there)

That's the funny thing about integers having boolean contexts I
guess. Here's a case where 0 actually isn't "False." Any returned value
should be considered "True" and "None" should evaluate to "False." Then
the method can be used in both contexts of logic and procedure.

(I guess that's how I'd solve it, but I can see that implementing it
is highly improbable)

I'm only curious if it's worth cleaning up because the OP's case is one
where there is more than one way to do it.

However, that's not the way the world is and I suppose smarter people
have discussed this before. If there's a link to the discussion, I'd
like to read it. It's pedantic but fascinating nonetheless.
 
rdmurray

Stephen Hansen said:
I just think at this point ".find" is just not the right method to use;
"substring" in "string" is the way to determine what he wants is all.
".find" is useful for when you want the actual position, not when you just
want to determine if there's a match at all. The way I'd clean it is to
remove .find, personally :) I don't remember the outcome of their discussion
on py-dev, and haven't gotten around to loading up Py3 to test it out :)


Python 3.0 (r30:67503, Dec 18 2008, 19:09:30)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Help on built-in function find:

find(...)
S.find(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within s[start:end]. Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
 
Jason Scheirer

Peter Otten said:
[line for line in open(...) if "Haskell" in line]

to get the expected result.

Or better, group them together in a generator:

sum(line for line in open(...) if "Haskell" in line)

and avoid allocating a new list with every line that contains Haskell
in it.

http://www.python.org/dev/peps/pep-0289/
 
Peter Otten

Jason said:
Or better, group them together in a generator:

sum(line for line in open(...) if "Haskell" in line)

You probably mean

sum(1 for line in open(...) if "Haskell" in line)

if you want to count the lines containing "Haskell", or

sum(line.count("Haskell") for line in open(...) if "Haskell" in line)

if you want to count the occurrences of "Haskell" (where the if clause is
logically superfluous, but may improve performance).
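
Both counts can be tried out without the real log file. In this sketch an io.StringIO object with made-up content stands in for the open file (an assumption for illustration; the syntax is Python 3):

```python
import io

# Made-up content standing in for the log file; io.StringIO can be
# iterated line by line, the way an open file would be.
log = io.StringIO(
    "Haskell and more Haskell\n"
    "nothing relevant\n"
    "one Haskell here\n"
)

# Count the lines that contain the word, without building a list.
matching_lines = sum(1 for line in log if "Haskell" in line)

log.seek(0)  # rewind before the second pass

# Count every occurrence of the word.
occurrences = sum(line.count("Haskell") for line in log)

print(matching_lines)  # 2
print(occurrences)     # 3
```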

Jason said:
and avoid allocating a new list with every line that contains Haskell
in it.

But note that the OP stated that there were no such lines.

Peter
 
