Faster 'if char in string' test

K

Kamilche

I was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:

'''
Translate Speed Test

This code looks for invalid characters in a string,
and raises an exception when it finds one.
I'm testing 2 methods: the 'if char in string' method,
and one based on using the 'translate' function and
comparing string lengths afterwards.
Wow, what a difference! Translate is over 10x faster!

Function Loops Seconds Loops/sec
***********************************************
In 10000 0.171 58479
Translate 10000 0.016 624998

'''

import mytime
import string


_allchars = None
_deletechars = None
_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "


def init():
global _allchars, _deletechars
l = []
a = []
for i in range(256):
a.append(chr(i))
if not chr(i) in _validchars:
l.append(chr(i))
_deletechars = ''.join(l)
_allchars = ''.join(a)


def test():
max = 10000
tmr = mytime.Timer()
r = range(max)
s = "This is a string to test for invalid characters."
print tmr.heading

tmr.startit()
for i in r:
for c in s:
if c in _deletechars:
raise Exception("Invalid character found!")
tmr.stopit(max)
print tmr.results('In')

tmr.startit()
for i in r:
s2 = s.translate(_allchars, _deletechars)
if len(s2) != len(s):
raise Exception("Invalid character found!")
tmr.stopit(max)
print tmr.results('Translate')


init()

if __name__ == "__main__":
test()
 
P

Peter Otten

Kamilche said:
was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:

For a fair comparison you should use a dictionary instead of a string for
the containment test.

Peter
 
R

Reinhold Birkenfeld

Kamilche said:
I was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:

Perhaps it would be a good idea to post the snippets at the Python cookbook.

Reinhold
 
H

Harald Massa

Kamilche,

have you tried module sets?

tmp_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_validchars=sets.Set(tmp_validchars)

....

the test becomes:

testset=sets.Set(s)
if testset.difference(_validchars):
raise Exception("Invalid character found!")


it may be even faster...


best wishes

Harald
 
?

=?iso-8859-15?Q?Pierre-Fr=E9d=E9ric_Caillaud?=

why don't you use translate to delete all the valid characters and test
if the resulting string is not not empty ? this will save you two calls to
len()

hehe...
 
E

Elbert Lev

Very good!

One remark:

translate method is much slower then "c in string" method in the case
when the test string is very long (say 5mb) and the invalid character
is located close to the beginning of the string. (10000 times slower!)
 
K

Kamilche

Pierre-Frédéric Caillaud said:
why don't you use translate to delete all the valid characters and test
if the resulting string is not not empty ? this will save you two calls to
len()... heh heh

You're right, Pierre, that just might be more efficient! But I doubt
those two 'len' calls would show up on a profiler, somehow. ;-) But
you have to time it again, because the question becomes 'how efficient
is the translate function at removing invalid characters from a
string?' It might be much more efficient at removing one char, than at
removing a thousand. So the timings would still be necessary, to make
sure you're not optimizing the wrong part.

The 'dict' solution proposed wouldn't work because the data I'm
testing is in string form, and the overhead of translating the string
to a dict before testing would swamp the results. So would using a
set, because timings show a set is 4x slower than a dict.

Unless I'm misunderstanding Peter's suggestion. Did you mean
translating the string into a dict, then using a 'if boguschar in
dict' test?
 
K

Kamilche

Pierre-Frédéric Caillaud said:
why don't you use translate to delete all the valid characters and test
if the resulting string is not not empty ? this will save you two calls to
len()... heh heh

You're right, Pierre, that just might be more efficient! But I doubt
those two 'len' calls would show up on a profiler, somehow. ;-) But
you have to time it again, because the question becomes 'how efficient
is the translate function at removing invalid characters from a
string?' It might be much more efficient at removing one char, than at
removing a thousand. So the timings would still be necessary, to make
sure you're not optimizing the wrong part.

The 'dict' solution proposed wouldn't work because the data I'm
testing is in string form, and the overhead of translating the string
to a dict before testing would swamp the results. So would using a
set, because timings show a set is 4x slower than a dict.

Unless I'm misunderstanding Peter's suggestion. Did you mean
translating the string into a dict, then using a 'if boguschar in
dict' test?
 
P

Peter Otten

Kamilche said:
he 'dict' solution proposed wouldn't work because the data I'm
testing is in string form, and the overhead of translating the string
to a dict before testing would swamp the results. So would using a
set, because timings show a set is 4x slower than a dict.

Unless I'm misunderstanding Peter's suggestion. Did you mean
translating the string into a dict, then using a 'if boguschar in
dict' test?

Look for validate_dict() to see what I meant. I've since found a faster
alternative using regular expressions.

<code>
import itertools, re, string

_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_allchars = string.maketrans("", "")
_invalidchars = _allchars.translate(_allchars, _validchars)

_invaliddict = dict(itertools.izip(_invalidchars, itertools.repeat(None)))

valid = "This is a string to test for invalid characters."
invalid = valid + _invalidchars + valid
massaged = (_invalidchars + valid) * 100

rInvalid = re.compile("[%s]" % re.escape(_invalidchars))

def validate_list(s, invalid=_invalidchars):
for c in s:
if c in invalid:
return False
return True

def validate_dict(s, invalid=_invaliddict):
for c in s:
if c in invalid:
return False
return True

def validate_translate(s):
return len(s.translate(_allchars, _invalidchars)) == len(s)

def validate_regex(s, rInvalid=rInvalid):
return not rInvalid.search(s)

def validate_noop(s, valid=valid):
return s is valid


if __name__ == "__main__":
import timeit
tests = [f for k, f in globals().items() if k.startswith("validate_")]
for f in tests:
assert f(valid)
assert not f(invalid)

for f in tests:
name = f.__name__
print name
for data in ["valid", "invalid", "massaged"]:
print " ", data,
timeit.main([
"-sfrom findcharspeed import %s as f, %s as d" % (name, data),
"f(d)"])


</code>

Here are my timings:

validate_regex
valid 100000 loops, best of 3: 2.24 usec per loop
invalid 100000 loops, best of 3: 2.37 usec per loop
massaged 1000000 loops, best of 3: 1.61 usec per loop
validate_dict
valid 100000 loops, best of 3: 10.8 usec per loop
invalid 100000 loops, best of 3: 10.7 usec per loop
massaged 1000000 loops, best of 3: 0.99 usec per loop
validate_translate
valid 100000 loops, best of 3: 2.62 usec per loop
invalid 100000 loops, best of 3: 3.68 usec per loop
massaged 10000 loops, best of 3: 57.6 usec per loop
validate_list
valid 100000 loops, best of 3: 14.2 usec per loop
invalid 100000 loops, best of 3: 14 usec per loop
massaged 1000000 loops, best of 3: 1.02 usec per loop
validate_noop
valid 1000000 loops, best of 3: 0.582 usec per loop
invalid 1000000 loops, best of 3: 0.622 usec per loop
massaged 1000000 loops, best of 3: 0.634 usec per loop

Note how badly your solution using str.translate() may perform for certain
data due to its failure to shortcut.

Peter
 
K

Kamilche

translate method is much slower then "c in string" method in the case
when the test string is very long (say 5mb) and the invalid character
is located close to the beginning of the string. (10000 times slower!)

Good point! I was surprised that allocating a fresh string and
comparing lengths was speedier, but upon reflection, I could see how
that could be so. And I also see how if it was a HUMONGOUS string, a
book more like, this would no longer be the case.

Thanks for pointing that out!
 
K

Kamilche

Peter Otten said:
Note how badly your solution using str.translate() may perform for certain
data due to its failure to shortcut.

Peter

Heheh, I'm a Python newbie. I can't even understand the code you have
written, but I'll try.

You know, as I was writing stuff like fn(**args.__dict__) yesterday
(which for me is pretty advanced!), it occurred to me that Python is
not 'simpler than C', like I had originally thought. :-D It's better
in many ways, but it has its dark alleyways where a novice can get
stabbed in the back, too.

{looks suspiciously down the dark alleyway where Peter's code is
standing, and tries to glide by undetected.}

--Kamilche
 
?

=?iso-8859-15?Q?Pierre-Fr=E9d=E9ric_Caillaud?=

You know, as I was writing stuff like fn(**args.__dict__) yesterday
(which for me is pretty advanced!), it occurred to me that Python is
not 'simpler than C'

In the case of variable number of args, C gives you neither the number of
arguments, not their types, nor their names, so you have to send the
number as a parameter or append a null at the end which... sucks !
 
P

Peter Otten

Kamilche said:
Heheh, I'm a Python newbie. I can't even understand the code you have
written, but I'll try.

If I don't understand Python code, I tend to blame the code. I'll try to
either explain or rewrite the "dark" pieces below.
You know, as I was writing stuff like fn(**args.__dict__) yesterday
(which for me is pretty advanced!), it occurred to me that Python is
not 'simpler than C', like I had originally thought. :-D It's better
in many ways, but it has its dark alleyways where a novice can get
stabbed in the back, too.

Python and C are both simple. Python has friendlier error messages and less
bloat, i. e. manual garbage collection in programs that actually work.
Recent additions to Python and the newstyle/classic class schism, while all
reasonable in themselves, make it harder to "know it all". If decorators
take off, this will become even worse.
Enough pessimism - Python is still much more compact than C++, and if anyone
has the power to say no, then it's Guido van Rossum. If a way to integrate
type inferencing smoothly is actually found, why should anyone want to
program in C at all?

Peter

Now the revised code:

import itertools, re, string

_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_allchars = string.maketrans("", "")
_invalidchars = _allchars.translate(_allchars, _validchars)

# repeat(v) gives an "endless" sequence of v, so
# zip("abc", repeat(None)) == [("a", None), ("b", None), ("c", None)]
# In Python 2.4 I would use a set but in 2.3.x dicts are faster.
# _invalidset = set(_invalidcars)
_invaliddict = dict(zip(_invalidchars, itertools.repeat(None)))

# contains only valid characters
valid = "This is a string to test for invalid characters."

# contains both valid and invalid characters
invalid = valid + _invalidchars + valid

# A LONG string containing MANY invalid characters to make
# the translate() solution look bad
massaged = (_invalidchars + valid) * 100

# supposing _invalidchars == "abc" we get the regex re.compile("[abc]").
# rInvalid.search(s) would then return a non-None result whenever s
# contains an a, b, or c. The nice thing for our purpose is that it stops
# at the first match, i. e. if the first char is invalid, we don't care
# about the rest
rInvalid = re.compile("[%s]" % re.escape(_invalidchars))

def validate_list(s, invalid=_invalidchars):
# since the set of invalid characters is used many times,
# it is faster to have it as a local variable.
# I don't know why, but achieving this via a default parameter
# is even faster than a
# invalid = _invalidchars
# assignment at the beginning of the function
for c in s:
if c in invalid:
return False
return True

def validate_dict(s, invalid=_invaliddict):
# exactly the same as validate_list, but the
# c in invalid
# is faster because invalid is a dictionary
for c in s:
if c in invalid:
return False
return True

def validate_translate(s):
return len(s.translate(_allchars, _invalidchars)) == len(s)

def validate_regex(s):
return not rInvalid.search(s)

def validate_noop(s, valid=valid):
""" Not a valid implementation. Only to estimate the function
call overhead.
"""
return s is valid


def measure(func, data):
""" Run the speed test for a func/data pair and print the result.

func: name of the function to be tested
data: name of the variable containing the sample data
"""
print " ", data,
setup = "from findcharspeed import %s, %s" % (func, data)
run = "%s(%s)" % (func, data)
# example of the above:
# from findcharspeed import validate_list, valid
# validate_list(valid) # this statement will be timed

# timeit.main() expects parameters from the commandline, so we have
# to add the appropriate switches when invoking it programmatically.
timeit.main(["-s" + setup, run])

if __name__ == "__main__":
import timeit

# functions I want to time
tests = [validate_list, validate_dict, validate_translate,
validate_regex, validate_noop]

# make sure the functions return the correct result
for f in tests:
assert f(valid)
assert not f(invalid)

# calculate the timings
for f in tests:
print f.__name__
# test each candidate function with 3 different sample strings
for data in ["valid", "invalid", "massaged"]:
measure(f.__name__, data)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,999
Messages
2,570,243
Members
46,836
Latest member
login dogas

Latest Threads

Top