Quick compare string to list

S

Scooter

I'm reading in a text file, and for each line in the file, I'm looking
for the existence of phrases from a list. The list contains approx.
120 items currently but will most likely grow. This procedure itself
is not the main function of my program and only grew out of the need
to reformat certain phrases I'm finding in a file before re-outputting
it. But as I suspected, this searching of the lists slows the whole
process way way down. Was looking for ideas of a better way to do
this.

I basically have

mylist=[]
....
code that reads in the flat file into string 'flatfileString'
....
for listitem in mylist:
if flatfileString.count(listitem):
...whatever...I found it.
 
T

Terry Reedy

Scooter said:
I'm reading in a text file, and for each line in the file, I'm looking
for the existence of phrases from a list. The list contains approx.
120 items currently but will most likely grow. This procedure itself
is not the main function of my program and only grew out of the need
to reformat certain phrases I'm finding in a file before re-outputting
it. But as I suspected, this searching of the lists slows the whole
process way way down. Was looking for ideas of a better way to do
this.

I basically have

mylist=[]
...
code that reads in the flat file into string 'flatfileString'
...
for listitem in mylist:
if flatfileString.count(listitem):
...whatever...I found it.

I would try:

turn mylist into my_re and compile
for line in file:
while search line for first occurence of any phase returns yes:
process
reduce line to remainder of line after phrase found
# assuming no overlaps

tjr
 
E

Emile van Sebille

On 9/30/2009 11:36 AM Scooter said...
I'm reading in a text file, and for each line in the file, I'm looking
for the existence of phrases from a list. The list contains approx.
120 items currently but will most likely grow. This procedure itself
is not the main function of my program and only grew out of the need
to reformat certain phrases I'm finding in a file before re-outputting
it. But as I suspected, this searching of the lists slows the whole
process way way down. Was looking for ideas of a better way to do
this.

I basically have

mylist=[]
...
code that reads in the flat file into string 'flatfileString'
...
for listitem in mylist:
if flatfileString.count(listitem):
...whatever...I found it.

Whatever you do next to reformat those certain phrases will require a
second scan which doubles the time involved, and as you don't save the
count anyway, if mylist were exchange couplets you could use replace
directly. Something like:

mylist = [('Chevy','Chevrolet'),
('GM','General Motors'),
(... etc... )
]

for wrong,right in mylist:
flatfileString=flatfileString.replace(wrong,right)


Flavor to taste,

Emile
 
B

Bearophile

Scooter:
I'm reading in a text file, and for each line in the file, I'm looking
for the existence of phrases from a list. The list contains approx.
120 items currently but will most likely grow. This procedure itself
is not the main function of my program and only grew out of the need
to reformat certain phrases I'm finding in a file before re-outputting
it. But as I suspected, this searching of the lists slows the whole
process way way down. Was looking for ideas of a better way to do
this.

Know your basic computer science :)
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

There are probably C implementations that can be used from Python,
like:
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Bye,
bearophile
 
S

Steven D'Aprano

I'm reading in a text file, and for each line in the file, I'm looking
for the existence of phrases from a list. The list contains approx. 120
items currently but will most likely grow. This procedure itself is not
the main function of my program and only grew out of the need to
reformat certain phrases I'm finding in a file before re-outputting it.
But as I suspected, this searching of the lists slows the whole process
way way down. Was looking for ideas of a better way to do this.

I basically have

mylist=[]
...
code that reads in the flat file into string 'flatfileString' ...
for listitem in mylist:
if flatfileString.count(listitem):
...whatever...I found it.


For starters, why are you bothering to count occurrences of the string if
you only need a There/Not There answer? That's wasteful... it means the
code has to walk the entire length of the flatfileString every single
time. Now, string.count() is likely to be fast because it's written in C,
but it's not instantaneous. Better is:


for listitem in mylist:
if listitem in flatfileString:
process()


That should show a small improvement, but you can probably do better.
Here's two more simple approaches worth trying, all untested:

# Use a regex.
r = re.compile('|'.join(mylist)) # item 0 or item 1 or ...
if r.search(flatfileString):
process()


# Use a loop, re-writing it as a list comprehension for speed.
if any([item in flatfileString for item in mylist]):
process()


# As above, but a generator expression instead.
if any(item in flatfileString for item in mylist):
process()



You will probably find that which approach is faster depends on how many
items are in mylist.

If none of these approaches are fast enough, you may need to look at a
more complicated approach, such as Bearophile's suggestion.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,226
Members
46,815
Latest member
treekmostly22

Latest Threads

Top