Smart text parsing

M

Mathias Mamsch

Hi,

I got a text with about 1 million words where I want to count words and put
them sorted to a list
like " list = [(most-common-word,1001),(2nd-word,986), ...] "

I think there are at about 10% (about 100.000) different words in the text.

I am wondering if you can give me something faster than my approach:
My first straightforward approach was:
----
s = "Hello this is my 1 million word text".split()

s2 = s.split()
dict = {}
for i in s2: # the loop needs 10s
if dict.has_key(i):
dict += 1
else:
dict = 1
list = dict.items()
# this is slow:
list.sort(lambda x,y: 2*(x[1] < y[1])-1)
----
That works, but i wonder if there is a faster, more elegant way to do this
....

Thanks for you interest,
Mathias Mamsch
 
J

Josiah Carlson

s = "Hello this is my 1 million word text".split()

s2 = s.split()
dict = {}
for i in s2: # the loop needs 10s
if dict.has_key(i):
dict += 1
else:
dict = 1

list = dict.items()
> # this is slow:
> list.sort(lambda x,y: 2*(x[1] < y[1])-1)


list = zip(dict.values(), dict.keys())
list.sort()

Should be faster due to not using the sort function argument.

- Josiah
 
H

Hans Nowak

Mathias said:
I got a text with about 1 million words where I want to count words and put
them sorted to a list
like " list = [(most-common-word,1001),(2nd-word,986), ...] "

I think there are at about 10% (about 100.000) different words in the text.

I am wondering if you can give me something faster than my approach:
My first straightforward approach was:
----
s = "Hello this is my 1 million word text".split()

s2 = s.split()
dict = {}
for i in s2: # the loop needs 10s
if dict.has_key(i):
dict += 1
else:
dict = 1
list = dict.items()
# this is slow:
list.sort(lambda x,y: 2*(x[1] < y[1])-1)
----


Passing a comparison function to sort slows things down a lot. Try something
like this instead:

parts = "Hello this is my 1 million word text".split()
for part in parts:
if d.has_key(part):
d[part] += 1
else:
d[part] = 1

lst = d.items()
lst = [(t[1], t[0]) for t in lst] # (frequency, string)
lst.sort() # sort as usual
lst.reverse() # reverse, so highest numbers are first

HTH,
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,176
Messages
2,570,950
Members
47,503
Latest member
supremedee

Latest Threads

Top