Smart text parsing

Mathias Mamsch · Feb 6, 2004

Hi,

I have a text of about 1 million words, and I want to count the words and put
them into a sorted list
like " list = [(most-common-word,1001),(2nd-word,986), ...] "

I think about 10% of them (about 100,000) are different words.

I am wondering if you can give me something faster than my approach:
My first straightforward approach was:
----
s = "Hello this is my 1 million word text"

s2 = s.split()
dict = {}
for i in s2: # the loop needs 10s
    if dict.has_key(i):
        dict[i] += 1
    else:
        dict[i] = 1
list = dict.items()
# this is slow:
list.sort(lambda x,y: 2*(x[1] < y[1])-1)
----
That works, but I wonder if there is a faster, more elegant way to do this.

Thanks for your interest,
Mathias Mamsch

Josiah Carlson · Feb 6, 2004

> # this is slow:
> list.sort(lambda x,y: 2*(x[1] < y[1])-1)

list = zip(dict.values(), dict.keys())
list.sort()

This should be faster because it does not pass a comparison function to sort.

- Josiah
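
A minimal sketch of the same idea in present-day Python 3, where `sort` no longer accepts a comparison function and `dict.has_key` is gone. The sample text and the names `counts`, `pairs` and `by_count` are assumptions made up for illustration; they are not from the original posts.

```python
from operator import itemgetter

text = "the cat and the dog and the bird"  # stand-in for the real text

# Count each word with a plain dict.
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# The suggestion above: build (count, word) tuples, which compare
# natively, so no comparison function is needed.
pairs = sorted(zip(counts.values(), counts.keys()), reverse=True)

# A key-based sort that keeps the (word, count) layout instead.
# Words with equal counts may come out in a different order than in `pairs`.
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)
```

Both variants put the most frequent word first; which one is faster on a given text is best checked by timing it on the real data.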

Hans Nowak · Feb 6, 2004

Passing a comparison function to sort slows things down a lot. Try something
like this instead:

parts = "Hello this is my 1 million word text".split()
d = {} # word -> number of occurrences
for part in parts:
    if part in d:
        d[part] += 1
    else:
        d[part] = 1

lst = d.items()
lst = [(t[1], t[0]) for t in lst] # (frequency, string)
lst.sort() # sort as usual
lst.reverse() # reverse, so highest numbers are first

HTH,
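
For readers on a current Python, here is a hedged sketch of a different technique from the one proposed in this thread: the standard library's `collections.Counter`, which did not exist when these posts were written. The sample text and the names `word_counts`, `ranking` and `top_two` are assumptions for illustration only.

```python
from collections import Counter

text = "the cat and the dog and the bird"  # stand-in for the real text

# Counter tallies the words in one pass over the split text.
word_counts = Counter(text.split())

# most_common() returns (word, count) pairs, highest count first.
ranking = word_counts.most_common()

# Passing n limits the result to the n most frequent words.
top_two = word_counts.most_common(2)
```

This yields the `[(most-common-word, count), ...]` layout asked for in the original question without a separate sort step.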
