How to find all the same words in a text?

Johny

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

Must I use regex?
Thanks for help
L.
 
Marco Giusti

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

ciao
marco

 
Johny

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?
Thanks.
L
 
Marco Giusti

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?

play with count and index and take a look at the help of both

ciao
marco
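
One way to read the hint above, as a minimal sketch only: it assumes words are separated by whitespace, so `str.split()` yields the word list, and `list.count` and `list.index` then compare whole words. The positions it reports are word indexes in that list, not character offsets, and `word_positions` is a name made up for illustration.

```python
text = '45 324 45324'
words = text.split()  # whitespace-separated words

# list.count compares whole items, so '45324' is not counted as '324'.
count = words.count('324')

def word_positions(words, target):
    """Return the list indexes at which target occurs (hypothetical helper)."""
    positions = []
    start = 0
    while True:
        try:
            # list.index(value, start) searches from index 'start' onwards.
            start = words.index(target, start)
        except ValueError:
            return positions
        positions.append(start)
        start += 1

print(count)
print(word_positions(words, '324'))
```

If character offsets are needed rather than word indexes, a regular expression with word boundaries is probably the simpler route.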

 
Thorsten Kampe

* Johny (10 Feb 2007 05:29:23 -0800)
I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

Must I use regex?

There are two approaches: one is the "solve once and forget" approach
where you code around this particular problem. Marco showed you one
solution for this.

The other approach would be to realise that your problem is a specific
case of two general problems: partitioning a sequence by a separator
and partitioning a sequence into equivalence classes. The bonus of this
approach is that you will have a /lot/ of problems that can be solved
with either one of these utils or a combination of them.

>>> a = '45 324 45324'
>>> quotient_set(part(a, [' ', ' '], 'sep'), ident)
{'324': ['324'], '45': ['45'], '45324': ['45324']}

The latter approach is much more flexible. Just imagine your problem
changes to a string that's separated by newlines (instead of spaces)
and you want to find words that start with the same character (instead
of words that are identical).


Thorsten
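
`part`, `quotient_set`, and `ident` are not defined anywhere in this thread, so their real signatures are unknown. What follows is only a rough sketch of the idea: a hypothetical `quotient_set(items, key)` groups items into equivalence classes by a key function, and plain `str.split` stands in for the partitioning step.

```python
from collections import defaultdict

def ident(x):
    """Identity key: items are equivalent only if they are equal."""
    return x

def quotient_set(items, key):
    """Group items into equivalence classes keyed by key(item).

    Hypothetical stand-in, not the utility used in the post above.
    """
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)

a = '45 324 45324'
same_words = quotient_set(a.split(' '), ident)

# The changed problem: newline-separated words grouped by first character.
b = 'apple\navocado\nbanana'
by_initial = quotient_set(b.split('\n'), lambda word: word[0])

print(same_words)
print(by_initial)
```

Only the separator and the key function change between the two calls, which seems to be the flexibility the post is pointing at.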
 
Samuel Karl Peterson

I need to find all occurrences of the same word in a text.
What would be the best way to do that?

I make no claims of this being the best approach:

====================
import re

def findOccurances(a_string, word):
    """
    Given a string and a word, returns a pair:
    [0] = count, [1] = list of indexes where word occurs
    """
    count = 0
    indexes = []
    start = 0  # offset of the current slice within the original string
    # re.escape keeps regex metacharacters in word from being interpreted.
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)

    while True:
        match = pattern.search(a_string)
        if not match:
            break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)
====================

Seems to work for me. No guarantees.
 
Neil Cerutti

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

Must I use regex?
Thanks for help

The first thing to do is to answer the question: What is a word?

The second thing to do is to design some code that can find
words in strings.

The last thing to do is to search those actual words for the word
you're looking for.
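
The three steps above can be sketched as follows. This assumes, purely for illustration, that a word is a maximal run of characters matched by `\w+`; `iter_words` and `find_word` are hypothetical names.

```python
import re

# Step 1: an (assumed) definition of a word: a run of \w characters.
WORD = re.compile(r'\w+')

def iter_words(text):
    """Step 2: yield (offset, word) for every word found in text."""
    for match in WORD.finditer(text):
        yield match.start(), match.group()

def find_word(text, target):
    """Step 3: character offsets of the words that equal target exactly."""
    return [offset for offset, word in iter_words(text) if word == target]

print(find_word('45 324 45324', '324'))
```

Changing the answer to "what is a word?" only means changing the `WORD` pattern; the search step stays the same.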
 
Maël Benjamin Mettler

In order to find all the words in a text, you need to tokenize it first.
The rest is a matter of calling the count method on the list of
tokenized words. For tokenization look here:
http://nltk.sourceforge.net/lite/doc/en/words.html
A word of warning: depending on what exactly you need to do, the
seemingly trivial task of tokenizing a text can become quite complex.

Enjoy,

Maël
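
As a rough illustration of tokenize-then-count, here is a sketch that does not use NLTK at all. It assumes a deliberately crude tokenizer (lowercased runs of ASCII letters, digits, and apostrophes), and `tokenize` is a hypothetical helper, so it glosses over exactly the complexity the post warns about.

```python
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercased runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The cat saw the other cat. Don't the cats count?")

# list.count gives the number of occurrences of one token.
the_count = tokens.count('the')

# Counter tallies every token in a single pass.
tally = Counter(tokens)

print(the_count)
print(tally.most_common(2))
```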
 
attn.steven.kuo

I need to find all occurrences of the same word in a text.
What would be the best way to do that?

I make no claims of this being the best approach:

====================
import re

def findOccurances(a_string, word):
    """
    Given a string and a word, returns a pair:
    [0] = count, [1] = list of indexes where word occurs
    """
    count = 0
    indexes = []
    start = 0  # offset of the current slice within the original string
    # re.escape keeps regex metacharacters in word from being interpreted.
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)

    while True:
        match = pattern.search(a_string)
        if not match:
            break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)
====================

Seems to work for me. No guarantees.



More concisely:

import re

target_string = '45 324 45324'  # sample text from the original question
pattern = re.compile(r'\b324\b')
indices = [match.start() for match in pattern.finditer(target_string)]
print("Indices:", indices)
print("Count:", len(indices))
 
Samuel Karl Peterson

(e-mail address removed) on 11 Feb 2007 08:16:11 -0800 didst step
forth and proclaim thus:
More concisely:

import re

target_string = '45 324 45324'  # sample text from the original question
pattern = re.compile(r'\b324\b')
indices = [match.start() for match in pattern.finditer(target_string)]
print("Indices:", indices)
print("Count:", len(indices))

Thank you, this is educational. I didn't realize that finditer
returned match objects instead of tuples.
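
To make that point concrete, here is a small sketch of what `finditer` yields. The sample string is made up for illustration; `group()`, `start()`, `end()`, and `span()` are standard methods of `re` match objects.

```python
import re

pattern = re.compile(r'\b324\b')
sample = '324 45324 324'  # made-up sample text

# Each item from finditer is a match object, not a tuple.
for match in pattern.finditer(sample):
    print(match.group(), match.start(), match.end())

# span() returns the (start, end) pair if a tuple is what you want.
spans = [match.span() for match in pattern.finditer(sample)]
print(spans)
```

By contrast, `pattern.findall(sample)` returns only the matched strings, with no position information.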
 
