How do I skip over multiple words in a file?

chad

Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?
 
Tim Chase

Let's say that I have an article. What I want to do is read in
this file and have the program skip over ever instance of the
words "the", "and", "or", and "but". What would be the
general strategy for attacking a problem like this?

I'd keep a file of "stop words" and read them into a set
(normalizing case in the process). Then, as I skim over each
word in the target file, I'd check whether the case-normalized
version of the word is in the stop-word set and skip it if so.
It might look something like this:

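def normalize_word(s):
    # Strip surrounding whitespace and fold case so lookups are case-insensitive
    return s.strip().upper()

# One stop word per line in stop_words.txt
with open('stop_words.txt') as f:
    stop_words = set(normalize_word(word) for word in f)

with open('data.txt') as f:
    for line in f:
        for word in line.split():
            if normalize_word(word) in stop_words:
                continue
            process(word)  # process() stands for whatever you do with each kept word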

-tkc
 
r0g

Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?


If your files are not too big, I'd simply read them into a string and do
a string replace for each word you want to skip. If you want case
insensitivity, use re.sub() with the re.IGNORECASE flag instead of the
default str.replace() method. Neither is elegant or all that efficient,
but both are very easy. If your use case requires something high
performance, then best keep looking :)
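
A rough sketch of the regex variant, assuming the whole text fits in memory. The name remove_words and the sample sentence are made up for illustration. The \b word boundaries matter here: without them, a plain substring replace would also eat the "the" inside "other".

```python
import re

STOP_WORDS = ("the", "and", "or", "but")

# \b anchors each match at word boundaries; IGNORECASE covers "The", "AND", etc.
_pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, STOP_WORDS)),
    re.IGNORECASE,
)

def remove_words(text):
    """Return text with the stop words removed and runs of spaces collapsed."""
    stripped = _pattern.sub("", text)
    # Removing a word leaves its surrounding spaces behind, so tidy them up
    return re.sub(r"[ \t]+", " ", stripped).strip()

print(remove_words("The cat and the other dog"))
```

This builds a whole new string, so it is only sensible when you want the filtered text back rather than a stream of words.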

Roger.
 
Paul Watson

Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?

I realize that you may need or want to do this in Python. This would be
trivial in an awk script.
 
Paul Rubin

chad said:
Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?

Something like (untested):

stopwords = set(('and', 'or', 'but'))

def goodwords(f):
    # f is an open file (or any other iterable of lines)
    for line in f:
        for w in line.split():
            if w.lower() not in stopwords:
                yield w

Removing punctuation is left as an exercise.
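
One possible take on that exercise, as a sketch only: strip leading and trailing punctuation from each whitespace-separated token before the lookup. The STOPWORDS set here is an assumption that lists all four words from the question, and the sample lines are invented.

```python
import string

STOPWORDS = set(('the', 'and', 'or', 'but'))

def goodwords(lines):
    """Yield words from an iterable of lines, skipping stop words."""
    for line in lines:
        for w in line.split():
            # Strip punctuation at the ends only; "Let's" keeps its apostrophe
            core = w.strip(string.punctuation)
            if core and core.lower() not in STOPWORDS:
                yield core

sample = ['Skip the words "the", "and", "or", and "but".', "Let's see."]
print(list(goodwords(sample)))
```

Because only the ends of each token are stripped, hyphenated words and contractions survive intact, while a token made purely of punctuation is dropped.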
 
Stefan Sonnenberg-Carstens

On 11.11.2010 21:33, Paul Watson wrote:
I realize that you may need or want to do this in Python. This would
be trivial in an awk script.

There are several ways to do this.

skip = ('and', 'or', 'but')
words = []  # avoid naming this "all", which would shadow the built-in
with open('some.txt') as f:
    for line in f:
        words.extend(w for w in line.split() if w not in skip)
print(words)

If some.txt contains your original question, it prints this:
["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
 'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
 'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
 '"the",', '"and",', '"or",', '"but".', 'What', 'would', 'be', 'the',
 'general', 'strategy', 'for', 'attacking', 'a', 'problem', 'like',
 'this?']

But this is just _one_ way to get there.
Faster solutions could be based on a regex:
import re
skip = ('and', 'or', 'but')
word_re = re.compile(r'\w+')  # raw string so \w is not treated as a string escape
with open('some.txt') as f:
    print([w for w in word_re.findall(f.read()) if w not in skip])

This gives the following result (you lose some punctuation, etc.):
['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
 'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
 'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
 'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
 'attacking', 'a', 'problem', 'like', 'this']
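
If splitting "Let's" into 'Let' and 's' is a problem, one option is to let the pattern accept an apostrophe inside a word. This is only a sketch: the pattern, the name words_kept, and the sample text are assumptions, and the comparison is lowercased so that "Or" is skipped too.

```python
import re

skip = ('and', 'or', 'but')

# A word is a run of \w characters, optionally continued by 'xyz parts
# so that contractions such as "Let's" stay in one piece
word_re = re.compile(r"\w+(?:'\w+)*")

def words_kept(text):
    """Return the words of text that are not in skip, ignoring case."""
    return [w for w in word_re.findall(text) if w.lower() not in skip]

print(words_kept('Let\'s skip "and", "Or", and "but".'))
```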

But there are so many ways to do it ...
 
