Python: Deleting specific words from a file.

papu

Hello, I have a data file (an unstructured, messy file) from which I have
to scrub a specific list of words (delete the words).

Here is what I am doing but with no result:

infile = "messy_data_file.txt"
outfile = "cleaned_file.txt"

delete_list = ["word_1","word_2"....,"word_n"]
new_file = []
fin=open(infile,"")
fout = open(outfile,"w+")
for line in fin:
    for word in delete_list:
        line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()

I have put the code above in a file. When I execute it, I don't see the
result file. I am not sure what the error is. Please let me know what
I am doing wrong.
 
MRAB

The .replace method _returns_ its result.

Strings are immutable; they can't be changed in place.
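
A quick way to see this for yourself is the small sketch below. The sample line is made up for illustration and is not taken from the original file.

```python
# Illustrative sample; the text and the word to delete are made up.
line = "keep word_1 this"

# Calling replace() and ignoring its return value leaves `line` untouched.
line.replace("word_1", "")
unchanged = line

# Rebinding the name to the returned string is what takes effect.
line = line.replace("word_1", "")
```

After this runs, `unchanged` still contains word_1, while `line` no longer does.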
 
John Gordon

papu said:

> Hello, I have a data file (un-structed messy file) from which I have
> to scrub specific list of words (delete words).
> Here is what I am doing but with no result:
> infile = "messy_data_file.txt"
> outfile = "cleaned_file.txt"
> delete_list = ["word_1","word_2"....,"word_n"]
> new_file = []

What does new_file do? I don't see it used anywhere.

> fin=open(infile,"")

There should be an "r" inside those quotes. In fact this is an error
and it will stop your program from running.

> fout = open(outfile,"w+")

What is the "+" supposed to do?

> for line in fin:
>     for word in delete_list:
>         line.replace(word, "")

replace() returns the modified string; it does not alter the existing
string.

Do this instead:

        line = line.replace(word, "")

>     fout.write(line)
> fin.close()
> fout.close()
> I have put the code above in a file. When I execute it, I dont see the
> result file. I am not sure what the error is. Please let me know what
> I am doing wrong.

When you say you don't see the result file, do you mean it doesn't get
created at all?
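
Putting the points above together, here is a minimal sketch of the corrected script. The function name scrub_file and the temporary sample file are my own additions for illustration; the open modes and the reassignment are the fixes discussed above.

```python
import os
import tempfile

def scrub_file(infile, outfile, delete_list):
    """Copy infile to outfile, removing every occurrence of each listed word."""
    # "r" for reading, plain "w" for writing; `with` closes both files.
    with open(infile, "r") as fin, open(outfile, "w") as fout:
        for line in fin:
            for word in delete_list:
                # replace() returns a new string, so rebind the name.
                line = line.replace(word, "")
            fout.write(line)

# Small self-check using temporary files (the sample text is made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "messy_data_file.txt")
    dst = os.path.join(tmp, "cleaned_file.txt")
    with open(src, "w") as f:
        f.write("alpha word_1 beta\nword_2 gamma\n")
    scrub_file(src, dst, ["word_1", "word_2"])
    with open(dst) as f:
        cleaned = f.read()
```

Note that replace() matches substrings, so a listed word is also removed when it appears inside a longer word, and the spaces around a deleted word stay behind.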
 
Terry Reedy

If you have very many words (and you will need all possible forms of
each word if you do exact matches), the following (untested and
incomplete) should run faster.

delete_set = {"word_1","word_2"....,"word_n"}
....
for line in fin:
    for word in line.split():
        if word not in delete_set:
            fout.write(word) # also write space and nl.
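
A slightly fuller sketch of the same idea, with the space and newline handling filled in. The helper name drop_words and the sample words are hypothetical, chosen only for illustration.

```python
def drop_words(line, delete_set):
    """Return the line with whitespace-separated tokens in delete_set removed."""
    kept = [word for word in line.split() if word not in delete_set]
    # split() discards the original spacing, so rejoin with single spaces
    # and put the newline back.
    return " ".join(kept) + "\n"

delete_set = {"word_1", "word_2"}   # placeholder words
result = drop_words("alpha word_1  beta word_2,\n", delete_set)
```

Here "word_2," survives because the trailing comma makes it a different token, which is the exact-match limitation mentioned above.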


Depending on what your file is like, you might be better with
re.split('(\W+)', line). An example from the manual:

>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

so all non-word separator sequences are preserved and written back out
(as they will not match the delete set).
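
As a rough sketch of how that could be used per line (the helper name and sample text are made up for illustration):

```python
import re

def drop_words_keep_separators(line, delete_set):
    # The capturing group makes re.split() return the separators as well,
    # so punctuation and whitespace pass through unchanged.
    pieces = re.split(r'(\W+)', line)
    return "".join(p for p in pieces if p not in delete_set)

delete_set = {"the", "rain"}   # placeholder words
result = drop_words_keep_separators("the rain, in Spain.\n", delete_set)
```

The separators that surrounded a deleted word are kept, so some extra spacing remains in the output.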
 
gry

re.sub is handy too:
import re
delete_list=('the','rain','in','spain')
regex = re.compile('\W' + '|'.join(delete_list) + '\W')
infile='messy'
with open(infile, 'r') as f:
    for l in f:
        print(regex.sub('', l))
 
MRAB

> re.sub is handy too:
> import re
> delete_list=('the','rain','in','spain')
> regex = re.compile('\W' + '|'.join(delete_list) + '\W')

You need parentheses around the words (I'm using non-capturing
parentheses):

regex = re.compile(r'\W(?:' + '|'.join(delete_list) + r')\W')

otherwise you'd get: '\Wthe|rain|in|spain\W'.
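
The reason is that | binds more loosely than anything else in the pattern, so without the parentheses the leading \W belongs only to 'the' and the trailing \W only to 'spain'. A small comparison, with a made-up sample string:

```python
import re

delete_list = ('the', 'rain', 'in', 'spain')

# Without grouping: '\Wthe', 'rain', 'in' and 'spain\W' are the alternatives.
ungrouped = re.compile(r'\W' + '|'.join(delete_list) + r'\W')
# With non-capturing grouping: \W applies on both sides of every word.
grouped = re.compile(r'\W(?:' + '|'.join(delete_list) + r')\W')

sample = "a trained dog"   # illustrative text only
ungrouped_result = ungrouped.sub('', sample)
grouped_result = grouped.sub('', sample)
```

The ungrouped pattern removes the 'rain' inside 'trained', while the grouped one leaves the sample alone.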

Even better is the word-boundary, in case there's no previous or next
character:

regex = re.compile(r'\b(?:' + '|'.join(delete_list) + r')\b')

Raw string literals are recommended for regexes.
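
For completeness, a short sketch of the word-boundary version applied to a sample sentence. The sentence is made up, and the use of re.escape is my own addition, in case a word in the list contains regex metacharacters.

```python
import re

delete_list = ('the', 'rain', 'in', 'spain')

# re.escape guards against metacharacters in the words (an addition here).
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in delete_list) + r')\b'
regex = re.compile(pattern)

sample = "the rain in spain stays mainly in the plain"
result = regex.sub('', sample)
```

Because \b matches a position rather than a character, 'the' is removed even at the very start of the string, and the 'in' inside 'mainly' and 'plain' is left alone. The spaces around the deleted words remain in the result.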
 
