Fred Mangusta
Hi,
I'm relatively new to programming in general, and totally new to Python,
and I've been told that this language is particularly good for what I
need to do. Let me explain.
I have a large corpus of English text, in the form of several files.
First of all I would like to scan each file. Then, for each word I find,
I'd like to examine its case status, and write the (lower case) word back
to another text file - with, appended, a tag stating the case it had in
the original file.
An example. Suppose we have three possible "case conditions":
-all lowercase
-all uppercase
-initial uppercase only
Three corresponding tags for each of these might be, respectively:
-nocap
-allcaps
-cap
Therefore, given the string
"The Chairman of BP was asleep"
I would like to produce
"the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"
and write this into a file.
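To make the mapping concrete, here is a rough Python sketch of how one line might be tagged. The names check_case and tag_line are made up for illustration, and treating any word that fits none of the three conditions (for example "iPod" or "42") as nocap is an assumption, not part of the requirement above.

```python
def check_case(word):
    """Return 'allcaps', 'cap' or 'nocap' for a single word.

    Assumption: anything that matches none of the three case
    conditions falls back to 'nocap'.
    """
    if word.isupper():
        # e.g. "BP"; note that a one-letter word such as "I" also lands here
        return 'allcaps'
    if word[:1].isupper() and word[1:].islower():
        # first letter uppercase, the rest lowercase, e.g. "Chairman"
        return 'cap'
    return 'nocap'


def tag_line(line):
    """Lower-case every word of a line and append its case tag."""
    return ' '.join('%s/%s' % (w.lower(), check_case(w)) for w in line.split())


print(tag_line("The Chairman of BP was asleep"))
# the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap
```

The built-in string methods isupper() and islower() look at the whole word at once, so no explicit loop over characters is needed here.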
I have the following algorithm in mind:
-open input file
-open output file
-get line of text
-split line into words
-for each word
    -tag = checkCase(word)
    -newword = lowercase(word) + append(tag)
-rejoin words into line
-write line into output file
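The steps above could be sketched roughly as follows. This is only one possible shape, not a definitive implementation: case_tag, tag_stream and tag_file are hypothetical names, the fallback to nocap for words outside the three conditions is an assumption, and the demo uses in-memory text in place of real corpus files.

```python
import io


def case_tag(word):
    # Three-way case test; anything else falls back to 'nocap' (assumption).
    if word.isupper():
        return 'allcaps'
    if word[:1].isupper() and word[1:].islower():
        return 'cap'
    return 'nocap'


def tag_stream(infile, outfile):
    """Read lines from infile, write tagged lines to outfile.

    Returns the number of lines processed.
    """
    count = 0
    for line in infile:                        # get line of text
        words = line.split()                   # split line into words
        tagged = [w.lower() + '/' + case_tag(w) for w in words]
        outfile.write(' '.join(tagged) + '\n')  # rejoin and write
        count += 1
    return count


def tag_file(in_path, out_path):
    """Open the input and output files and tag one into the other."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        return tag_stream(infile, outfile)


# Demo on in-memory text instead of real files.
src = io.StringIO("The Chairman of BP was asleep\nHe woke up\n")
dst = io.StringIO()
n = tag_stream(src, dst)
print(n)
print(dst.getvalue())
```

Iterating over the file object reads one line at a time, so the whole corpus never has to fit in memory, and ' '.join() rebuilds each line in a single call.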
Now, I managed to write the following initial code:
lines = 0
for s in file:
    lines += 1
    if lines % 1000 == 0:
        print('%d lines' % lines)  # report progress every 1000 lines
    sent = s.split()  # split the line on whitespace
    #...
But then I don't quite know what would be the fastest/best way to do
this. Could I use the join function to rebuild the string? And, regarding
the checkCase() function, what do you suggest? Should I test each
character of each word, or are there faster methods?
Thanks very much,
F.