Simple Text Processing Help

patrick.waldo

Hi all,

I started Python just a little while ago and I am stuck on something
that is really simple, but I just can't figure out.

Essentially I need to take a text document with some chemical
information in Czech and organize it into another text file. The
information is always EINECS number, CAS, chemical name, and formula
in tables. I need to organize them into lines with | in between. So
it goes from:

200-763-1 71-73-8
nátrium-tiopentál C11H18N2O2S.Na

to:

200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like: kyselina močová

I get:
200-720-7|69-93-2|kyselina|močová
|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off.

How can I get Python to realize that a chemical name may have a space
in it?

Thank you,
Patrick

So far I have:

#take tables in one text file and organize them into lines in another

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

#read and enter into a list
chem_file = []
chem_file.append(input.read())

#split words and store them in a list
for word in chem_file:
    words = word.split()

#starting values in list
e=0 #EINECS
c=1 #CAS
ch=2 #chemical name
f=3 #formula

n=0
loop=1
x=len(words) #counts how many words there are in the file

print '-'*100
while loop==1:
    if n<x and f<=x:
        print words[e], '|', words[c], '|', words[ch], '|', words[f], '\n'
        output.write(words[e])
        output.write('|')
        output.write(words[c])
        output.write('|')
        output.write(words[ch])
        output.write('|')
        output.write(words[f])
        output.write('\r\n')
        #increase variables by 4 to get next set
        e = e + 4
        c = c + 4
        ch = ch + 4
        f = f + 4
        # increase by 1 to repeat
        n=n+1
    else:
        loop=0

input.close()
output.close()
 
Marc 'BlackJack' Rintsch

> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:

Is that in *one* line in the input file or two lines like shown here?

> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?

If the two elements before and the one element after the name can't
contain spaces it is easy: take the first two and the last as they are,
and for the name take everything from the third element up to the
next-to-last one and join those parts with a space.

In [202]: parts = '123 456 a name with spaces 789'.split()

In [203]: parts[0]
Out[203]: '123'

In [204]: parts[1]
Out[204]: '456'

In [205]: ' '.join(parts[2:-1])
Out[205]: 'a name with spaces'

In [206]: parts[-1]
Out[206]: '789'

This works too if the name doesn't have a space in it:

In [207]: parts = '123 456 name 789'.split()

In [208]: parts[0]
Out[208]: '123'

In [209]: parts[1]
Out[209]: '456'

In [210]: ' '.join(parts[2:-1])
Out[210]: 'name'

In [211]: parts[-1]
Out[211]: '789'
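
The same idea can be wrapped in a small helper. This is only a sketch, written in Python 3 syntax rather than the Python 2 of the session above; the name `split_record` is made up for illustration, and it assumes that only the name field can contain spaces.

```python
def split_record(line):
    """Split one record into (first, second, name, last).

    Assumes only the third field (the name) may contain spaces.
    """
    parts = line.split()
    if len(parts) < 4:
        raise ValueError("expected at least 4 whitespace-separated fields")
    # Extended unpacking: everything between the first two fields
    # and the last field is collected into name_words.
    first, second, *name_words, last = parts
    return first, second, " ".join(name_words), last


print(split_record("123 456 a name with spaces 789"))
print(split_record("123 456 name 789"))
```

The length check is there because a record with fewer than four fields cannot be split this way; what to do with such a record is left open here.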

> #read and enter into a list
> chem_file = []
> chem_file.append(input.read())

This reads the whole file and puts it into a list. This list will
*always* just contain *one* element. So why a list at all!?

> #split words and store them in a list
> for word in chem_file:
>     words = word.split()

*If* the list contained more than one element, all of them would be
processed, but only the last would stay bound to `words`. You could leave
out `chem_file` and the loop and simply do:

words = input.read().split()

Same effect but less chatty. ;-)

The rest of the source seems to indicate that you don't really want to read
in the whole input file at once but process it line by line, i.e. chemical
element by chemical element.

Ciao,
Marc 'BlackJack' Rintsch
 
Paul Hankin

In the original file, is every chemical on a line of its own? I assume
it is here.

You might use a regexp (look at the re module), or I think here you
can use the fact that only chemical names have spaces in them. Then you
can split each line on whitespace (like you're doing), and join back
together all the words between the 3rd (i.e. index 2) and the last (i.e.
index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
the somewhat unusual Python syntax for replacing a section of a list
with another list.
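
As a small illustration of that slice assignment, here is a sketch in Python 3 syntax. The helper name `merge_name` is hypothetical, and the sample lines are assumed to hold one complete record each.

```python
def merge_name(tokens):
    """Return a copy of tokens with the name words merged into one item."""
    merged = list(tokens)
    # Replace the slice from index 2 up to (not including) the last item
    # with a one-element list holding the rejoined name.
    merged[2:-1] = [" ".join(merged[2:-1])]
    return merged


two_word = "200-720-7 69-93-2 kyselina močová C5H4N4O3".split()
one_word = "200-001-8 50-00-0 formaldehyd CH2O".split()

print(len(two_word), "->", len(merge_name(two_word)))
print("|".join(merge_name(two_word)))
print("|".join(merge_name(one_word)))
```

The list shrinks from five items to four in the first case and is left with the same four items in the second, so the same line of code handles both.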

The approach you took involves reading the whole file, and building a
list of all the chemicals which you don't seem to use: I've changed it
to a per-line version and removed the big lists.

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    chemical = u'|'.join(tokens)
    print chemical + u'\n'
    output.write(chemical + u'\r\n')

input.close()
output.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt
file.
 
patrick.waldo

Thank you both for helping me out. I am still rather new to Python
and so I'm probably trying to reinvent the wheel here.

When I try to do Paul's response, I get []

So I am not quite sure how to read line by line.

tokens = input.read().split() gets me all the information from the
file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
in the example; however, how can I loop this for the entire document?
Also, when I try output.write(tokens), I get "TypeError: coercing to
Unicode: need string or buffer, list found".

Any ideas?
 
Marc 'BlackJack' Rintsch

> Thank you both for helping me out. I am still rather new to Python
> and so I'm probably trying to reinvent the wheel here.
>
> When I try to do Paul's response, I get []

What is in `line`? Paul wrote this in the body of the ``for`` loop over
all the lines in the file.

> So I am not quite sure how to read line by line.

That's what the ``for`` loop over a file or file-like object is doing.
Maybe you should develop your script in smaller steps and do some printing
to see what you get at each step. For example after opening the input
file:

for line in input:
    print line # prints the whole line.
    tokens = line.split()
    print tokens # prints a list with the split line.

> tokens = input.read().split() gets me all the information from the
> file.

Right, it reads *all* of the file, not just one line.

> tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
> in the example; however, how can I loop this for the entire document?

Don't read the whole file but line by line, just like Paul showed you.

> Also, when I try output.write(tokens), I get "TypeError: coercing to
> Unicode: need string or buffer, list found".

`tokens` is a list but you need to write a unicode string. So you have to
reassemble the parts with '|' characters in between. Also shown by Paul.
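
A small sketch of the difference, in Python 3 syntax and with an in-memory `io.StringIO` standing in for the real output file (both are assumptions made for illustration). Joining a list of strings gives one string that can be written, whereas calling `join` on a single string walks over its characters.

```python
import io

tokens = ["200-001-8", "50-00-0", "formaldehyd", "CH2O"]
output = io.StringIO()  # stand-in for the real output file

try:
    output.write(tokens)  # a list is not a string, so this fails
except TypeError as exc:
    print("write() rejected the list:", exc)

# Join the list into one string first, then write that string.
line = "|".join(tokens)
output.write(line + "\r\n")
print(line)

# join() applied to ONE string treats it as a sequence of characters.
per_char = "|".join(tokens[0])
print(per_char)
```

So the separator belongs to `join`, and the thing being joined should be the whole list of fields, not each field in turn.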

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

Your input file could be in one of THREE formats:
(1) fields are separated by TAB characters (represented in Python by
the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a random number of whitespace characters
(and can contain spaces).

What makes you sure that you have format 3? You might like to try
something like

lines = open('your_file.txt').readlines()[:4]
print lines
print map(len, lines)

This will print a *precise* representation of what is in the first
four lines, plus their lengths. Please show us the output.
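
As a rough sketch of how such output can be read (Python 3 syntax; the sample byte strings are made up for illustration and `describe_line` is a hypothetical helper): a `\t` in the repr points to format 1, runs of several spaces point to format 2 or 3, and a leading `\xef\xbb\xbf` is a UTF-8 byte order mark rather than data.

```python
import codecs


def describe_line(raw):
    """Return a short description of the separators in one raw line (bytes)."""
    notes = []
    if raw.startswith(codecs.BOM_UTF8):
        notes.append("starts with UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    body = raw.rstrip(b"\r\n")
    if not body.strip():
        notes.append("blank line")
    elif b"\t" in body:
        notes.append("tab-separated")
    elif b"  " in body:
        notes.append("padded with multiple spaces")
    else:
        notes.append("single spaces only")
    return ", ".join(notes)


# Made-up sample lines, read as bytes so nothing is decoded or hidden.
sample = [
    b"\xef\xbb\xbf111-111-1     22-22-2\n",
    b"some name     X1Y2\n",
    b"\n",
    b"333-333-3\t44-44-4\n",
]
for raw in sample:
    print(repr(raw), len(raw), describe_line(raw))
```

If a file turns out to mix tabs and padded spaces from line to line, splitting on any whitespace is the only one of the three approaches that copes with all of it.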
 
patrick.waldo

> lines = open('your_file.txt').readlines()[:4]
> print lines
> print map(len, lines)

gave me:
['\xef\xbb\xbf200-720-7 69-93-2\n', 'kyselina mo\xc4\x8dov\xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line by line
part. My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])] #this doesn't seem to combine the files correctly
    file = u'|'.join(tokens) #this does put '|' in between
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()

My sample input file looks like this (not organized, as you see):
200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH

etc...

and after the program I get:

200-720-7|69-93-2|
kyselina|mocová||C5H4N4O3

200-001-8|50-00-0|
formaldehyd|CH2O|

200-002-3|
50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc...
So, I am sort of back at the start again.

If I add:

tokens = line.strip().split()
for token in tokens:
    print token

I get all the single tokens, which I thought I could then put
together, except when I did:

for token in tokens:
    s = u'|'.join(token)
    print s

I got ?|2|0|0|-|7|2|0|-|7, etc...

How can I join these together into nice neat little lines? When I try
to store the tokens in a list, the tokens double and I don't know
why. I can work on getting the chemical names together after...baby
steps, or maybe I am just missing something obvious. The first two
numbers will always be the same three digits-three digits-one digit
and then two digits-two digits-one digit. This seems to be the
only pattern.

My intuition tells me that I need to add an if statement that says, if
the first two numbers follow the pattern, then continue, if they don't
(i.e. a chemical name was accidentally split apart) then the third entry
needs to be put together. Something like

if tokens[1] and tokens[2] startswith('pattern') == true
    tokens[2] = join(tokens[2]:tokens[3])
    token[3] = token[4]
    del token[4]

but the code isn't right...any ideas?

Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have
a couple O'Reilly books, but they don't seem to have a straightforward
example for this kind of text manipulation.

Patrick

 
Marc 'BlackJack' Rintsch

> my sample input file looks like this( not organized,as you see it):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH
>
> etc...

That's quite irregular so it is not that straightforward. One way is to
split everything into words, start a record by taking the first two
elements and then look for the start of the next record that looks like
three numbers concatenated by '-' characters. Quick and dirty hack:

import codecs
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    tokens = iter(tokens)
    try:
        nr_a = tokens.next()
        while True:
            nr_b = tokens.next()
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    # start of the next record: emit the current one
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        # input exhausted: emit the last record
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])


def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print '|'.join(element)


if __name__ == '__main__':
    main()

Ciao,
Marc 'BlackJack' Rintsch
 
Paul Hankin

Maybe this is a bit more readable?

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # join the name of the last chemical as well
    chem[2:-1] = [' '.join(chem[2:-1])]
    yield chem
 
Peter Otten

patrick.waldo said:
my sample input file looks like this( not organized,as you see it):
200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH

Assuming that the records are always separated by blank lines and only the
third field in a record may contain spaces the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

def fields(s):
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

if __name__ == "__main__":
    outstream = codecs.open(path2, 'w', 'utf8')
    for record in records(codecs.open(path, "r", "utf8")):
        outstream.write("|".join(fields(record)))
        outstream.write("\n")

Peter
 
patrick.waldo

Wow, thank you all. All three work. To output correctly I needed to
add:

output.write("\r\n")

This is really a great help!!

Because of my limited Python knowledge, I will need to try to figure
out exactly how they work for future text manipulation and for my own
knowledge. Could you recommend some resources for this kind of text
manipulation? Also, I conceptually get it, but would you mind walking
me through
for tok in tokens:
    if NR_RE.match(tok) and len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
        yield chem
        chem = []
    chem.append(tok)
and

for key, group in groupby(instream, unicode.isspace):
    if not key:
        yield "".join(group)


Thanks again,
Patrick
 
Paul Hankin

> Because of my limited Python knowledge, I will need to try to figure
> out exactly how they work for future text manipulation and for my own
> knowledge. Could you recommend some resources for this kind of text
> manipulation? Also, I conceptually get it, but would you mind walking
> me through
> for tok in tokens:
>     if NR_RE.match(tok) and len(chem) >= 4:
>         chem[2:-1] = [' '.join(chem[2:-1])]
>         yield chem
>         chem = []
>     chem.append(tok)

Sure: 'chem' is a list of all the data associated with one chemical.
When a token (tok) arrives that is matched by NR_RE (i.e. 3 lots of
digits separated by hyphens), it's assumed that this is the start of a
new chemical if we've already got 4 pieces of data. Then we join the
name back up (as was explained in earlier posts), and 'yield chem'
yields up the chemical so far; and a new chemical is started (by
emptying the list). Whatever tok is, it's added to the end of the
current chemical data. Add some print statements to watch it work
if you can't follow it.

This code uses exactly the same algorithm as Marc's code - it's just a
bit clearer (or at least, I thought so). Oh, and it returns a list
rather than a tuple, but that makes no difference.
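
To watch it work without a real input file, here is a sketch of the same loop in Python 3 syntax, run over the sample data quoted earlier in the thread (an in-memory string is assumed in place of the file). The join before the last `yield` is an addition in this sketch, so that a multi-word name in the final record is rejoined too.

```python
import re

# A token made of three groups of digits joined by hyphens.
NR_RE = re.compile(r'^\d+-\d+-\d+$')


def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # A number-like token starts a new chemical only once the
        # current one already has at least 4 pieces of data.
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    if chem:
        # Addition in this sketch: rejoin the name of the last record.
        chem[2:-1] = [' '.join(chem[2:-1])]
        yield chem


sample = """200-720-7 69-93-2
kyselina močová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

records = list(iter_elements(sample.split()))
for chem in records:
    print('|'.join(chem))
```

Note that the CAS number also matches NR_RE; it is not mistaken for a new record because at that point the list holds only one item.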
 
Paul McGuire

Pyparsing might be overkill for this example, but it is a good sample
for a demo. If you end up doing lots of data extraction like this,
pyparsing is a useful tool. In pyparsing, you define expressions
using pyparsing classes and built-in strings, then use the constructed
pyparsing expression to parse the data (using parseString, scanString,
searchString, or transformString). In this example, searchString is
the easiest to use. After the parsing is done, the parsed fields are
returned in a ParseResults object, which has some list and some dict
style behavior. I've given each field a name based on your post, so
that you can read the tokens right out of the results as if they were
attributes of an object. This example emits your '|' delimited data,
but the commented lines show how you could access the individually
parsed fields, too.

Learn more about pyparsing at http://pyparsing.wikispaces.com/ .

-- Paul


# -*- coding: iso-8859-15 -*-

data = """200-720-7 69-93-2
kyselina mocová C5H4N4O3


200-001-8 50-00-0
formaldehyd CH2O


200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH

"""

from pyparsing import Word, nums,OneOrMore,alphas,alphas8bit

# define expressions for each part in the input data

# a numeric id starts with a number, and is followed by
# any number of numbers or '-'s
numericId = Word(nums, nums+"-")

# a chemical name is one or more words, each made up of
# alphas (including 8-bit alphas) or '-'s
chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

# when returning the chemical name, rejoin the separate
# words into a single string, with spaces
chemName.setParseAction(lambda t:" ".join(t))

# a chemical formula is a 'word' starting with an uppercase
# alpha, followed by alphas, numbers, or '.'s (e.g. CH5N3.ClH)
chemFormula = Word(alphas.upper(), alphas+nums+".")

# put all expressions into overall form, and attach field names
entry = numericId("EINECS") + \
        numericId("CAS") + \
        chemName("name") + \
        chemFormula("formula")

# search through input data, and print out retrieved data
for chemData in entry.searchString(data):
    print "%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData
    # or print each field by itself
    # print chemData.EINECS
    # print chemData.CAS
    # print chemData.name
    # print chemData.formula
    # print


prints:
200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH
 
Peter Otten

patrick.waldo said:
manipulation? Also, I conceptually get it, but would you mind walking
me through

itertools.groupby() splits a sequence into groups with the same key; e. g.
to group names by their first letter you'd do the following:

>>> def first_letter(s): return s[:1]
...
>>> for key, group in groupby(["Anne", "Andrew", "Bill", "Brett", "Alex"], first_letter):
...     print "--- %s ---" % key
...     for item in group:
...         print item
...
--- A ---
Anne
Andrew
--- B ---
Bill
Brett
--- A ---
Alex

Note that there are two groups with the same initial; groupby() considers
only consecutive items in the sequence for the same group.

In your case the sequence is the lines in the file, converted to unicode
strings -- the key is a boolean indicating whether the line consists
entirely of whitespace or not,

>>> u"alpha\n".isspace()
False

but I call it slightly differently, as an unbound method:

>>> unicode.isspace(u"alpha\n")
False

This is only possible because all items in the sequence are known to be
unicode instances. So far we have, using a list instead of a file:

>>> instream = [u"alpha\n", u"beta\n", u"\n", u"gamma\n", u"\n", u"\n", u"delta\n"]
>>> for key, group in groupby(instream, unicode.isspace):
...     print "--- %s ---" % key
...     for item in group:
...         print repr(item)
...
--- False ---
u'alpha\n'
u'beta\n'
--- True ---
u'\n'
--- False ---
u'gamma\n'
--- True ---
u'\n'
u'\n'
--- False ---
u'delta\n'

As you see, groups with real data alternate with groups that contain only
blank lines, and the key for the latter is True, so we can skip them with

if not key: # it's not a separator group
    yield group

As the final refinement we join all lines of the group into a single
string

>>> "".join([u"alpha\n", u"beta\n"])
u'alpha\nbeta\n'

and that's it.
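
Put together, and rewritten as a sketch in Python 3 syntax (where `unicode.isspace` becomes `str.isspace`), the generator can be tried on an in-memory list of lines. The sample lines are taken from the data posted earlier in the thread; using a list instead of an open file is an assumption made to keep the example self-contained.

```python
from itertools import groupby


def records(instream):
    """Yield each run of consecutive non-blank lines as one string."""
    for is_blank, group in groupby(instream, str.isspace):
        if not is_blank:  # skip the separator groups
            yield "".join(group)


def fields(record):
    """Split a record, assuming only the third field may contain spaces."""
    parts = record.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]


lines = [
    "200-720-7 69-93-2\n",
    "kyselina močová C5H4N4O3\n",
    "\n",
    "200-002-3\n",
    "50-01-1\n",
    "guanidínium-chlorid CH5N3.ClH\n",
]
result = ["|".join(fields(rec)) for rec in records(lines)]
for row in result:
    print(row)
```

Because each record is joined into one string before it is split, it does not matter how many physical lines a record was spread over, only that a blank line separates it from the next one.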

Peter
 
patrick.waldo

And now for something completely different...

I've been reading up a bit about Python and Excel and I quickly told
the program to output to Excel quite easily. However, what if the
input file were a Word document? I can't seem to find much
information about parsing Word files. What could I add to make the
same program work for a Word file?

Again thanks a lot.

And the Excel Add on...

import codecs
import re
from win32com.client import Dispatch

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

NR_RE = re.compile(r'^\d+-\d+-\d+$') #pattern for EINECS number

tokens = input.read().split()
def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Add()
c = 1

for element in iter_elements(tokens):
    xlApp.ActiveSheet.Cells(c,1).Value = element[0]
    xlApp.ActiveSheet.Cells(c,2).Value = element[1]
    xlApp.ActiveSheet.Cells(c,3).Value = element[2]
    xlApp.ActiveSheet.Cells(c,4).Value = element[3]
    c = c + 1

xlApp.ActiveWorkbook.Close(SaveChanges=1)
xlApp.Quit()
xlApp.Visible = 0
del xlApp

input.close()
output.close()
 
Tim Roberts

> And now for something completely different...
>
> I've been reading up a bit about Python and Excel and I quickly told
> the program to output to Excel quite easily. However, what if the
> input file were a Word document? I can't seem to find much
> information about parsing Word files. What could I add to make the
> same program work for a Word file?

Word files are not human-readable. You parse them using
Dispatch("Word.Application"), just the way you wrote the Excel file.

I believe there are some third-party modules that will read a Word file a
little more directly.
 
