regular expression to extract text

Mark Light

Hi, I have a file read in as a string that looks like the one below. What I
want to do is pull out the bits of information to eventually put in an HTML
table. For the first example the three bits are:
1. QEXZUO
2. C26 H31 N1 O3
3. 6.164 15.892 22.551 90.00 90.00 90.00

Any ideas on the best way to do this? I was trying regular expressions but
not getting very far.

Thanks,

Mark.






"""
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
 
Peter Hansen

I don't think you've given enough information here. Are those
"bits" supposed to be kept intact, complete with internal spacing,
or are you doing more manipulation of them? What is the definition
of the "bits"? Specifically, is bit 1 "the first non-space token
after a line of hyphens"? Is bit 2 "everything on the line after
bit 1, with leading and trailing spaces stripped"? Is bit 3
"everything on the following line, with leading/trailing spaces
stripped"?

Those definitions roughly fit what you describe, and if that's
all you need, the solution should be pretty trivial, without
having to use regular expressions which would be overkill in this
case.
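
A minimal sketch of those three definitions using only string methods, assuming every record is a line of hyphens followed by exactly two data lines. The function name extract_bits is made up for illustration, and the sample is a shortened copy of the data in this thread.

```python
def extract_bits(text):
    """Return a list of (bit1, bit2, bit3) tuples, one per record.

    Assumes each record is a line of hyphens followed by two data lines.
    """
    lines = text.splitlines()
    records = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        # A separator is a non-empty line made up only of hyphens.
        if stripped and set(stripped) == {"-"} and i + 2 < len(lines):
            parts = lines[i + 1].strip().split(None, 1)
            if not parts:
                continue  # blank line after the separator: skip it
            # bit 1: first token; bit 2: rest of that line, stripped.
            bit1 = parts[0]
            bit2 = parts[1] if len(parts) > 1 else ""
            # bit 3: the whole following line, stripped.
            bit3 = lines[i + 2].strip()
            records.append((bit1, bit2, bit3))
    return records


sample = """\
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
"""

for bits in extract_bits(sample):
    print(bits)
```

Returning tuples rather than printing keeps the parsing separate from whatever is done with the bits afterwards.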
 
Fredrik Lundh

here's one way to do it:

data = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

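from io import StringIO

file = StringIO(data)

for line in file:
    # a line of hyphens introduces each record
    if line.startswith("---"):
        # next line: the refcode, then the sum formula
        part1, part2 = file.readline().strip().split(None, 1)
        # the line after that: the cell parameters
        part3 = file.readline().strip()
        print("1.", part1)
        print("2.", part2)
        print("3.", part3)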

</F>
 
Mark Light

Peter Hansen said:
I don't think you've given enough information here. Are those
"bits" supposed to be kept intact, complete with internal spacing,
or are you doing more manipulation of them? What is the definition
of the "bits"? Specifically, is bit 1 "the first non-space token
after a line of hyphens"? Is bit 2 "everything on the line after
bit 1, with leading and trailing spaces stripped"? Is bit 3
"everything on the following line, with leading/trailing spaces
stripped"?

Those definitions roughly fit what you describe, and if that's
all you need, the solution should be pretty trivial, without
having to use regular expressions which would be overkill in this
case.


Sorry for being inexact - the definitions you proposed do fit the bill.

Mark.
 
Roel Mathys

Although I hold no grudge against regexes, I overused them myself in
the past (and I'm a bit rusty with them now). Nowadays I prefer to use
them less and less.

bye,
rm

ps: I don't know what the purpose really was, but I gave it a little
shot anyway.

------------------------------------------------------------------------

text = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

result = {}
refcode = None
started = False
for line in text.split('\n'):
    # a line of hyphens marks the start of a record
    if not started \
       and line == '------------------------------------------':
        started = True
        continue
    if started:
        if refcode is None:
            # first line of the record: refcode, then the sum formula
            fields = line.split()
            refcode = fields[0]
            sumformula = fields[1:]
        else:
            # second line of the record: the cell parameters
            cellparams = list(map(float, line.split()))
            # assuming refcode is unique
            result[refcode] = {'sumformula': sumformula,
                               'cellparams': cellparams}
            refcode = None
            started = False

from pprint import pprint

pprint( result )
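
Since the stated goal was an HTML table, here is a minimal sketch of that last step, assuming each record has already been reduced to three strings. The function name to_html_table and the column headings are my own choices (the first two headings are borrowed from the "Refcode Sumformula" header line in the data), so adjust them to taste.

```python
from html import escape


def to_html_table(records):
    """Build an HTML table from (refcode, formula, cell) string tuples."""
    rows = ["<table>",
            "<tr><th>Refcode</th><th>Sumformula</th>"
            "<th>Cell parameters</th></tr>"]
    for refcode, formula, cell in records:
        # escape() guards against stray <, > or & in the input
        cells = "".join("<td>%s</td>" % escape(value)
                        for value in (refcode, formula, cell))
        rows.append("<tr>%s</tr>" % cells)
    rows.append("</table>")
    return "\n".join(rows)


# Sample rows taken from the data in this thread.
records = [
    ("QEXZUO", "C26 H31 N1 O3", "6.164 15.892 22.551 90.00 90.00 90.00"),
    ("NHDIIS", "C45 H40 Cl2", "6.532 15.147 22.453 90.00 90.00 90.00"),
]
print(to_html_table(records))
```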
 
Lonnie Princehouse

One of the beautiful things about Python is that,
while there is usually one obvious and reasonable
way to do something, there are many many ridiculous
ways to do it as well. This is especially true when
regular expressions are involved.

I'd do it like this: (Note that this wants the whole file as
one string, so use read() instead of readline())


data = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

import re

# raw strings keep the backslashes intact for the regex engine
r1 = re.compile(r'\-+\n([A-Z]+)(.*?)(?:\-|$)', re.DOTALL)  # refcode + body
r2 = re.compile(r'([A-Z]+\d+)', re.I)                      # e.g. C26, Cl2
r3 = re.compile(r'(\d+\.\d+)')                             # e.g. 6.164

results = dict([(name, {
    'isotopes': r2.findall(body),
    'values': [float(x) for x in r3.findall(body)]
}) for name, body in r1.findall(data)])



I assume that you want the numbers as floats instead of strings;
if you're just going to print them out again, don't call float().

I also assume (perhaps wrongly) that the order of entries isn't
important. Don't do the dict() conversion if that assumption's wrong.

This yields:

{'ARQTYD': {'isotopes': ['C19', 'H23', 'N1', 'O5'],
            'values': [6.0010000000000003,
                       15.227,
                       22.558,
                       90.0,
                       90.0,
                       90.0]},
 'NHDIIS': {'isotopes': ['C45', 'H40', 'Cl2'],
            'values': [6.532,
                       15.147,
                       22.452999999999999,
                       90.0,
                       90.0,
                       90.0]},
 'QEXZUO': {'isotopes': ['C26', 'H31', 'N1', 'O3'],
            'values': [6.1639999999999997,
                       15.891999999999999,
                       22.550999999999998,
                       90.0,
                       90.0,
                       90.0]}}
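
If the order of entries does matter, or the numbers should stay as strings, a variant along these lines might do. It is only a sketch: it reuses the three patterns from above, skips both the dict() conversion and the float() call, and runs on a shortened copy of the sample data. The name ordered is made up for illustration.

```python
import re

data = """\
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
"""

# Same three patterns as above, written as raw strings.
r1 = re.compile(r'\-+\n([A-Z]+)(.*?)(?:\-|$)', re.DOTALL)
r2 = re.compile(r'([A-Z]+\d+)', re.I)
r3 = re.compile(r'(\d+\.\d+)')

# A list of (name, info) pairs keeps the entries in file order;
# the numbers stay as strings because float() is never called.
ordered = [(name, {'isotopes': r2.findall(body),
                   'values': r3.findall(body)})
           for name, body in r1.findall(data)]

for name, info in ordered:
    print(name, info['isotopes'], info['values'])
```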
 
