regular expression to extract text

Mark Light · Nov 20, 2003

Hi, I have a file read in as a string that looks like the text below. What I
want to do is pull out the bits of information to eventually put in an HTML table.
For the first entry the three bits are:
1. QEXZUO
2. C26 H31 N1 O3
3. 6.164 15.892 22.551 90.00 90.00 90.00

Any ideas on the best way to do this? I was trying regular expressions but
not getting very far.

Thanks,

Mark.

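"""
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """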

Peter Hansen · Nov 20, 2003

I don't think you've given enough information here. Are those
"bits" supposed to be kept intact, complete with internal spacing,
or are you doing more manipulation of them? What is the definition
of the "bits"? Specifically, is bit 1 "the first non-space token
after a line of hyphens"? Is bit 2 "everything on the line after
bit 1, with leading and trailing spaces stripped"? Is bit 3
"everything on the following line, with leading/trailing spaces
stripped"?

Those definitions roughly fit what you describe, and if that's
all you need, the solution should be pretty trivial, without
having to use regular expressions which would be overkill in this
case.
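
Taking those three definitions at face value, a minimal sketch with plain string methods might look like the following. The function name `extract_entries` and the shortened sample string are illustrative only, and the code assumes every line of hyphens is followed by exactly two data lines.

```python
def extract_entries(text):
    """Return a list of (refcode, sumformula, cell_parameters) string tuples."""
    lines = text.splitlines()
    entries = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        # A line consisting only of hyphens marks the start of an entry.
        if stripped and set(stripped) == {"-"} and i + 2 < len(lines):
            # Bit 1: first non-space token; bit 2: the rest of that line.
            refcode, sumformula = lines[i + 1].strip().split(None, 1)
            # Bit 3: the whole following line, stripped.
            cell = lines[i + 2].strip()
            entries.append((refcode, sumformula, cell))
    return entries


# Shortened sample in the same layout as the posted data (assumption).
sample = """Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00 """

for refcode, sumformula, cell in extract_entries(sample):
    print(refcode, "|", sumformula, "|", cell)
```

The bits are kept as strings with their internal spacing intact; any further splitting can be done later, once it is clear what the table needs.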

Fredrik Lundh · Nov 20, 2003

here's one way to do it:

data = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

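from io import StringIO  # Python 3; the Python 2 original used "from StringIO import StringIO"

file = StringIO(data)

for line in file:
    # A line of hyphens introduces each two-line record.
    if line.startswith("---"):
        # First record line: the refcode, then the sum formula.
        part1, part2 = file.readline().strip().split(None, 1)
        # Second record line: the cell parameters.
        part3 = file.readline().strip()
        print("1.", part1)
        print("2.", part2)
        print("3.", part3)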

</F>

Mark Light · Nov 20, 2003

Sorry for being inexact - the definitions you proposed do fit the bill.

Mark.
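
With the three bits per entry agreed on, turning them into the HTML table mentioned in the original question is a small extra step. The sketch below assumes the entries are already available as (refcode, sumformula, cell parameters) string tuples; the function name `to_html_table` and the "Cell parameters" heading are made up for illustration.

```python
from html import escape


def to_html_table(entries):
    """Render (refcode, sumformula, cell_parameters) tuples as an HTML table."""
    rows = [
        "<table>",
        "<tr><th>Refcode</th><th>Sumformula</th><th>Cell parameters</th></tr>",
    ]
    for entry in entries:
        # Escape each field so a stray '<' or '&' cannot break the markup.
        cells = "".join("<td>%s</td>" % escape(field) for field in entry)
        rows.append("<tr>%s</tr>" % cells)
    rows.append("</table>")
    return "\n".join(rows)


entries = [
    ("QEXZUO", "C26 H31 N1 O3", "6.164 15.892 22.551 90.00 90.00 90.00"),
    ("ARQTYD", "C19 H23 N1 O5", "6.001 15.227 22.558 90.00 90.00 90.00"),
]
print(to_html_table(entries))
```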

Roel Mathys · Nov 20, 2003

Although I hold no grudge against regexes, I've overused them myself in
the past (it's a bit rusty). But nowadays I prefer to use them less and
less.

bye,
rm

ps: I don't know what the purpose really was, but I gave it a little
shot anyway.

------------------------------------------------------------------------

text = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

from pprint import pprint

SEPARATOR = '------------------------------------------'
result = {}
refcode = None
started = False
for line in text.split('\n'):
    # A separator line opens a two-line record.
    if not started and line == SEPARATOR:
        started = True
        continue
    if started:
        if refcode is None:
            # First record line: refcode followed by the sum formula tokens.
            fields = line.split()
            refcode = fields[0]
            sumformula = fields[1:]
        else:
            # Second record line: the cell parameters as floats.
            cellparams = [float(x) for x in line.split()]
            # assuming refcode is unique
            result[refcode] = {'sumformula': sumformula,
                               'cellparams': cellparams}
            refcode = None
            started = False

pprint(result)

Lonnie Princehouse · Nov 20, 2003

One of the beautiful things about Python is that,
while there is usually one obvious and reasonable
way to do something, there are many many ridiculous
ways to do it as well. This is especially true when
regular expressions are involved.

I'd do it like this: (Note that this wants the whole file as
one string, so use read() instead of readline())

data = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

import re

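# Raw strings keep the backslashes intact for the regex engine.
r1 = re.compile(r'-+\n([A-Z]+)(.*?)(?:-|$)', re.DOTALL)  # one entry: refcode, then body
r2 = re.compile(r'([A-Z]+\d+)', re.I)                    # element counts such as C26 or Cl2
r3 = re.compile(r'(\d+\.\d+)')                           # decimal numbers

results = dict([(name, {
    'isotopes': r2.findall(body),
    'values': [float(x) for x in r3.findall(body)],
}) for name, body in r1.findall(data)])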

This assumes that you want the numbers as floats instead of strings;
if you're just going to print them out again, don't call float().

I also assume (perhaps wrongly) that the order of entries isn't
important. Don't do the dict() conversion if that assumption's wrong.

This yields:

{'ARQTYD': {'isotopes': ['C19', 'H23', 'N1', 'O5'],
            'values': [6.0010000000000003,
                       15.227,
                       22.558,
                       90.0,
                       90.0,
                       90.0]},
 'NHDIIS': {'isotopes': ['C45', 'H40', 'Cl2'],
            'values': [6.532,
                       15.147,
                       22.452999999999999,
                       90.0,
                       90.0,
                       90.0]},
 'QEXZUO': {'isotopes': ['C26', 'H31', 'N1', 'O3'],
            'values': [6.1639999999999997,
                       15.891999999999999,
                       22.550999999999998,
                       90.0,
                       90.0,
                       90.0]}}
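
If the order of entries matters, or the numbers are only going to be printed again, a variation on the same three patterns can return a list of tuples and leave the values as strings. This is a sketch rather than part of the original reply: the names `parse_in_order`, `ENTRY`, `ATOM` and `NUMBER` are hypothetical, and the sample string is a shortened copy of the posted data.

```python
import re

ENTRY = re.compile(r'-+\n([A-Z]+)(.*?)(?:-|$)', re.DOTALL)  # refcode, then body
ATOM = re.compile(r'([A-Z]+\d+)', re.I)                     # e.g. C26, Cl2
NUMBER = re.compile(r'(\d+\.\d+)')                          # e.g. 6.164


def parse_in_order(data):
    """Return [(refcode, atoms, values)] in file order, values kept as strings."""
    return [(name, ATOM.findall(body), NUMBER.findall(body))
            for name, body in ENTRY.findall(data)]


data = """Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """

for name, atoms, values in parse_in_order(data):
    print(name, atoms, values)
```

Keeping the values as strings also preserves the trailing zeros ("90.00"), which float() would drop.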
