Regular Expression

patrick.waldo

Hi,

I'm trying to learn regular expressions, but I am having trouble with
this. I want to search a document that has mixed data; however, the
last line of every entry has something like C5H4N4O3 or CH5N3.ClH.
All of the letters are upper case and there will always be numbers and
possibly one .

However, the code below only gave me None.

import os, codecs, re

text = 'C:\\text_samples\\sample.txt'
text = codecs.open(text,'r','utf-8')

test = re.compile('\u+\d+\.')

for line in text:
    print test.search(line)
 
Marc 'BlackJack' Rintsch

test = re.compile('\u+\d+\.')

There is no '\u'. 'u' doesn't have a special meaning, so the '\' is
pointless. Your expression matches one or more lowercase 'u's followed by
one or more digits followed by a period. Examples are 'u1.', 'uuuuuuuu42.',
etc.

An expression that matches your first example would be: r'([A-Z]|\d|\.)+'.
That's a non-empty sequence of upper case letters, digits and periods. To
limit this to just one optional period the expression gets a little
longer: r'([A-Z]|\d)+\.?([A-Z]|\d)+'

This does not fully match your second example because there is a lower case
letter in it.

Ciao,
Marc 'BlackJack' Rintsch
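
As a rough illustration of the points in this reply, here is a small sketch. It is written for Python 3, which is an assumption, since the thread itself uses Python 2. Python 3 rejects '\u' in an ordinary string literal, so the first pattern is spelled r'u+\d+\.', which is what the original expression effectively meant. The names effective and strict are made up for this example.

```python
import re

# What the original pattern effectively meant: literal 'u's, digits, a period.
effective = re.compile(r'u+\d+\.')
# The second expression from the reply, written with non-capturing groups.
strict = re.compile(r'(?:[A-Z]|\d)+\.?(?:[A-Z]|\d)+')

for sample in ['C5H4N4O3', 'CH5N3.ClH', 'uuuuuuuu42.']:
    # search() looks for a match anywhere in the string;
    # fullmatch() requires the whole string to match.
    print(sample, bool(effective.search(sample)), bool(strict.fullmatch(sample)))
```

Only the last sample satisfies the 'u' pattern, and only the first sample is matched in full by the stricter expression.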
 
Shawn Milochik

I need a little more info. How can you know whether you're matching
the text you're going for, and not other data which looks similar? Do
you have a specific field length? Is it guaranteed to contain a digit?
Is it required to start with a letter? Does it always start with 'C'?
You need to have those kinds of rules in mind to write your regex.

Shawn
 
Paul McGuire

If those are chemical symbols, then I guarantee that there will be
lower case letters in the expression (like the "l" in "ClH").

-- Paul
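
One way to allow for lower case letters is to describe a single element symbol and repeat it. The following is only a sketch in Python 3 (an assumption; the thread uses Python 2), and the names ELEMENT, FORMULA_SKETCH and looks_like_formula are hypothetical. It checks the shape of a token, not whether the symbols are real elements, and it does not handle fractions such as "1/2Cu".

```python
import re

# One element symbol: an upper-case letter, an optional lower-case letter,
# and an optional count. Real element names are not validated.
ELEMENT = r'[A-Z][a-z]?\d*'
# One or more symbols, optionally followed by more period-separated parts.
FORMULA_SKETCH = re.compile(r'(?:%s)+(?:\.(?:%s)+)*' % (ELEMENT, ELEMENT))

def looks_like_formula(token):
    """Return True if the whole token has the shape of a chemical formula."""
    return FORMULA_SKETCH.fullmatch(token) is not None

for token in ['C5H4N4O3', 'CH5N3.ClH', 'C.I.', '75240.']:
    print(token, looks_like_formula(token))
```

Short capitalized tokens such as "I" or "No" would also pass this check, so it is best combined with other rules about where the formula appears in an entry.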
 
patrick.waldo

This is related to my last post (see:
http://groups.google.com/group/comp...bbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88)

I have a text file with an EINECS number, a CAS number, a Chemical
Name, and a Chemical Formula, always in this order. However, I
realized as I ran my script that I had entries like

274-989-4
70892-58-9
diazotovaná kyselina 4-
aminobenzénsulfónová, kopulovaná s
farbiarskym morušovým (Chlorophora
tinctoria) extraktom, komplexy so
železom
komplexy železa s produktami
kopulácie diazotovanej kyseliny 4-
aminobenzénsulfónovej s látkou
registrovanou v Indexe farieb pod
identifikačným číslom Indexu farieb,
C.I. 75240.

which become

274-989-4|70892-58-9|diazotovaná kyselina 4- aminobenzénsulfónová,
kopulovaná s farbiarskym morušovým (Chlorophora tinctoria) extraktom,
komplexy so železom komplexy železa s produktami kopulácie
diazotovanej kyseliny 4- aminobenzénsulfónovej s látkou registrovanou
v Indexe farieb pod identifikačným číslom Indexu farieb, C.I.|75240.

The C.I. 75240 is not a chemical formula; this entry doesn't have one.
So I want to add a regular expression to use in an if statement that
says: if there is no chemical formula, move on. However, I must be
getting confused by the regular expression tutorials I've been reading.

Any ideas?

Original Code:

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS,
#Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text_samples\\text"            #folder with all text files
path2 = "C:\\text_samples\\text\\output"   #output of all text files

NR_RE = re.compile(r'^\d+-\d+-\d+$')       #pattern for EINECS number

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        #print '|'.join(element)
        output.write('|'.join(element))
        output.write("\r\n")

    input.close()
    output.close()
 
patrick.waldo

Marc, thank you for the example; it made me realize where I was getting
things wrong. I didn't realize how specific I needed to be. Also,
http://weitz.de/regex-coach/ really helped me test things out on this
one. I realized I had some more exceptions like C18H34O2.1/2Cu, and I
also realized I didn't really understand regular expressions (which I
still don't, but I think it's getting better).

FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

This gets all chemical formulas like C14H28, C18H34O2.1/2Cu and
C8H17ClO2, i.e. a word that begins with a capital letter, followed by
one or more upper or lower case letters and numbers, followed by an
optional ., followed by one or more upper or lower case letters and
numbers, followed by an optional /, followed by one or more upper or
lower case letters and numbers. Say that five times fast!
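
A quick way to see what this pattern accepts is to run it over a few tokens. The sketch below uses Python 3 (an assumption; the thread uses Python 2) and a token list chosen only for illustration. Two things are worth noticing: each + needs at least one character, so a token shorter than four characters, such as CH4, cannot match; and because the character classes accept any letters, an ordinary capitalized word such as Chlorophora also passes.

```python
import re

FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

tokens = ['C14H28', 'C18H34O2.1/2Cu', 'C8H17ClO2', 'CH4', 'Chlorophora', '75240.']
for token in tokens:
    # match() anchors at the start of the token but not at the end.
    print(token, bool(FORMULA.match(token)))
```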

So now I want to tell the program: if it finds a formula at the end,
continue; otherwise, if it finds C.I. 75240 or any other kind of word,
that word should not be split off by a | but lumped into the rest of
the line. But now I get:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Documents and Settings\Patrick Waldo\My Documents\Python\WORD\try5-2-file-1-1.py", line 32, in ?
    input = codecs.open(input_text, 'r','utf8')
  File "C:\Python24\lib\codecs.py", line 666, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 13] Permission denied: 'C:\\Documents and Settings\\Patrick Waldo\\Desktop\\decernis\\DAD\\EINECS_SK\\text\\output'

Ideas?


#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS,
#Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text"
path2 = "C:\\text\\output"
EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if EINECS.match(tok) and len(product) >= 4:
            if product[-1] == FORMULA.findall(tok):
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            else:
                product[2:-1] = [' '.join(product[2:])]
                yield product
                product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        output.write('|'.join(element))
        output.write("\r\n")

    input.close()
    output.close()
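
One likely cause of the IOError in the traceback above: the output folder sits inside the input folder, so os.listdir() returns 'output' as one of its entries, and codecs.open() is then asked to open a directory, which Windows reports as "Permission denied". A minimal sketch that skips anything that is not a regular file, written for Python 3 (an assumption) with a hypothetical helper name list_input_files:

```python
import os

def list_input_files(folder):
    """Return the names of regular files in folder, skipping subdirectories."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
```

The main loop could then iterate over list_input_files(path) instead of os.listdir(path). Keeping the output folder outside the input folder would avoid the problem as well.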
 
patrick.waldo

Finally I solved the problem, with some really minor things to tweak.
I guess it's true that I had two problems working with regular
expressions.

Thank you all for your help. I really learned a lot on quite a
difficult problem.

Final Code:

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS,
#Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text_samples\\text\\"
path2 = "C:\\text_samples\\text\\output\\"
EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
CAS = re.compile(r'^\d*-\d\d-\d$')
FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')


def iter_elements(tokens):
    product = []
    for tok in tokens:
        if EINECS.match(tok) and len(product) >= 4:
            match = re.match(FORMULA,product[-1])
            if match:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            else:
                product[2:-1] = [' '.join(product[2:])]
                del product[-1]
                yield product
                product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        output.write('|'.join(element))
        output.write("\r\n")
    input.close()
    output.close()
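
For anyone who wants to try the grouping logic without any files, here is a self-contained sketch in Python 3 (an assumption; the code above is Python 2). It is a reworking under stated assumptions, not the code above verbatim: it uses fullmatch() so the whole last token has to look like a formula, the helper name finish is hypothetical, and the final record is collapsed too, whereas the code above yields it as-is. The second record in the token list is a made-up placeholder.

```python
import re

EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
FORMULA = re.compile(r'[A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+')

def finish(product):
    """Collapse one record into EINECS, CAS, name and an optional formula."""
    if FORMULA.fullmatch(product[-1]):
        # The last token looks like a formula: keep it as its own field.
        return product[:2] + [' '.join(product[2:-1]), product[-1]]
    # No formula: the last token is part of the chemical name.
    return product[:2] + [' '.join(product[2:])]

def iter_elements(tokens):
    product = []
    for tok in tokens:
        # A new EINECS number closes the record collected so far.
        if EINECS.match(tok) and len(product) >= 4:
            yield finish(product)
            product = []
        product.append(tok)
    if product:
        yield finish(product)

tokens = ['274-989-4', '70892-58-9', 'Indexu', 'farieb,', 'C.I.', '75240.',
          '111-111-1', '11-11-1', 'example', 'name', 'C14H28']
for row in iter_elements(tokens):
    print('|'.join(row))
```

The first record has no formula, so it comes out with three fields; the placeholder record comes out with four.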
 
