Regular Expression

patrick.waldo · Oct 22, 2007

Hi,

I'm trying to learn regular expressions, but I am having trouble with
this. I want to search a document that has mixed data; however, the
last line of every entry has something like C5H4N4O3 or CH5N3.ClH.
All of the letters are upper case and there will always be numbers and
possibly one .

However, the code below only gave me None.

import os, codecs, re

text = 'C:\\text_samples\\sample.txt'
text = codecs.open(text,'r','utf-8')

test = re.compile('\u+\d+\.')

for line in text:
    print test.search(line)

Marc 'BlackJack' Rintsch · Oct 22, 2007

test = re.compile('\u+\d+\.')

There is no '\u'. 'u' doesn't have a special meaning so the '\' is
pointless. Your expression matches one or more small 'u's followed by one
or more digits followed by a period. Examples are 'u1.', 'uuuuuuuu42.',
etc.
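
To see why every line printed None, here is a minimal sketch in Python 3. In Python 3 the literal '\u+\d+\.' is no longer accepted as written, because \u now introduces a Unicode escape in string literals, so the sketch spells out the pattern described above, r'u+\d+\.', directly. The sample strings come from this thread; the names `samples` and `results` are only for illustration.

```python
import re

# What the original pattern effectively matched, per the explanation above:
# one or more 'u', then one or more digits, then a literal period.
test = re.compile(r'u+\d+\.')

samples = ['u1.', 'uuuuuuuu42.', 'C5H4N4O3', 'CH5N3.ClH']

# search() returns None when the pattern is not found anywhere in the string.
results = {s: test.search(s) is not None for s in samples}

for s in samples:
    print(s, results[s])
```

Neither formula contains a 'u', so search() has nothing to match on those lines.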

An expression that matches your first example would be: r'([A-Z]|\d|\.)+'.
That's a non-empty sequence of upper case letters, digits and periods. To
limit this to just one optional period the expression gets a little
longer: r'([A-Z]|\d)+\.?([A-Z]|\d)+'

Neither expression matches your second example in full, because there is a
lower case letter in it.
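
The two expressions above can be tried against both examples from the question. This is a small Python 3 sketch; the helper name `fully_matches` is made up here, and it uses fullmatch() so that a partial hit inside the string does not count.

```python
import re

# The two patterns from the reply, unchanged.
ANY_MIX = re.compile(r'([A-Z]|\d|\.)+')
ONE_PERIOD = re.compile(r'([A-Z]|\d)+\.?([A-Z]|\d)+')

def fully_matches(pattern, text):
    """True only if the whole string matches, not just a part of it."""
    return pattern.fullmatch(text) is not None

for sample in ('C5H4N4O3', 'CH5N3.ClH'):
    print(sample, fully_matches(ANY_MIX, sample), fully_matches(ONE_PERIOD, sample))
```

The first example passes both patterns; the second fails both because of the lower case "l" in "ClH".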

Ciao,
Marc 'BlackJack' Rintsch

Shawn Milochik · Oct 23, 2007

I need a little more info. How can you know whether you're matching
the text you're going for, and not other data which looks similar? Do
you have a specific field length? Is it guaranteed to contain a digit?
Is it required to start with a letter? Does it always start with 'C'?
You need to have those kinds of rules in mind to write your regex.

Shawn

Paul McGuire · Oct 23, 2007

If those are chemical symbols, then I guarantee that there will be
lower case letters in the expression (like the "l" in "ClH").
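
One way to allow for that, as a rough Python 3 sketch: treat an element symbol as one capital letter plus an optional lower case letter, followed by an optional count. The names `ELEMENT`, `FORMULA_SKETCH` and `looks_like_formula` are hypothetical and not from the thread, and the pattern only covers the two shapes shown in the question (with at most one period).

```python
import re

# Assumption: an element symbol is a capital letter, optionally followed by
# one lower case letter (e.g. "Cl"), optionally followed by a count.
ELEMENT = r'[A-Z][a-z]?\d*'

# One or more elements, then optionally a period and one or more elements.
FORMULA_SKETCH = re.compile(r'(?:%s)+(?:\.(?:%s)+)?' % (ELEMENT, ELEMENT))

def looks_like_formula(token):
    """True if the whole token has the shape of a simple chemical formula."""
    return FORMULA_SKETCH.fullmatch(token) is not None

for token in ['C5H4N4O3', 'CH5N3.ClH', '75240.', 'kyselina']:
    print(token, looks_like_formula(token))
```

This is only a shape check: a short capitalized word such as "So" has the same shape as an element symbol and would also pass.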

-- Paul

patrick.waldo · Oct 23, 2007

This is related to my last post (see:
http://groups.google.com/group/comp...bbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88)

I have a text file with an EINECS number, a CAS number, a Chemical
Name, and a Chemical Formula, always in this order. However, I
realized as I ran my script that I had entries like

274-989-4
70892-58-9
diazotovaná kyselina 4-
aminobenzénsulfónová, kopulovaná s
farbiarskym morušovým (Chlorophora
tinctoria) extraktom, komplexy so
železom
komplexy železa s produktami
kopulácie diazotovanej kyseliny 4-
aminobenzénsulfónovej s látkou
registrovanou v Indexe farieb pod
identifikačným číslom Indexu farieb,
C.I. 75240.

which become

274-989-4|70892-58-9|diazotovaná kyselina 4- aminobenzénsulfónová,
kopulovaná s farbiarskym morušovým (Chlorophora tinctoria) extraktom,
komplexy so železom komplexy železa s produktami kopulácie
diazotovanej kyseliny 4- aminobenzénsulfónovej s látkou registrovanou
v Indexe farieb pod identifikačným číslom Indexu farieb, C.I.|75240.

C.I. 75240 is not a chemical formula; this entry simply doesn't have
one. So I want to add a regular expression for the chemical formula and
use it in an if statement that moves on when an entry has no chemical
formula. However, I must be getting confused by the regular expression
tutorials I've been reading.

Any ideas?

Original Code:

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS,
#Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text_samples\\text"  #folder with all text files
path2 = "C:\\text_samples\\text\\output"  #output of all text files

NR_RE = re.compile(r'^\d+-\d+-\d+$')  #pattern for EINECS number

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        #print '|'.join(element)
        output.write('|'.join(element))
        output.write("\r\n")

    input.close()
    output.close()

patrick.waldo · Oct 25, 2007

Marc, thank you for the example; it made me realize where I was getting
things wrong. I didn't realize how specific I needed to be. Also,
http://weitz.de/regex-coach/ really helped me test things out on this
one. I realized I had some more exceptions like C18H34O2.1/2Cu, and I
also realized I didn't really understand regular expressions (which I
still don't, but I think it's getting better).

FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

This gets all chemical formulas like C14H28, C18H34O2.1/2Cu and
C8H17ClO2, i.e. a word that begins with a capital letter, followed by
one or more upper or lower case letters and digits, followed by an
optional ., followed by one or more upper or lower case letters and
digits, followed by an optional /, followed by one or more upper or
lower case letters and digits. Say that five times fast!
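
A quick way to check what this pattern accepts, sketched in Python 3. The helper name `starts_like_formula` is made up; it uses match(), which only anchors at the start of the token. The last two tokens are extra test cases: "Indexu" is a word from the sample entry earlier in the thread, and "CH4" is a short formula chosen here for illustration.

```python
import re

# The FORMULA pattern from the post, on one line.
FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

def starts_like_formula(token):
    """True if FORMULA matches at the start of the token (re.match semantics)."""
    return FORMULA.match(token) is not None

tokens = ['C14H28', 'C18H34O2.1/2Cu', 'C8H17ClO2', '75240.', 'Indexu', 'CH4']
for token in tokens:
    print(token, starts_like_formula(token))
```

The three formulas from the post match and "75240." does not, which is the distinction needed here. Two caveats: any capitalized word of four or more letters or digits (such as "Indexu") also matches, and the pattern needs at least four characters, so a short formula like "CH4" is rejected.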

So now I want to tell the program that if it finds the formula at the
end of an entry, it should carry on as before; otherwise, if the last
token is C.I. 75240. or any other kind of word, that token should not be
split off by a | but lumped into the rest of the name. But now I get:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Documents and Settings\Patrick Waldo\My Documents\Python\WORD\try5-2-file-1-1.py", line 32, in ?
    input = codecs.open(input_text, 'r','utf8')
  File "C:\Python24\lib\codecs.py", line 666, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 13] Permission denied: 'C:\\Documents and Settings\\Patrick Waldo\\Desktop\\decernis\\DAD\\EINECS_SK\\text\\output'

Ideas?
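
A likely cause, judging only by the path in the traceback: the file being opened is the output folder itself. Because the output folder sits inside the input folder, os.listdir() returns it along with the text files, and opening a directory as a file fails on Windows with Errno 13. A minimal Python 3 sketch of one way around it, skipping anything that is not a regular file (the helper name `iter_input_files` and the temporary folder layout are assumptions for the demonstration):

```python
import os
import tempfile

def iter_input_files(folder):
    """Yield (name, full_path) for regular files in folder, skipping subfolders."""
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if os.path.isfile(full_path):
            yield name, full_path

# Self-contained demonstration: one text file next to an "output" subfolder.
with tempfile.TemporaryDirectory() as folder:
    os.mkdir(os.path.join(folder, 'output'))
    with open(os.path.join(folder, 'sample.txt'), 'w', encoding='utf8') as handle:
        handle.write('274-989-4\n')
    found = [name for name, _ in iter_input_files(folder)]

print(found)
```

Moving the output folder outside the input folder would avoid the problem as well.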

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS,
#Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text"
path2 = "C:\\text\\output"
EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if EINECS.match(tok) and len(product) >= 4:
            if product[-1] == FORMULA.findall(tok):
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            else:
                product[2:-1] = [' '.join(product[2:])]
                yield product
                product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        output.write('|'.join(element))
        output.write("\r\n")

    input.close()
    output.close()

patrick.waldo · Oct 27, 2007

Finally I solved the problem, with some really minor things to tweak.
I guess it's true that I had two problems working with regular
expressions.

Thank you all for your help. I really learned a lot on quite a
difficult problem.

Final Code:

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS,
#Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text_samples\\text\\"
path2 = "C:\\text_samples\\text\\output\\"
EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
CAS = re.compile(r'^\d*-\d\d-\d$')
FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if EINECS.match(tok) and len(product) >= 4:
            match = re.match(FORMULA,product[-1])
            if match:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            else:
                product[2:-1] = [' '.join(product[2:])]
                del product[-1]
                yield product
                product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        output.write('|'.join(element))
        output.write("\r\n")
    input.close()
    output.close()
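
For anyone reading this later, here is a rough Python 3 sketch of the same grouping logic, run on an in-memory token list instead of files. It is not a drop-in replacement for the script above. The helper `finish` is made up for this sketch, the first entry in `tokens` is invented sample data, and the second is a shortened version of the entry quoted earlier in the thread. Unlike the final code, the sketch also collapses the last entry of the input.

```python
import re

EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

def finish(product):
    """Collapse one entry to [EINECS, CAS, name] plus the formula if present."""
    if FORMULA.match(product[-1]):
        # Last token looks like a formula: keep it as its own field.
        return product[:2] + [' '.join(product[2:-1]), product[-1]]
    # No formula: everything after the CAS number belongs to the name.
    return product[:2] + [' '.join(product[2:])]

def iter_elements(tokens):
    product = []
    for tok in tokens:
        # A new EINECS number starts the next entry.
        if EINECS.match(tok) and len(product) >= 4:
            yield finish(product)
            product = []
        product.append(tok)
    if product:
        yield finish(product)

tokens = [
    # Invented entry with a formula at the end.
    '000-000-0', '00-00-0', 'example', 'name', 'C14H28',
    # Shortened entry from the thread, without a formula.
    '274-989-4', '70892-58-9', 'komplexy', 'železa,', 'C.I.', '75240.',
]
lines = ['|'.join(element) for element in iter_elements(tokens)]
for line in lines:
    print(line)
```

The first entry keeps its formula as a fourth field, while the second ends after the name with "C.I. 75240." left inside it.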
