A nice way to use regex for complicated parsing

aspineux · Mar 29, 2007

My goal is to write a parser for these made-up strings from the SMTP
protocol, following RFC 821 and RFC 1869.
I am a little flexible with the BNF from these RFCs.

Any comments?

tests = [
    'MAIL FROM:<[email protected]>',
    'MAIL FROM:[email protected]',
    'MAIL FROM:<[email protected]> SIZE=1234 [email protected]',
    'MAIL FROM:[email protected] SIZE=1234 [email protected]',
    'MAIL FROM:<"(e-mail address removed)> legal=email"@address.com>',
    'MAIL FROM:"(e-mail address removed)> legal=email"@address.com',
    'MAIL FROM:<"(e-mail address removed)> legal=email"@address.com> SIZE=1234 [email protected]',
    'MAIL FROM:"(e-mail address removed)> legal=email"@address.com SIZE=1234 [email protected]',
]

import re

def RN(name, regex):
    """Protect a regex with () and optionally give the group a name."""
    if name:
        return r'(?P<%s>%s)' % (name, regex)
    else:
        return r'(?:%s)' % regex

# Each entry is built by interpolating the entries defined before it.
regex = {}

# <dotnum> ::= <snum> "." <snum> "." <snum> "." <snum>
regex['dotnum'] = RN(None, r'[012]?\d?\d\.[012]?\d?\d\.[012]?\d?\d\.[012]?\d?\d')
# <dot-string> ::= <string> | <string> "." <dot-string>
regex['dot_string'] = RN(None, r'[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*')
# <domain> ::= <element> | <element> "." <domain>
regex['domain'] = RN('domain', r'%(dotnum)s|%(dot_string)s' % regex)
# <q> ::= any one of the 128 ASCII characters except <CR>, <LF>, quote ("),
#         or backslash (\)
regex['q'] = RN(None, r'[^\n\r"\\]')
# <x> ::= any one of the 128 ASCII characters (no exceptions)
regex['x'] = RN(None, r'.')
# <qtext> ::= "\" <x> | "\" <x> <qtext> | <q> | <q> <qtext>
regex['qtext'] = RN(None, r'(?:\\%(x)s|%(q)s)+' % regex)
# <quoted-string> ::= """ <qtext> """
regex['quoted_string'] = RN('quoted_string', r'"%(qtext)s"' % regex)
# <local-part> ::= <dot-string> | <quoted-string>
regex['local_part'] = RN('local_part',
                         r'%(quoted_string)s|%(dot_string)s' % regex)
# <mailbox> ::= <local-part> "@" <domain>
regex['mailbox'] = RN('mailbox', r'%(local_part)s@%(domain)s' % regex)
# <path> ::= "<" [ <a-d-l> ":" ] <mailbox> ">"
# also accept an address without <>: the conditional (?(path_lt)>) requires
# the closing ">" only when the opening "<" was matched
regex['path'] = RN('path',
                   r'(?P<path_lt><)?%(mailbox)s(?(path_lt)>)' % regex)
# esmtp-keyword ::= (ALPHA / DIGIT) *(ALPHA / DIGIT / "-")
regex['esmtp_keyword'] = RN(None, r'[a-zA-Z0-9][-a-zA-Z0-9]*')
# esmtp-value ::= 1*<any CHAR excluding "=", SP, and all
#                 control characters (US ASCII 0-31 inclusive)>
#                 ; syntax and values depend on esmtp-keyword
regex['esmtp_value'] = RN(None, r'[^= \t\r\n\f\v]*')
# esmtp-parameter ::= esmtp-keyword ["=" esmtp-value]
regex['esmtp_parameter'] = RN(None,
                              r'%(esmtp_keyword)s(?:=%(esmtp_value)s)?' % regex)
# esmtp-parameters ::= esmtp-parameter *(SP esmtp-parameter)
# ("*" rather than "+", so that a single parameter is captured too)
regex['esmtp_parameters'] = RN('esmtp_parameters',
                               r'%(esmtp_parameter)s(?:\s+%(esmtp_parameter)s)*' % regex)
# esmtp-cmd ::= inner-esmtp-cmd [SP esmtp-parameters] CR LF
regex['esmtp_addr'] = RN('esmtp_addr',
                         r'%(path)s(?:\s+%(esmtp_parameters)s)?' % regex)

for t in tests:
    # Strip the command keyword (compared case-insensitively) before matching.
    for keyword in ['MAIL FROM:', 'RCPT TO:']:
        keylen = len(keyword)
        if t[:keylen].upper() == keyword:
            t = t[keylen:]
            break

    match = re.match(regex['esmtp_addr'], t)
    if match:
        print('MATCH local_part=%(local_part)s domain=%(domain)s '
              'esmtp_parameters=%(esmtp_parameters)s' % match.groupdict())
    else:
        print('DONT match %s' % t)
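
The least obvious piece above is probably the conditional group used for <path>. Here is a minimal sketch of that one trick in isolation. The mailbox pattern is deliberately simplified, the name parse_path is made up for this example, and unlike the code above the pattern is anchored at the end with $ so that an unbalanced bracket is rejected.

```python
import re

# Optional "<" captured in a named group; the conditional (?(lt)>) then
# requires ">" only when that group took part in the match.
# The mailbox pattern is simplified for illustration (an assumption, not
# the RFC 821 grammar).
PATH = re.compile(r'(?P<lt><)?(?P<mailbox>[^<>@\s]+@[^<>@\s]+)(?(lt)>)$')

def parse_path(text):
    """Return the mailbox if text is a well-bracketed path, else None."""
    m = PATH.match(text)
    return m.group('mailbox') if m else None

print(parse_path('<user@example.com>'))   # brackets on both sides
print(parse_path('user@example.com'))     # no brackets at all
print(parse_path('<user@example.com'))    # opening bracket without a close
```

The first two calls return the bare mailbox and the third returns None, because the regex engine cannot satisfy the conditional once "<" has been consumed.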

Shane Geiger · Mar 29, 2007

It would be worth learning pyparsing to do this.

--
Shane Geiger
IT Director
National Council on Economic Education
(e-mail address removed) | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

Paul McGuire · Mar 29, 2007

It would be worth learning pyparsing to do this.

Thanks to Shane and Steven for the ref to pyparsing. I also was
struck by this post, thinking "this is pyparsing written in re's and
dicts".

The approach you are taking is *very* much like the thought process I
went through when first implementing pyparsing. I wanted to easily
compose expressions from other expressions. In your case, you are
string interpolating using a cumulative dict of prior expressions.
Pyparsing uses various subclasses of the ParserElement class, with
operator definitions for alternation ("|" takes the first alternative
that matches, "^" takes the longest one), composition ("+"), and
negation ("~"). Pyparsing also uses its own extended results
construct, ParseResults, which supports named results fields,
accessible using list indices, dict names, or attribute names.
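
To make the operator idea concrete without pulling in pyparsing, here is a toy sketch built on plain re. The class Rx and its methods are invented for this illustration; they are not pyparsing's API and say nothing about how pyparsing is implemented internally. The point is only that "+" and "|" can replace the string interpolation step.

```python
import re

class Rx(object):
    """Toy wrapper around a regex fragment (hypothetical, not pyparsing)."""

    def __init__(self, pattern):
        # Wrap in a non-capturing group so fragments compose safely.
        self.pattern = '(?:%s)' % pattern

    def __add__(self, other):
        # Composition: this fragment followed by the other one.
        return Rx(self.pattern + other.pattern)

    def __or__(self, other):
        # Alternation: re tries the left alternative first.
        return Rx(self.pattern + '|' + other.pattern)

    def named(self, name):
        # Attach a results name by turning the fragment into a named group.
        return Rx('(?P<%s>%s)' % (name, self.pattern))

    def parse(self, text):
        # Match the whole string and return the named fields, or None.
        m = re.match(self.pattern + r'\Z', text)
        return m.groupdict() if m else None

word = Rx(r'[a-zA-Z0-9]+')
dot_string = word + Rx(r'(?:\.[a-zA-Z0-9]+)*')
dotnum = Rx(r'\d{1,3}(?:\.\d{1,3}){3}')
mailbox = (dot_string.named('local_part') + Rx('@') +
           (dotnum | dot_string).named('domain'))

print(mailbox.parse('john.smith@example.com'))
print(mailbox.parse('john@192.168.0.1'))
```

Each named() call can be used only once per final expression here, since re does not allow two groups with the same name.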

Here is the pyparsing treatment of your example (I may not have gotten
every part correct, but my point is more the similarity of our
approaches). Note the access to the smtp parameters via the Dict
transformer.

-- Paul

from pyparsing import *

# <dotnum> ::= <snum> "." <snum> "." <snum> "." <snum>
intgr = Word(nums)
dotnum = Combine(intgr + "." + intgr + "." + intgr + "." + intgr)

# <dot-string> ::= <string> | <string> "." <dot-string>
# Note: delimitedList drops the "." delimiters by default, which is why the
# output below shows "addresscom"; delimitedList(string_, ".", combine=True)
# would keep them.
string_ = Word(alphanums)
dotstring = Combine(delimitedList(string_, "."))

# <domain> ::= <element> | <element> "." <domain>
domain = dotnum | dotstring

# <q> ::= any one of the 128 ASCII characters except <CR>, <LF>, quote ("),
#         or backslash (\)
# <x> ::= any one of the 128 ASCII characters (no exceptions)
# <qtext> ::= "\" <x> | "\" <x> <qtext> | <q> | <q> <qtext>
# <quoted-string> ::= """ <qtext> """
quotedString = dblQuotedString  # just use the pre-defined expr from pyparsing

# <local-part> ::= <dot-string> | <quoted-string>
localpart = (dotstring | quotedString).setResultsName("localpart")

# <mailbox> ::= <local-part> "@" <domain>
mailbox = Combine(localpart + "@" + domain).setResultsName("mailbox")

# <path> ::= "<" [ <a-d-l> ":" ] <mailbox> ">"
# also accept an address without <>
path = ("<" + mailbox + ">") | mailbox

# esmtp-keyword ::= (ALPHA / DIGIT) *(ALPHA / DIGIT / "-")
esmtpkeyword = Word(alphanums, alphanums + "-")

# esmtp-value ::= 1*<any CHAR excluding "=", SP, and all
#                 control characters (US ASCII 0-31 inclusive)>
#                 ; syntax and values depend on esmtp-keyword
esmtpvalue = Regex(r'[^= \t\r\n\f\v]*')

# esmtp-parameter ::= esmtp-keyword ["=" esmtp-value]
# esmtp-parameters ::= esmtp-parameter *(SP esmtp-parameter)
esmtpparameters = Dict(
    ZeroOrMore(Group(esmtpkeyword + Suppress("=") + esmtpvalue)))

# esmtp-cmd ::= inner-esmtp-cmd [SP esmtp-parameters] CR LF
esmtp_addr = path + \
    Optional(esmtpparameters, default=[]).setResultsName("parameters")

# tests is the list of strings from the original post
for t in tests:
    # Strip the command keyword (compared case-insensitively) before parsing.
    for keyword in ['MAIL FROM:', 'RCPT TO:']:
        keylen = len(keyword)
        if t[:keylen].upper() == keyword:
            t = t[keylen:]
            break

    try:
        match = esmtp_addr.parseString(t)
        print('MATCH')
        print(match.dump())
        # some sample code to access elements of the parameters "dict"
        if "SIZE" in match.parameters:
            print('SIZE is %s' % match.parameters.SIZE)
        print('')
    except ParseException:
        print('DONT match %s' % t)

prints:
MATCH
['<', ['johnsmith@addresscom'], '>']
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: []

MATCH
[['johnsmith@addresscom']]
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: []

MATCH
['<', ['johnsmith@addresscom'], '>', ['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
- OTHER: (e-mail address removed)
- SIZE: 1234
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: [['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
  - OTHER: (e-mail address removed)
  - SIZE: 1234
SIZE is 1234

MATCH
[['johnsmith@addresscom'], ['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
- OTHER: (e-mail address removed)
- SIZE: 1234
- mailbox: ['johnsmith@addresscom']
  - localpart: johnsmith
- parameters: [['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
  - OTHER: (e-mail address removed)
  - SIZE: 1234
SIZE is 1234

MATCH
['<', ['"(e-mail address removed)> legal=email"@addresscom'], '>']
- mailbox: ['"(e-mail address removed)> legal=email"@addresscom']
  - localpart: "(e-mail address removed)> legal=email"
- parameters: []

MATCH
[['"(e-mail address removed)> legal=email"@addresscom']]
- mailbox: ['"(e-mail address removed)> legal=email"@addresscom']
  - localpart: "(e-mail address removed)> legal=email"
- parameters: []

MATCH
['<', ['"(e-mail address removed)> legal=email"@addresscom'], '>', ['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
- OTHER: (e-mail address removed)
- SIZE: 1234
- mailbox: ['"(e-mail address removed)> legal=email"@addresscom']
  - localpart: "(e-mail address removed)> legal=email"
- parameters: [['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
  - OTHER: (e-mail address removed)
  - SIZE: 1234
SIZE is 1234

MATCH
[['"(e-mail address removed)> legal=email"@addresscom'], ['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
- OTHER: (e-mail address removed)
- SIZE: 1234
- mailbox: ['"(e-mail address removed)> legal=email"@addresscom']
  - localpart: "(e-mail address removed)> legal=email"
- parameters: [['SIZE', '1234'], ['OTHER', '(e-mail address removed)']]
  - OTHER: (e-mail address removed)
  - SIZE: 1234
SIZE is 1234

aspineux · Mar 30, 2007

Thanks to Shane and Steven for the ref to pyparsing. I also was
struck by this post, thinking "this is pyparsing written in re's and
dicts".

My first idea was: why learn a parsing library if I can do it using
're' and dicts?

Thanks !

Every parsing library I have used before was heavy to get started with.
The benefit was inversely proportional to the size of the project.
Yours looks lighter, and the results are easier to use.

Thanks for showing me your lib.

Anyway, for now I will stick with my approach for small parsing jobs.

Alain
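
One gap in the re-and-dicts version is that the esmtp_parameters group comes back as a single string, whereas the pyparsing version exposes each parameter by name. A minimal sketch of closing that gap with plain re follows. The helper name params_to_dict and the choice to upper-case the keywords are assumptions made for this example; the keyword and value patterns are the ones from the original post.

```python
import re

# Same keyword/value shapes as the esmtp_keyword and esmtp_value regexes
# in the original post.
PARAM = re.compile(r'([a-zA-Z0-9][-a-zA-Z0-9]*)(?:=([^= \t\r\n\f\v]*))?')

def params_to_dict(params):
    """Turn 'SIZE=1234 BODY=8BITMIME' into {'SIZE': '1234', ...}."""
    result = {}
    # params may be None when the optional group did not match.
    for token in (params or '').split():
        m = PARAM.match(token)
        # Keep only tokens that the pattern consumes completely.
        if m and m.end() == len(token):
            # Upper-casing the keyword is a choice made for this sketch.
            result[m.group(1).upper()] = m.group(2)
    return result

params = params_to_dict('SIZE=1234 body=8BITMIME NOTIFY')
if 'SIZE' in params:
    print('SIZE is %s' % params['SIZE'])
```

A keyword given without a value, such as NOTIFY here, is stored with None, and passing match.group('esmtp_parameters') straight in works even when that group is None.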
