Phrogz
(I would post this to the Treetop mailing list, except there doesn't
seem to be one.)
Warming up to Treetop, at lunch today I ported the official RFC 2822
email address specification to it. (Because the grammar allows
recursive comments, it is notoriously impossible to match exactly
with regular expressions.)
C:\>irb
irb(main):001:0> require 'treetop'
irb(main):002:0> require 'email_address'
irb(main):003:0> @e = EmailAddressParser.new
irb(main):004:0> @e.parse( 'foo' )
=> nil
irb(main):005:0> @e.parse( 'foo@bar.com' ).pieces
=> ["foo", "bar.com"]
irb(main):006:0> @e.parse( '!@[69.46.18.236]' ).pieces
=> ["!", "[69.46.18.236]"]
irb(main):007:0> @e.parse( '"Gavin Kistner"@[69.46.18.236]' ).pieces
=> ["\"Gavin Kistner\"", "[69.46.18.236]"]
The only 'cheat' I did was to omit any of the branches of the grammar
that were marked as obsolete (and yet included?) in the spec.
I'm not 100% sure that this is working perfectly, however. From the
ABNF spec:
addr-spec = local-part "@" domain
domain = dot-atom / domain-literal / obs-domain
domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
CFWS = *([FWS] comment) (([FWS] comment) / FWS)
comment = "(" *([FWS] ccontent) [FWS] ")"
ccontent = ctext / quoted-pair / comment
ctext = NO-WS-CTL / ; Non white space controls
        %d33-39 /   ; The rest of the US-ASCII
        %d42-91 /   ; characters not including "(",
        %d93-126    ; ")", or "\"
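As a quick sanity check on the ctext rule, the decimal ranges can be converted to the hex escapes that a character class needs. This is only a sketch in plain Ruby (no Treetop involved); CTEXT_RANGES, EXCLUDED, hex_fragment and character_class are names made up for the illustration.

```ruby
# Decimal ranges copied from the ctext rule in the ABNF excerpt.
CTEXT_RANGES = [33..39, 42..91, 93..126]

# Render one range as a character-class fragment such as \x21-\x27.
def hex_fragment(range)
  format('\x%02x-\x%02x', range.first, range.last)
end

# Join all fragments into a single bracketed character class.
def character_class(ranges)
  '[' + ranges.map { |r| hex_fragment(r) }.join + ']'
end

# Printable US-ASCII codes (33..126) that fall outside every range.
EXCLUDED = (33..126).reject { |code| CTEXT_RANGES.any? { |r| r.include?(code) } }

puts character_class(CTEXT_RANGES)   # prints [\x21-\x27\x2a-\x5b\x5d-\x7e]
puts EXCLUDED.map(&:chr).inspect     # prints ["(", ")", "\\"]
```

The three excluded characters are exactly the ones the ABNF comment names: "(", ")" and "\".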
Based on this excerpt of the spec, I would expect the following to be
a legal email address:
irb(main):008:0> @e.parse( 'foo@[0.0.0.0](ok)' )
=> nil
As shown, however, Treetop fails to parse it. Did I fail to port ABNF
to Treetop properly? Here are the relevant pieces:
rule addr_spec
  local_part "@" domain
end
rule domain
  dot_atom / domain_literal
end
rule domain_literal
  CFWS? "[" (FWS? dcontent)* FWS? "]" CFWS?
end
rule CFWS
  (FWS? comment)* ((FWS? comment) / FWS)
end
rule comment
  "(" ( FWS? ccontent )* FWS? ")"
end
rule ccontent
  ctext / quoted_pair / comment
end
rule ctext
  NO_WS_CTL / [\x21-\x27\x2a-\x5b\x5d-\x7e]
end
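One possible explanation, offered as an assumption rather than a confirmed diagnosis: Treetop generates PEG parsers, and in a PEG a repetition such as (FWS? comment)* consumes as many items as it can and never gives one back. For the input (ok), the repetition in CFWS would swallow the only comment, leaving nothing for the mandatory ((FWS? comment) / FWS) tail, so CFWS fails and the trailing comment is left unparsed. The sketch below imitates the two behaviours with plain Ruby regular expressions (no Treetop involved). COMMENT, BACKTRACKING, GREEDY and matches? are illustrative names, and the atomic group (?>...) stands in for PEG's non-backtracking repetition.

```ruby
# Stand-in for one "[FWS] comment" item; a fixed literal keeps the demo small.
COMMENT = /\(ok\)/

# ABNF-style reading: the repetition may give back a comment so the tail can match.
BACKTRACKING = /\A(?:#{COMMENT})*(?:#{COMMENT})\z/

# PEG-style reading: the atomic group keeps everything the repetition consumed,
# so the mandatory trailing comment has nothing left to match.
GREEDY = /\A(?>(?:#{COMMENT})*)(?:#{COMMENT})\z/

# True when the pattern matches the whole input.
def matches?(pattern, input)
  !pattern.match(input).nil?
end

puts matches?(BACKTRACKING, '(ok)')   # prints true
puts matches?(GREEDY, '(ok)')         # prints false
```

If this is indeed the cause, writing the rule as (FWS? comment)+ FWS? / FWS should accept the same strings without depending on backtracking; that is the shape the later RFC 5322 gives CFWS. I have not tested that rewrite against Treetop.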
Following is the full RFC2822 Treetop grammar, in case anyone wants to
play with it.
# http://tools.ietf.org/html/rfc2822
# All obsolete rules have been removed
grammar EmailAddress
  rule addr_spec
    local_part "@" domain {
      def pieces
        [ local_part.text_value, domain.text_value ]
      end
    }
  end
  rule local_part
    dot_atom / quoted_string
  end
  rule domain
    dot_atom / domain_literal
  end
  rule domain_literal
    CFWS? "[" (FWS? dcontent)* FWS? "]" CFWS?
  end
  rule dcontent
    dtext / quoted_pair
  end
  rule dtext
    NO_WS_CTL /              # Non white space controls
    [\x21-\x5a\x5e-\x7e]     # The rest of the US-ASCII characters
                             # not including "[", "]", or "\"
  end
  # Non-whitespace control characters
  rule NO_WS_CTL
    [\x01-\x08\x0b-\x0c\x0e-\x1f\x7f]
  end
  rule dot_atom
    CFWS? dot_atom_text CFWS?
  end
  rule dot_atom_text
    atext+ ( "." atext+ )*
  end
  # folding white space
  rule FWS
    (WSP* CRLF)? WSP+
  end
  rule CFWS
    (FWS? comment)* ((FWS? comment) / FWS)
  end
  rule CRLF
    "\r\n"
  end
  rule WSP
    [ \t]
  end
  # Any character except controls, SP, and specials.
  rule atext
    ALPHA / DIGIT / [!#$\%&'*+\/=?^_`{|}~-]
  end
  rule ALPHA
    [A-Za-z]
  end
  rule DIGIT
    [0-9]
  end
  rule text
    [\x01-\x09\x0b-\x0c\x0e-\x7f]
  end
  rule specials
    [()<>\[\]:;@\\,.] / DQUOTE
  end
  rule DQUOTE
    '"'
  end
  rule ccontent
    ctext / quoted_pair / comment
  end
  rule quoted_pair
    "\\" text
  end
  rule qtext
    NO_WS_CTL /                   # Non white space controls
    [\x21\x23-\x5b\x5d-\x7e]      # The rest of the US-ASCII characters
                                  # not including "\" or the quote character
  end
  rule qcontent
    qtext / quoted_pair
  end
  rule quoted_string
    CFWS? DQUOTE (FWS? qcontent)* FWS? DQUOTE CFWS?
  end
  rule comment
    "(" ( FWS? ccontent )* FWS? ")"
  end
  rule ctext
    NO_WS_CTL /                        # Non white space controls
    [\x21-\x27\x2a-\x5b\x5d-\x7e]      # The rest of the US-ASCII characters
                                       # not including "(", ")", or "\"
  end
end