Treetop Email Parser

Phrogz · Jan 28, 2008

(I would post this to the treetop mailing list...except there doesn't
seem to be one.)

Warming up to treetop, at lunch today I ported the official RFC2822
email address specification to treetop. (As a grammar that allows
recursive comments, this is notoriously impossible to match exactly
correctly using regular expressions.)

C:\>irb
irb(main):001:0> require 'treetop'
irb(main):002:0> require 'email_address'
irb(main):003:0> @e = EmailAddressParser.new
irb(main):004:0> @e.parse( 'foo' )
=> nil
irb(main):005:0> @e.parse( '(e-mail address removed)' ).pieces
=> ["foo", "bar.com"]
irb(main):006:0> @e.parse( '!@[69.46.18.236]' ).pieces
=> ["!", "[69.46.18.236]"]
irb(main):007:0> @e.parse( '"Gavin Kistner"@[69.46.18.236]' ).pieces
=> ["\"Gavin Kistner\"", "[69.46.18.236]"]

The only 'cheat' I did was to omit any of the branches of the grammar
that were marked as obsolete (and yet included?) in the spec.

I'm not 100% sure that this is working perfectly, however. From the
ABNF spec:
addr-spec = local-part "@" domain
domain = dot-atom / domain-literal / obs-domain
domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
CFWS = *([FWS] comment) (([FWS] comment) / FWS)
comment = "(" *([FWS] ccontent) [FWS] ")"
ccontent = ctext / quoted-pair / comment
ctext = NO-WS-CTL / ; Non white space controls
%d33-39 / ; The rest of the US-ASCII
%d42-91 / ; characters not including "(",
%d93-126 ; ")", or "\"

From the above, I would think that the following was a legal email
address:
irb(main):008:0> @e.parse( 'foo@[0.0.0.0](ok)' )
=> nil

As shown, however, Treetop fails to parse it. Did I fail to port ABNF
to Treetop properly? Here are the relevant pieces:
rule addr_spec
local_part "@" domain
end
rule domain
dot_atom / domain_literal
end
rule domain_literal
CFWS? "[" (FWS? dcontent)* FWS? "]" CFWS?
end
rule CFWS
(FWS? comment)* ((FWS? comment) / FWS)
end
rule comment
"(" ( FWS? ccontent )* FWS? ")"
end
rule ccontent
ctext / quoted_pair / comment
end
rule ctext
NO_WS_CTL / [\x21-\x27\x2a-\x5b\x5d-\x7e]
end

Following is the full RFC2282 Treetop grammar, in case anyone wants to
play with it.

# http://tools.ietf.org/html/rfc2822
# All obsolete rules have been removed
grammar EmailAddress

rule addr_spec
local_part "@" domain {
def pieces
[ local_part.text_value, domain.text_value ]
end
}
end

rule local_part
dot_atom / quoted_string
end

rule domain
dot_atom / domain_literal
end

rule domain_literal
CFWS? "[" (FWS? dcontent)* FWS? "]" CFWS?
end

rule dcontent
dtext / quoted_pair
end

rule dtext
NO_WS_CTL / # Non white space controls
[\x21-\x5a\x5e-\x7e] # The rest of the US-ASCII characters
# not including "[", "]", or "\"
end

# Non-whitespace control characters
rule NO_WS_CTL
[\x01-\x08\x0b-\x0c\x0e-\x1f\x7f]
end

rule dot_atom
CFWS? dot_atom_text CFWS?
end

rule dot_atom_text
atext+ ( "." atext+ )*
end

# folding white space
rule FWS
(WSP* CRLF)? WSP+
end

rule CFWS
(FWS? comment)* ((FWS? comment) / FWS)
end

rule CRLF
"\r\n"
end

rule WSP
[ \t]
end
# Any character except controls, SP, and specials.
rule atext
ALPHA / DIGIT / [!#$\%&'*+\/=?^_`{|}~-]
end

rule ALPHA
[A-Za-z]
end

rule DIGIT
[0-9]
end

rule text
[\x01-\x09\x0b-\x0c\x0e-\x7f]
end

rule specials
[()<>\[\]:;@\\,.] / DQUOTE
end

rule DQUOTE
'"'
end

rule ccontent
ctext / quoted_pair / comment
end

rule quoted_pair
"\\" text
end

rule qtext
NO_WS_CTL / # Non white space controls
[0x21\x23-\x5b\x5d-\x7e] # The rest of the US-ASCII characters
# not including "\" or the quote
character
end

rule qcontent
qtext / quoted_pair
end

rule quoted_string
CFWS? DQUOTE (FWS? qcontent)* FWS? DQUOTE CFWS?
end

rule comment
"(" ( FWS? ccontent )* FWS? ")"
end

rule ctext
NO_WS_CTL / # Non white space controls
[\x21-\x27\x2a-\x5b\x5d-\x7e] # The rest of the US-ASCII
characters
# not including "(", ")", or "\"
end
end

Clifford Heath · Jan 28, 2008

Phrogz said:
(I would post this to the treetop mailing list...except there doesn't
seem to be one.)

It's fine here. A lot of folk here could benefit from using Treetop,
if only they could work out how to. Discussion here can help them.

Your problem lies in the direct translation of this rule, which is
poorly written in the RFC:

CFWS = *([FWS] comment) (([FWS] comment) / FWS)
to:

rule CFWS
(FWS? comment)* ((FWS? comment) / FWS)
end

The issue is that the first parenthesized group is a node that
will eat too much, succeed, then cause the second group to fail.
This is a feature of PEG parsers in general: once a node has
succeeded, it will never be retried even though there may be a
different way it can succeed. Any backtracking that comes to the
same point will remember that it succeeded here once before, and
will assume that the *same* success is correct.

You need to rewrite it to avoid this incorrect success. This rule
is equivalent:

rule CFWS
( FWS / comment )+
end

Just a bit simpler, eh?

Clifford Heath.

P.S. It'd be nice to know whether my other Treetop messages were useful to you...?

Phrogz · Jan 29, 2008

It's fine here. A lot of folk here could benefit from using Treetop,
if only they could work out how to. Discussion here can help them.

Sure. I'd also like a place to send bug reports, however. So far, I've
been sending them to the only contact address I can find for the
entire project - Nathan's email as listed in the gemspec.

Your problem lies in the direct translation of this rule, which is
poorly written in the RFC:

CFWS = *([FWS] comment) (([FWS] comment)/ FWS)
to:

rule CFWS
(FWS? comment)* ((FWS? comment) / FWS)
end

Click to expand...

The issue is that the first parenthesized group is a node that
will eat too much, succeed, then cause the second group to fail.
This is a feature of PEG parsers in general: once a node has
succeeded, it will never be retried even though there may be a
different way it can succeed. Any backtracking that comes to the
same point will remember that it succeeded here once before, and
will assume that the *same* success is correct.

You need to rewrite it to avoid this incorrect success. This rule
is equivalent:

rule CFWS
( FWS / comment )+
end

Just a bit simpler, eh?

You know, I looked at that rule (and some like it) and said "Man, what
were they smoking!?" But yeah, this was a quick translation over lunch
and I explicitly didn't want to try translating anything.

To be fair, that's not quite the same rule, right? Your rule would
allow FWS FWS FWS, while the original requires a comment between them.
(FWS itself only eats one CRLF at a time, so this change allows blank
lines where they were not allowed before. I think.)

I appreciate the help and insight. This insight scares me, I must
admit.

I was happy to accept the no-left-recursion limitation of the grammar,
because it's easy to spot and the downside is infinite recursion. If
it wasn't obvious when writing the grammar, it would be reasonably
obvious at runtime when your parse never finished.

But what you describe here seems far more insidious. This limitation
(which I assume is not in fact limited to PEG in general, but
specifically to the packrat memoization technique that makes this
solution perform reasonably) says to me: There are some cases where
you can write something that looks correct and technically *is*
correct, but that will silently fail on otherwise valid content.
Please (definitely) correct me if I'm wrong. I definitely don't want
to spread FUD.

Can you provide any guidelines or reading points on how to recognize a
set of rules that may fail like this? Is it limited to use of the zero-
or-more star symbol? Particularly likely to occur when any star or
plus quantifier appears at the start of a rule?

Thanks again for your help. I'm still excited about treetop, but less
so as I find that I actually have to know some theory about the
language I'm writing in (*gasp*) instead of being able to naively
throw together BNF-like rules and have it magically all work out.

Phrogz · Jan 29, 2008

....
... This limitation
(which I assume is not in fact limited to PEG in general, but
specifically to the packrat memoization technique that makes this
solution perform reasonably) ...

My apologies for second guessing you. Having read the PEG paper[1], I
see now what you're saying. This PEG rule:
[aeiou]+ 'e'
behaves quite differently from this regular expression:
/[aeiou]+e/

I had thought - based on the use of the terms 'greedy' and
'backtracking' - that the above PEG rule would be able to match a
string like "aaae", just like the regexp. I see now that PEG in
general (packrat or not) are really, REALLY greedy. The above PEG rule
doesn't match that string because [aeiou]+ consumes the whole string,
and when it fails to find an 'e', it does NOT backtrack on the number
of repetitions of the character class.

Wow.

So...how would you write a PEG rule to match "a string of any number
of vowels, that ends with an 'e'"?

[1] http://pdos.csail.mit.edu/~baford/packrat/popl04/peg-popl04.pdf

Clifford Heath · Jan 29, 2008

Phrogz said:
You know, I looked at that rule (and some like it) and said "Man, what
were they smoking!?" But yeah, this was a quick translation over lunch
and I explicitly didn't want to try translating anything.

To be fair, that's not quite the same rule, right? Your rule would
allow FWS FWS FWS, while the original requires a comment between them.

Well, yes-ish. If you look at the definition of FWS, it matches any non-zero
amount of whitespace, as long as it has only one CRLF (I missed this 2nd
restriction when checking equivalence). The easiest way to fix this is to
separate the CRLF from other whitespace, and ensure that there is comment
text (potentially containing whitespace) on every line except the first and
last, or something like that. An exercise for the reader ;-).

I was happy to accept the no-left-recursion limitation of the grammar,
because it's easy to spot and the downside is infinite recursion.

Yes, that's not something that's actually happened to me, so I've never
even needed to spot it

. I ran into it big-time when I first used
TXL though!

But what you describe here seems far more insidious. This limitation
(which I assume is not in fact limited to PEG in general, but
specifically to the packrat memoization technique that makes this
solution perform reasonably)

Correct, I meant the packrat technique, not PEG.

says to me: There are some cases where
you can write something that looks correct and technically *is*
correct, but that will silently fail on otherwise valid content.

There's a reason why Nathan points out that among a set of alternates,
the *first* one that succeeds is taken. That means, even if that makes a
higher-level rule fail, it won't try subsequent alternatives. Yes, it's
a trap, but once you get into the right thought pattern, it's not that
hard to avoid. You basically need to consider where you might mis-read
the text yourself, and introduce negative lookahead that prevents the
mis-reading.

Oh, and use TDD

. You need that to ensure that whitespace is
skipped, or expected, everywhere it should, if for no other reason.
You just won't get it right with TT otherwise... but you won't with
any other tool either.

Can you provide any guidelines or reading points on how to recognize a
set of rules that may fail like this? Is it limited to use of the zero-
or-more star symbol? Particularly likely to occur when any star or
plus quantifier appears at the start of a rule?

Not just at the start of a rule, but anywhere in the rule. You need to
ensure that you insert a !wrong_path predicate to limit the repetition,
any time a repetition might chew too much. Likewise for alternation,
make sure you get the alternates in the right order, and dress any with
negative lookahead when an early choice might be wrong.

Thanks again for your help. I'm still excited about treetop, but less
so as I find that I actually have to know some theory about the
language I'm writing in (*gasp*) instead of being able to naively
throw together BNF-like rules and have it magically all work out.

Have a play with ANTLRworks, which will analyze things and tell you when
you have a problem, and when you find that ANTLR's other restrictions
begin to frustrate the crap out of you (as happened to me), come back to
using Treetop.

Does anyone else wish for the ability of Treetop to emit languages
other than Ruby? I'm probably going to need C# one day, and perhaps
Java.

Clifford Heath.

Clifford Heath · Jan 29, 2008

Phrogz said:
My apologies for second guessing you. Having read the PEG paper[1], I
see now what you're saying.

Well, though the authors of the packrat paper described a parser that
behaves that way, I don't think it's fair to limit the possible other
behaviours of any PEG parser generator. IOW, PEG >= packrat. ANTLR is
also LL(*), but avoids this problem by analyzing the expressions and
either rejecting or re-writing them.

So...how would you write a PEG rule to match "a string of any number
of vowels, that ends with an 'e'"?

Assuming some EOF token (not provided free by Treetop) you'd write:

(!('e' EOF) [aeiou])* 'e'

Of course, in many cases, EOF would be replaced by a white-space or
other rule.

Clifford Heath.

Vidar Hokstad · Jan 29, 2008

The only 'cheat' I did was to omit any of the branches of the grammar
that were marked as obsolete (and yet included?) in the spec.

The "obsolete" productions were not obsolete in RFC 822, and hence
there may still be
a significant number of clients out there _capable_ of emitting e-mail
addresses using
that syntax.

However, in practice, most clients generate a very limited subset of
RFC 822 and/or
RFC 2822 compliant addresses - it's likely safe to ignore the obsolete
rules. 8-9
years ago, when I used to run a webmail system with 1-2 million
accounts, we didn't
even bother trying to parse the full RFC as we never saw e-mail coming
in using the more
"exotic" features, such as nested comments or quoted local parts.

Vidar

Matt Mower · Jan 29, 2008

Does anyone else wish for the ability of Treetop to emit languages
other than Ruby? I'm probably going to need C# one day, and perhaps
Java.

Would I be alone in finding Objective-C support useful?

Regards,

Matt.

Phil Tomson · Jan 29, 2008

Would I be alone in finding Objective-C support useful?

An Objective-C backend for TreeTop would rock!

That said, looking at the output from tt it seems that it's very much
oriented towards Ruby's dynamicity - extend is used extensively, for
example.

Not saying an Objective-C backend would be impossible, though....

Phil

Clifford Heath · Jan 29, 2008

Phil said:
An Objective-C backend for TreeTop would rock!

Once the basic framework for multiple output languages was there,
it wouldn't be hard to add. At the moment there's a Ruby builder,
but the interface doesn't allow a simple substitution of another
builder. A fairly simple cleanup would fix that. I'd also like it
to emit more meaningful names, like "rulename_result" and
"rulename_sequence" instead of "r1" and "s0".

That said, looking at the output from tt it seems that it's very much
oriented towards Ruby's dynamicity - extend is used extensively, for
example.
Not saying an Objective-C backend would be impossible, though....

Not at all. The code blocks would have to be emitted into new classes
for Java. For C#, they'd be partial classes. I don't know Obj-C, but
I doubt it's harder than those two.

Clifford Heath.

Matt Mower · Jan 29, 2008

Not at all. The code blocks would have to be emitted into new classes
for Java. For C#, they'd be partial classes. I don't know Obj-C, but
I doubt it's harder than those two.

No, probably easier.

Once basic support for other languages is there I would be glad to get
involved in generating the Obj-C code.

I already have a use in mind for Treetop + Objective-C + Nu.

M.

Phil Tomson · Jan 30, 2008

Once the basic framework for multiple output languages was there,
it wouldn't be hard to add. At the moment there's a Ruby builder,
but the interface doesn't allow a simple substitution of another
builder. A fairly simple cleanup would fix that.

What do you think would be involved to create this framework for
multiple output langauages? Any design docs?

I'd also like it
to emit more meaningful names, like "rulename_result" and
"rulename_sequence" instead of "r1" and "s0".

Not at all. The code blocks would have to be emitted into new classes
for Java. For C#, they'd be partial classes. I don't know Obj-C, but
I doubt it's harder than those two.

Yes, it's probalby easier in Obj-C.

Phil

Clifford Heath · Jan 31, 2008

Phil said:
What do you think would be involved to create this framework for
multiple output langauages? Any design docs?

No design docs, but the Ruby text is built using a Ruby builder,
which makes it sound fine and ready to go... except that the API
to the builder allows arbitrary text strings to be appended, and
the API assumes modules and extend() etc... Basically the builder
API needs to be cleaned up to encapsulate the semantics and to not
assume Ruby. Then when doing the first *other* language, cleaned
up some more to catch the things you missed the first time.

Clifford Heath.

Phil Tomson · Feb 1, 2008

No design docs, but the Ruby text is built using a Ruby builder,
which makes it sound fine and ready to go... except that the API
to the builder allows arbitrary text strings to be appended, and
the API assumes modules and extend() etc... Basically the builder
API needs to be cleaned up to encapsulate the semantics and to not
assume Ruby. Then when doing the first *other* language, cleaned
up some more to catch the things you missed the first time.

Yeah, I noticed the Builder stuff. I did some changes
(experimentally) to my local version of TreeTop to add an
on_success_callback that gets called when a rule parses
successfully... turns out I didn't need that functionality just now,
but it did get me familiar with some of the inards.

Thinking out loud here... So let's aay somebody did a C++ backend. As
has been mentioned, the current builder makes use of Ruby's extend
functionality quite heavily - modules get built for each alternative
and then get mixed in as needed. To get the functionality in C++ I
suppose you could just build new classes for each rule alternative.

For example the following snippet from tt:

module TopLevelSav1
def get_ports
puts "unknown ports"
end
def port_decl?
false
end
def entity_decl?
puts "Not an entity decl!"
false
end
end

def _nt_top_level_sav
start_index = index
cached = node_cache[:top_level_sav][index]
if cached
@index = cached.interval.end
return cached
end

i0, nr0 = index, []
r1 = _nt_entity_decl
r1.extend(TopLevelSav0)
nr0 << r1
if r1.success?
r0 = r1
r1.update_nested_results(nr0)
else
s2, nr2, i2 = [], [], index
loop do
r3 = parse_anything(SyntaxNode)
nr2 << r3
if r3.success?
s2 << r3
else
break
end
end
r2 = AnyNode.new(input, i2...index, s2, nr2)
r2.extend(TopLevelSav1)
....

The second extend there seems fairly straighforward: when you generate
the AnyNode class, generate a subclass that includes the methods
defined in TopLevelSav1. The first extend seems a bit more
problematic. We get a SyntaxNode object back from the
_nt_entity_decl. We could define a class TopLevelSav0 that includes
the methods defined in that module and then pass the SyntaxNode in and
construct it like so:

class TopLevelSav1 : public SyntaxNode {
public:
vector< Port* > getports();
bool port_decl();
bool entity_decl();
TopLevelSav1(SyntaxNode* sn);
}

... later:

SytaxNode* r1;
...
r1 = _nt_entity_decl();
r1 = new TopLevelSav1(r1);

Or maybe better yet:

r1 = dynamic_cast<TopLevelSav1*>(_nt_entity_decl());

Thoughts?

Oh, BTW: right now we allow Ruby code in rules:

rule foo
'something' {
def got_something?
true
end
}
end

How would we go about this for other languages?

rule foo
'something' {
#C++ code:
bool got_something() { return true; } // but how would you
handle scoping here?
}
end

It seems to me that I'd like to prototype the grammar in Ruby and then
transfer it over to C++ (or ObjC ) later on after I'm sure everything
is working. How could we make this multi-language targetting easier?

Phil

regx expression? (link included)	2	Feb 2, 2008
Parsing, BNF, TreeTop, GhostWheel, ...	11	Aug 12, 2010
HOWTO: Parsing email using Python part2	1	Jul 15, 2011
generate and send mail with python: tutorial	8	Aug 11, 2011
[SUMMARY] Parsing JSON (#155)	12	Feb 7, 2008
Unicode problem with Java Xerces DOM	2	Sep 28, 2004
[SUMMARY] Whiteout (#34)	5	Jun 9, 2005
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Mar 15, 2008

Treetop Email Parser

Phrogz

Clifford Heath

Phrogz

Phrogz

Clifford Heath

Clifford Heath

Vidar Hokstad

Matt Mower

Phil Tomson

Clifford Heath

Matt Mower

Phil Tomson

Clifford Heath

Phil Tomson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads