vikkous
Yesterday, I introduced Reg, the Ruby Extended Grammar, a pattern
matching library for ruby data. Astute readers may have noticed a
slight misnomer. Reg is not a grammar (parser), nor a tool for
grammars. It's really just a very fancy regular expression engine.
Regular expressions are equivalent to state machines. State machines
are not powerful enough by themselves to solve interesting parsing
problems -- that is, how to parse a language like ruby with infix
operators of different precedence and associativity. However, reg can
easily be augmented into a real parser that solves such problems.
Handling precedence and associativity requires a lalr(1) parser. Let me
briefly explain the lalr algorithm:
The important lalr data structures are the stack and input. The input
is simply a stream of tokens fed into the parser, as it requests them.
The next token(s) waiting to be taken off the input are called the
lookahead.
The stack contains the results of partially parsed expressions. At each
step of the parse process, the parser decides (based on what's found at
the top of the stack and in the lookahead) whether to shift another
token off the input onto the stack or to reduce some of the tokens at
the top of the stack using the rules of the language's grammar. At the
end, we expect to see the input empty and on the stack a single token,
which represents the parse tree of the entire program.
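For example, here is (roughly) how such a parser handles '3+4*5;':

stack               lookahead   action
[]                  3           shift
[3]                 +           shift
[3, +]              4           shift
[3, +, 4]           *           shift (* binds tighter than +, so no reduce yet)
[3, +, 4, *]        5           shift
[3, +, 4, *, 5]     ;           reduce 4*5 to a single expression
[3, +, (4*5)]       ;           reduce 3+(4*5)
[(3+(4*5))]         ;           done (once the trailing ; is disposed of)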
Parsers built the normal way (by compiler compilers, ie parser
generators) use a big, complicated table to decide at runtime whether
to shift or reduce and, if reducing, which rule to reduce by. This
table represents the compiled form of the language grammar; that's why
the generators are called compiler compilers. My
approach is rather different, and might best be described as an
interpreter interpreter. (Or, if it's to be used in a compiler, it
would be a compiler interpreter.)
Instead of consulting a table to decide whether to shift or which rule
to reduce by, each rule in turn is given a chance to match; when none
can, the input is shifted. Reg is used as the pattern matching engine,
and a small wrapper layer manages the parser data structures and
invokes reg at each step to attempt a match. I believe this approach
is in general equivalent to the normal
lalr algorithm.
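In outline, the wrapper's loop looks something like the sketch below.
This is plain ruby and entirely hypothetical illustration (Rule,
outline_parse, and the toy addition rule are mine, not reg's api); it
also ignores lookahead and error handling for brevity:

#a reduce rule: a matcher deciding whether the top of the stack can be
#reduced, and an action building the reduction's result
Rule=Struct.new(:matcher,:action) do
  def try(stack)
    n=matcher.arity
    return false unless stack.size>=n and matcher[*stack.last(n)]
    stack.push action[*stack.pop(n)]  #reduce: rewrite the stack top
    true
  end
end

def outline_parse(input,rules)
  stack=[]
  until input.empty? and stack.size==1
    next if rules.any? {|r| r.try stack }  #reduce if any rule can
    stack.push input.shift                 #otherwise shift
  end
  stack.first
end

#usage: left-to-right addition over integer tokens
add=Rule.new(proc{|a,op,b| Integer===a and op=='+' and Integer===b},
             proc{|a,op,b| a+b})
p outline_parse([1,'+',2,'+',3],[add])  #=> 6

The real wrapper in calc.reg below does the same dance, except the
rules are a single reg pattern (Reduce) and a lookahead element gates
each reduce.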
Yesterday's reg release contained a sketch of these ideas in the form
of a parser for a small, bc-like calculator language, in calc.reg.
I've also reproduced it below. Basically, it's a subset of ruby with
only local variables, numbers, a few operators (+, -, *, /, =, and ;),
parentheses, and p as the sole function. Although small, parsing this
language is a representative problem because it requires solving
precedence and associativity.
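A program in this little language might look something like:

a = 2 + 3 * 4;
p(a - 2)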
The heart of the parser is its grammar rules, reproduced here:
#last element is always lookahead
Reduce=
  -[ -['p', '(', exp, ')'].sub {PrintExp.new BR[2]}, OB ] |  # p(exp)
  -[ -['(', exp, ')'].sub {BR[1]}, OB ] |                    # (exp)
  -[ -[exp, leftop, exp].sub {OpExp.new *BR[0..2]},
     regproc{lowerop(BR[1])} ] |                             # exp+exp
  -[ exp, -[';'].sub{[]}, :EOI ] |                           # elide final trailing ;
  -[ -[name, '=', exp].sub {AssignExp.new BR[0],BR[2]},
     lowerop('=') ]                                          # name=exp
Precedence is handled by the middle rule. This rule reduces infix
operator expressions (except =). It only matches if the lookahead does
not contain a higher precedence operator. This ensures that expressions
like '3+4*5' will parse correctly.
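To see this concretely: parsing '3+4*5', when the stack holds 3 + 4
and the lookahead is *, the middle rule asks whether + has precedence
at least that of *; since 8 >= 9 is false, the reduce is blocked and
* 5 is shifted instead. Parsing '3*4+5', the test is 9 >= 8, which
passes, so 3*4 is reduced before + is ever shifted.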
Associativity is handled by the last rule. = is the only
right-associative operator, so it's the only one that has to be handled
specially. Again, it allows a reduce only if the lookahead is not also
right-associative (and lower precedence...). This ensures that
expressions like 'a=b=c' will parse correctly.
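Concretely, in 'a=b=c', with a = b on the stack and the second = in
the lookahead, the last rule cannot fire: = doesn't match leftop, so
lowerop('=') rejects it and the parser shifts instead. Only once
a = b = c is entirely on the stack (with ; or end of input as
lookahead) do the reduces happen, b=c first and then a=(b=c), which is
exactly right associativity.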
The great advantage of the interpreter interpreter is flexibility. It
would be quite easy to extend this parser -- even at runtime -- by
adding things at the right place in Reduce. The disadvantage is
performance, which is likely to be very bad currently. The current
implementation of reg is not optimized to any great extent. Many
regexp-type optimizations could be applied to reg. Optimized regexp
engines can actually be quite fast, so, (aside from performance issues
with ruby itself) an optimized reg might actually be competitive with a
table-based parser in terms of performance. Keep in mind that
table-based parsers are not actually the fastest anyway; the gold
standard is hand-coded or direct-execution parsers.
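For instance, a new operator could (hypothetically, and with all the
untested-code caveats of calc.reg below) be or-ed into the grammar at
runtime. Assuming the definitions from calc.reg are in scope, and
building a new pattern rather than reassigning the Reduce constant:

#sketch: graft a right-associative ** operator onto the grammar
precedence[:'**']=10
Reduce2 = Reduce |
  -[ -[exp, '**', exp].sub {OpExp.new *BR[0..2]}, lowerop('**') ]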
Error detection is an area that might be troublesome. I haven't given
this a lot of thought yet, but I think it's approachable, without
causing too much pain. One way might be to wait until a synchronizing
token is seen, and then report the error.
calc.reg:
require 'reg'
#warning: this code is untested
#currently, it will not work because it depends on
#features of reg which do not exist (backreferences
#and substitutions). in addition,
#it is likely to contain serious bugs, as it has
#not been thoroughly tested or assured in any way.
#nevertheless, it should give you a good idea of
#how this sort of thing works.
precedence={
  :'('=>10, :')'=>10,
  :* =>9,   :/ =>9,
  :+ =>8,   :- =>8,
  :'='=>7,
  :';'=>6
}
name=String.reg
#definitions of the expression classes omitted for brevity
exp=name|PrintExp|OpExp|AssignExp|Number
leftop=/^[*\/;+-]$/  #left-associative operators (; included)
rightop=/^=$/        #the only right-associative operator
op=leftop|rightop
#matches a lookahead operator that doesn't outrank opname
#(ie, one that permits reducing the expression now atop the stack)
def lowerop opname
  regproc{
    leftop & proceq(Symbol) {|v| precedence[opname] >= precedence[v] }
  }
end
#last element is always lookahead
Reduce=
  -[ -['p', '(', exp, ')'].sub {PrintExp.new BR[2]}, OB ] |  # p(exp)
  -[ -['(', exp, ')'].sub {BR[1]}, OB ] |                    # (exp)
  -[ -[exp, leftop, exp].sub {OpExp.new *BR[0..2]},
     regproc{lowerop(BR[1])} ] |                             # exp+exp
  -[ exp, -[';'].sub{[]}, :EOI ] |                           # elide final trailing ;
  -[ -[name, '=', exp].sub {AssignExp.new BR[0],BR[2]},
     lowerop('=') ]                                          # name=exp
#last element of stack is always lookahead
def reduceloop(stack)
  old_stack=stack
  #keep reducing the top of the stack until no rule in Reduce matches
  while stack.match +[OBS, Reduce]
  end
  #sanity check: match must have modified stack in place, not replaced it
  stack.equal? old_stack or raise 'error'
end
#last element of stack is always lookahead
def parse(input)
  input<<:EOI               #mark the end of the input
  stack=[input.shift]
  until input.empty? and +[OB,:EOI]===stack
    stack.push input.shift  #shift a token
    reduceloop stack        #then reduce as much as possible
  end
  return stack.first        #the parse tree of the whole program
end
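If the missing reg features existed, using the parser might look
something like this (the whitespace-split 'tokenizer' is purely
illustrative, and assumes Number covers ruby's integers):

tokens="a = 2 + 3 * 4 ;".split.map {|t| /^\d+$/===t ? Integer(t) : t }
tree=parse(tokens)
#hypothetically: AssignExp("a", OpExp(2, "+", OpExp(3, "*", 4)))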