announcing RubyLexer 0.6.0

vikkous · Apr 25, 2005

Advisory tokens (which would tell me that I am now entering
the condition of if and now leaving it and now entering the
action part of it and so on) might do this.

So you want to match the 'then' with its owning 'if'? That's not
something I've had to do yet, but it shouldn't be hard... How's this
for an interface:
I can add a new method to the Token class, let's call it match_id for
now. Every time there's a token like 'if', '(', 'begin', that starts a
nested context, the match_id of that token will be set to a unique
value. When the corresponding 'end' or ')' comes along, it will have a
match_id with the same value as the corresponding context opening
token. We can easily have 'then' with a match_id corresponding to its
'if' as well. This should make it pretty easy to put the pieces
together again afterward.
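
A rough sketch of how the match_id idea above could behave. Token, OPENERS, CLOSERS and assign_match_ids are hypothetical names made up for illustration, not RubyLexer's real classes, and the input is a pre-split word list rather than real lexer output:

```ruby
# Hypothetical sketch only: none of these names come from RubyLexer itself.
Token = Struct.new(:text, :match_id)

OPENERS = %w[if ( begin].freeze   # tokens that start a nested context
CLOSERS = %w[end )].freeze        # tokens that finish one

def assign_match_ids(words)
  stack = []     # match_ids of the contexts that are still open
  next_id = 0
  words.map do |word|
    if OPENERS.include?(word)
      next_id += 1
      stack.push(next_id)
      Token.new(word, next_id)
    elsif CLOSERS.include?(word)
      # The closer gets the id of the innermost open context.
      Token.new(word, stack.pop)
    elsif word == 'then'
      # Assumes the innermost open context is the owning 'if'.
      Token.new(word, stack.last)
    else
      Token.new(word, nil)
    end
  end
end

tokens = assign_match_ids(%w[if ( a ) then b end])
tokens.each { |t| puts "#{t.text.ljust(5)} #{t.match_id.inspect}" }
```

Here 'if', 'then' and 'end' all end up with the same match_id, and the parentheses share a second one, so the pieces can be grouped by id afterward.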

Hmm... but there are tokens besides 'then' that can serve the same
syntactical role: ':', ';', and newline in this case. So the same thing
would have to happen with them, I guess. Do you want to know things
like, this colon is standing in place of a then? What sorts of thing
besides 'then' do you want to match to their owners?

There are complications for incremental lexing too, which isn't
something I do now, but I want to. Let me think a little about this.
You might be getting these features in a subclass of RubyLexer.

Heh. I just realized that strings now work the way you wanted
originally, but I'm going to break that in a future version to be the
way I want it.

In the past I have frequently had trouble
with the distinction of lexing and parsing in real language
parsing -- most languages require you to keep some context
for actually tokenizing them. Ruby, for example, requires that
your lexer knows about all kinds of quoted Strings and where
they end and interpolated expressions inside them.

You can say that again. The amount of extra (non-lexical, strictly
speaking) work to get RubyLexer working was phenomenal. You wouldn't
believe all the squirrelly little cases. It makes the language easy to
use, but hard to process programmatically. Given the choice, I'd like to
find a different way next time. If there could be one tool that does
both at once... I don't know what that would look like. Reg might be
able to do both, but in separate stages.

Nope, not really. I've just used it out of IRB. Integrating it
ought to be possible, but I'm not sure why that would be
necessary.

It's necessary because I want to. Because irb's lexer is sometimes
wrong, and freaks like me who use irb to explore the syntax get fooled
sometimes. Because irb could use it to colorize input and output.
(Maybe its current lexer would serve for the last purpose...)

Good luck.

I got a little way through it... aside from the unique use of
whitespace, my big problem so far is handling the dos-style newlines. I
handle common cases of it now, but pre is anything but common. Are you
a windows person, or did you do that just to be more deviant and make
my life difficult?

vikkous · Apr 25, 2005

Peter said:
I am currently constructing an LALR parser for Ruby using
RubyLexer for the Alumina-VM project. I suspect that
RubyLexer is going to make this much cleaner.

Please see my post titled, "Lalr(n) parsing with reg". Peter's taking
the traditional approach; I've got my own weird ideas that I want to
try.

Florian Groß · Apr 26, 2005

vikkous said:
So you want to match the 'then' with its owning 'if'? That's not
something I've had to do yet, but it shouldn't be hard... How's this
for an interface:
I can add a new method to the Token class, let's call it match_id for
now. Every time there's a token like 'if', '(', 'begin', that starts a
nested context, the match_id of that token will be set to a unique
value. When the corresponding 'end' or ')' comes along, it will have a
match_id with the same value as the corresponding context opening
token. We can easily have 'then' with a match_id corresponding to its
'if' as well. This should make it pretty easy to put the pieces
together again afterward.

It is not so important to match the then to the if to me -- it is just
important to get the part that comes between the if and the matching
'then', ':', ';' or newline. I'm not sure if you even need to do it as
you described -- I thought having a special mode / sub-class lexer which
emits contextual tokens that are no real tokens would already do this
fairly well while also being reasonably simple. So

if condition then action end

would produce a token stream similar to

# pardon me if my way of representing this is not at all compatible
# with RubyLexer's design -- I need to get familiar with it soon
[KeyWord['if'], IfConditionStart, VariableOrMethod['condition'],
IfConditionEnd, KeyWord['then'], IfActionStart,
VariableOrMethod['action'], IfActionEnd, KeyWord['end']]

And I think that that would be easier to analyze than the non-annotated
token stream. Of course you would still have to do nesting counting to
be able to extract the sections, but I think that would be reasonable
for simplicity's sake.
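
To make the nesting counting concrete, here is a small sketch. The token representation (plain symbols for the advisory markers, [kind, text] pairs for real tokens) and the extract_sections helper are assumptions for illustration and say nothing about how RubyLexer actually represents tokens:

```ruby
# Hypothetical representation: advisory markers are symbols, real tokens are
# [kind, text] pairs. extract_sections is an illustrative helper.
def extract_sections(tokens, start_marker, end_marker)
  sections = []
  depth = 0        # how many start markers are currently open
  current = nil    # tokens collected for the outermost open section
  tokens.each do |tok|
    if tok == start_marker
      depth += 1
      if depth == 1
        current = []
        next       # do not store the outermost start marker itself
      end
    elsif tok == end_marker
      depth -= 1
      if depth.zero?
        sections << current
        current = nil
        next       # do not store the outermost end marker itself
      end
    end
    current << tok if current
  end
  sections
end

stream = [
  [:keyword, 'if'], :if_condition_start, [:ident, 'condition'],
  :if_condition_end, [:keyword, 'then'], :if_action_start,
  [:ident, 'action'], :if_action_end, [:keyword, 'end']
]

p extract_sections(stream, :if_condition_start, :if_condition_end)
```

Nested markers of the same kind stay inside the outer section, so the same helper can be run again on the extracted tokens to dig out inner conditions.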

You can say that again. The amount of extra (non-lexical, strictly
speaking) work to get RubyLexer working was phenomenal. You wouldn't
believe all the squirrelly little cases. It makes the language easy to
use, but hard to process programmatically. Given the choice, I'd like to
find a different way next time. If there could be one tool that does
both at once... I don't know what that would look like. Reg might be
able to do both, but in separate stages.

Hm, why is that? Could it not use the rules it uses for parsing for
one-token-at-a-time-ahead lexing?

I'm not sure whether not having lexing and parsing more unified has
benefits or downsides with your approach. I guess I will just have to
write a Joy interpreter using all this. Do you think that that can
already be done or are there features missing that would make it wise to
delay this further?

[Integrating the lexer with IRB]
It's necessary because I want to. Because irb's lexer is sometimes
wrong, and freaks like me who use irb to explore the syntax get fooled
sometimes. Because irb could use it to colorize input and output.
(Maybe its current lexer would serve for the last purpose...)

Heh, you must have been reading old postings of mine. IRB doing syntax
highlighting as you type has been on my wish list for a while.

That aside, I think I misunderstood you. I originally thought you wanted
to integrate IRB's lexer with your tool chain, but it appears that you
want to instead integrate your lexer with IRB.

I think such things are possible fairly easily with Ruby -- after all
you just have to emulate the method interfaces of the part you want to
replace and swap it out.

I have done similar things with ruby-breakpoint where I overwrite parts
of IRB so that it can be split into a client and a server. The server
part does not use STDIN/STDOUT which means I can then use IRB for
debugging CGI applications and pretty much everything else as well.

[pre.rb]
I got a little way through it... aside from the unique use of
whitespace, my big problem so far is handling the dos-style newlines. I
handle common cases of it now, but pre is anything but common. Are you
a windows person, or did you do that just to be more deviant and make
my life difficult?

Heh, I'm really one of them Windows users and mostly happy so far though
I think I would not object to a free switch to Mac OS X if the
opportunity ever turned up.

Had I wanted to make this yet more difficult I would have mixed multiple
styles of newlines.

Now I actually do wonder if using CRLF instead of LF does anything
special to newline-delimited literals on any platforms.

vikkous · Apr 26, 2005

would produce a token stream similar to

# pardon me if my way of representing this is not at all compatible
# with RubyLexer's design -- I need to get familiar with it soon
[KeyWord['if'], IfConditionStart, VariableOrMethod['condition'],
IfConditionEnd, KeyWord['then'], IfActionStart,
VariableOrMethod['action'], IfActionEnd, KeyWord['end']]

And I think that that would be easier to analyze than the non-
annotated token stream. Of course you would still have to do
nesting counting to be able to extract the sections, but I think
that would be reasonable for simplicity's sake.

Ok, fair enough. Maybe this way is easier after all.

Hm, why is that? Could it not use the rules it uses for parsing
for one-token-at-a-time-ahead lexing?

I just can't see this. The lexer rules' input is the source file, but
the parser's is the parse stack -- which comes from the lexer's output
ultimately.... this can be a very powerful way to compose pattern
matchers, but in the end different rule sets are used with 2 different
inputs.

The lexer and parser can run interleaved, and the lexer can get
information from the parser to help interpret things (this is sometimes
called "cheating", but it isn't; it's often the easiest way). But
there's still the two rule sets. I don't know if it's possible to have
1 rule set do both at once, but the idea is intriguing.
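
As a toy illustration of the lexer taking a hint from the parser, consider deciding whether '/' starts a regex or is a division operator. ToyLexer, its expect_operand flag and lex_all are hypothetical and far simpler than what RubyLexer or reg really do; lex_all merely stands in for a parser:

```ruby
require 'strscan'

# Toy example only: not RubyLexer's or reg's real interface.
class ToyLexer
  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  # expect_operand is the hint a parser would supply: true when the grammar
  # needs a value next, false when it needs an operator.
  def next_token(expect_operand)
    @scanner.skip(/\s+/)
    return nil if @scanner.eos?

    if expect_operand && @scanner.scan(%r{/(?:\\.|[^/\\])*/})
      [:regex, @scanner.matched]
    elsif @scanner.scan(/\w+/)
      [:word, @scanner.matched]
    else
      [:op, @scanner.getch]
    end
  end
end

# Stand-in for a parser: after an operator it wants an operand, otherwise it
# wants an operator, and it passes that state to the lexer on every call.
def lex_all(source)
  lexer = ToyLexer.new(source)
  tokens = []
  expect_operand = true
  while (tok = lexer.next_token(expect_operand))
    tokens << tok
    expect_operand = (tok.first == :op)
  end
  tokens
end

p lex_all('a / b')    # '/' is lexed as an operator
p lex_all('x = /b/')  # '/b/' is lexed as a regex
```

The two rule sets are still separate here: the lexer's rules run over the source text, while the "parser" only sees tokens and feeds one bit of state back.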

I'm not sure whether not having lexing and parsing more
unified has benefits or downsides with your approach. I
guess I will just have to write a Joy interpreter using all this.
Do you think that that can already be done or are there
features missing that would make it wise to delay this further?

I took a little look at Joy. Hoo-boy. I'm guessing this language is
pretty easy to parse. I would say reg is not ready for anything
significant until it has backreferences and substitutions. At that
point, it's got match-and-replace, and retrieval of arbitrary match
subexpressions. If you think you can live without those, I'd say go for
it. There are some problems with the backtracking engine, but so far as
I can see, only a whole lot of ambiguity causes the problems, so it's
_probably_ ok for most things.

Had I wanted to make this yet more difficult I would have
mixed multiple styles of newlines.

Now I actually do wonder if using CRLF instead of LF does
anything special to newline-delimited literals on any
platforms.

Sure enough, I translated to unix format and the problems disappeared.
Using a dos newline as a delimiter in a fancy string is just a little
difficult for me because I had always assumed string delimiters were a
single character... hrm.
Here documents need this functionality to really support dos newlines
correctly too, I think.
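
For anyone hitting the same thing, the "translate to unix format" workaround can be sketched like this. normalize_newlines is a hypothetical helper, not something RubyLexer provides, and it changes byte offsets as well as any carriage returns that were meant to be literal string content:

```ruby
# Workaround sketch, not RubyLexer's own behaviour: fold DOS (CRLF) and
# stray CR line endings into LF before the source reaches a lexer.
def normalize_newlines(source)
  source.gsub(/\r\n?/, "\n")
end

# A newline-delimited % string written with DOS line endings.
dos_source = "puts %\r\nhello\r\n"
p normalize_newlines(dos_source)
```

After normalizing, the string delimiter is a single character again, which is the assumption described above.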

Florian Groß · Apr 26, 2005

vikkous said:
I just can't see this. The lexer rules' input is the source file, but
the parser's is the parse stack -- which comes from the lexer's output
ultimately.... this can be a very powerful way to compose pattern
matchers, but in the end different rule sets are used with 2 different
inputs.

The lexer and parser can run interleaved, and the lexer can get
information from the parser to help interpret things (this is sometimes
called "cheating", but it isn't; it's often the easiest way). But
there's still the two rule sets. I don't know if it's possible to have
1 rule set do both at once, but the idea is intriguing.

Hm, this might be related to me thinking pretty much in Regexps as that
has turned out to be quite simple. Is it not possible to apply your
extended expressions to Strings? Perhaps by .scan(/./)?

I took a little look at Joy. Hoo-boy. I'm guessing this language is
pretty easy to parse. I would say reg is not ready for anything
significant until it has backreferences and substitutions. At that
point, it's got match-and-replace, and retrieval of arbitrary match
subexpressions. If you think you can live without those, I'd say go for
it. There are some problems with the backtracking engine, but so far as
I can see, only a whole lot of ambiguity causes the problems, so it's
_probably_ ok for most things.

Yup, it ought to be relatively simple to parse, though I still don't
like lexing it as you don't want to handle spaces specially in Strings
and so on.

I'm not even sure if I will need non-trivial backtracking or
substitutions which is probably a sign I will need them.

Sure enough, I translated to unix format and the problems disappeared.

Was this the only problem? I think that my usage of here-docs might turn
out to be quite exotic as well.
