Creating a new Syntax tokenizer

Gavin Kistner

I want to write my own wiki markup language. Pure regexp fails me, as
I need a proper parser to keep track of state.
I thought I'd give Syntax a try, but I'm a little confused as to some
of the specifics.

1) What is a 'region', and how do I use the start_region method? It's
not documented in the API, or the source. (I think this is what I
want for nesting tags.)

2) Do I have to close_group and close_region, or do they
automatically get invoked under certain circumstances? (Does starting
one group close the previous one? Do repeated calls to open the same
group cause them to be aggregated together? Is that how accumulating
text in :normal groups works?)

3) How do I keep track of state during successive calls to #step? I
tried an instance variable, but that doesn't seem to exist across calls.

Following is my terrible, broken attempt at the basics of what I'm
after. Am I totally misunderstanding how to use Syntax?


require 'rubygems'
# require_gem is gone from current RubyGems; a plain require loads the gem.
require 'syntax'

class OWLScribble < Syntax::Tokenizer
  def step
    if heading = scan( /^={1,6}/ )
      # Opening run of '=' at the start of a line: its length is the level.
      start_region "heading level #{heading.length}".intern
      $heading_end = Regexp.new( heading + "\\s*" )
    elsif $heading_end && ( heading = scan( $heading_end ) )
      # The match may include trailing whitespace, so strip it before counting.
      end_region "heading level #{heading.strip.length}".intern
      $heading_end = nil
    elsif char = scan( /^[\r\n]/ )
      start_group :paragraph, char
    elsif scan( /\*\*/ )
      # Each '**' toggles the bold region.
      if $inbold
        end_region :bold
        $inbold = nil
      else
        start_region :bold
        $inbold = true
      end
    elsif char = scan( /./ )
      start_group :normal, char
    else
      # Consume a line break that no earlier branch matched.
      scan( /[\r\n]/ )
    end
  end
end

Syntax::SYNTAX[ 'owlscribble' ] = OWLScribble

str = <<END
Intro paragraph

= Heading 1 =
First **paragraph** under the heading.

== Second **Heading** = very yes ==
Another paragraph.
END

tokenizer = Syntax.load( "owlscribble" )
tokenizer.tokenize( str ) do |token|
  puts "#{token.group} (#{token.instruction}) #{token}"
end
 
Jamis Buck

> I want to write my own wiki markup language. Pure regexp fails me,
> as I need a proper parser to keep track of state.
> I thought I'd give Syntax a try, but I'm a little confused as to
> some of the specifics.
>
> 1) What is a 'region', and how do I use the start_region method?
> It's not documented in the API, or the source. (I think this is
> what I want for nesting tags.)

Regions are groups that can contain other groups nested within them.
Syntax's Ruby tokenizer uses regions to do syntax highlighting of
strings, and interpolated expressions, for example.

start_region is used identically to start_group--you give it the name
of the group you want to start (or continue, if that group is already
open), and an optional string to get things started. (The string
becomes the starter text for the group.)
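
As a rough illustration of the region idea, here is a standalone sketch that uses only StringScanner and not Syntax itself. The region_events name and the event layout are made up for this example; the point is just that a region shows up as an explicit open marker and close marker, with ordinary groups reported in between.

```ruby
require 'strscan'

# Hypothetical sketch, not the Syntax library: a region is reported as an
# open marker and a close marker, and ordinary text groups sit between them.
def region_events(text)
  events  = []
  scanner = StringScanner.new(text)
  in_bold = false
  until scanner.eos?
    if scanner.scan(/\*\*/)
      # '**' opens the bold region if it is closed, and closes it otherwise.
      events << [in_bold ? :region_close : :region_open, :bold, ""]
      in_bold = !in_bold
    else
      # Everything up to the next '*' (or a lone '*') is plain text.
      events << [:none, :normal, scanner.scan(/[^*]+|\*/)]
    end
  end
  events
end

region_events("a **b** c").each do |instruction, group, text|
  puts "#{group} (#{instruction}) #{text.inspect}"
end
```

The "b" text arrives as a :normal group between the :region_open and :region_close markers for :bold, which is the nesting described above.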

> 2) Do I have to close_group and close_region, or do they
> automatically get invoked under certain circumstances? (Does
> starting one group close the previous one? Do repeated calls to
> open the same group cause them to be aggregated together? Is that
> how accumulating text in :normal groups works?)

close_group is automatically called when you start a new group.
close_region is never automatically called, because regions can be
nested, so unless you have a region that you want to persist to the
end of your document, you need to explicitly call it at some point.

Multiple calls of start_group with the same group name do, indeed,
get concatenated together into a single group.
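
To make the aggregation rule concrete, here is a small standalone sketch. GroupCollector is a hypothetical stand-in written for this example, not code from Syntax; it only models the behavior described above, where starting a different group flushes the pending one and starting the same group again keeps appending.

```ruby
require 'strscan'

# Hypothetical sketch of the aggregation rule, not the Syntax library's code.
class GroupCollector
  attr_reader :tokens

  def initialize
    @tokens = []
    @group  = nil
    @chunk  = +""
  end

  # Starting a different group flushes the pending one;
  # starting the same group again just extends the pending text.
  def start_group(group, data = "")
    flush if group != @group
    @group = group
    @chunk << data
  end

  # Emit the pending text as a single [group, text] pair.
  def flush
    @tokens << [@group, @chunk] unless @chunk.empty?
    @chunk = +""
  end
end

collector = GroupCollector.new
scanner   = StringScanner.new("ab1c")
until scanner.eos?
  char = scanner.getch
  collector.start_group(char.match?(/\d/) ? :digit : :normal, char)
end
collector.flush
p collector.tokens
```

Four one-character calls produce three tokens here: "a" and "b" are merged into one :normal token, while the :digit group in the middle forces a flush on either side.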

> 3) How do I keep track of state during successive calls to #step? I
> tried an instance variable, but that doesn't seem to exist across
> calls.

Instance variables should work--I use them successfully in the Ruby
tokenizer, for instance. Feel free to contact me off-list and I can
help troubleshoot this if it isn't working for you.
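
For what it's worth, here is a standalone sketch of the general pattern. BoldTracker is a made-up class built on StringScanner, not a Syntax tokenizer, so it says nothing about why the original attempt failed; it only shows state in an instance variable surviving across step calls when one object drives the whole run.

```ruby
require 'strscan'

# Hypothetical sketch, not the Syntax library: one object handles the whole
# input, so @in_bold keeps its value from one step call to the next.
class BoldTracker
  def tokenize(text)
    @scanner = StringScanner.new(text)
    @in_bold = false   # reset at the start of every run
    @out     = []
    step until @scanner.eos?
    @out
  end

  private

  def step
    if @scanner.scan(/\*\*/)
      # Flip the flag; later step calls will see the new value.
      @in_bold = !@in_bold
    else
      @out << [@in_bold ? :bold : :normal, @scanner.scan(/[^*]+|\*/)]
    end
  end
end

p BoldTracker.new.tokenize("plain **strong** plain")
```

The flag is set in tokenize rather than in initialize so that reusing the same object for a second string starts from a clean state.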

> Following is my terrible, broken attempt at the basics of what I'm
> after. Am I totally misunderstanding how to use Syntax?

Without actually trying to run it, I'd say you've got the idea. This
is an interesting use of Syntax--given that Syntax was intended for
use as a highlighter, I wouldn't have thought to use it as a more
general purpose parser, but it can definitely be used for that.
Clever! :)

- Jamis
 
Gavin Kistner

> > 3) How do I keep track of state during successive calls to #step?
>
> Instance variables should work--I use them successfully in the Ruby
> tokenizer, for instance. Feel free to contact me off-list and I can
> help troubleshoot this if it isn't working for you.

Hrm, I'll have to try it again sometime. Perhaps I screwed up when I
tried it.

> Without actually trying to run it, I'd say you've got the idea.
> This is an interesting use of Syntax--given that Syntax was
> intended for use as a highlighter, I wouldn't have thought to use
> it as a more general purpose parser, but it can definitely be used
> for that. Clever! :)

Thanks for the help in your response. As noted in my OWLScratch post,
after thinking about the problem more and more, I decided that I
wanted a more state-based solution than Syntax seemed to support.
Still, thank you very much for the library, as the concepts in
it really helped me in my thinking (and introduced me to
StringScanner - what a gem!).
 
