Writing a parser

M

Martin Hansen

Hello there,

I am struggling to get a grip on OO (and Ruby) and write a parser for
this particular data format described at the below get_record method.
The goal is to parse and emit records of this type and be able to
manipulate these with the methods in class Hash.

For now I am trying to get parsing to work, and the problem here is,
that the add methods is not defined for Hash. What is the best way of
doing that?



M




class Record < Hash
SEPERATOR = "\n---\n"

def initialize(path=nil)
@path = path
@ios = File.open(@path, 'r') if @path
end

# Reads the next 'record' from the I/O stream. The records are
separated by
# lines of "---" and each record consists of a varying number of lines
with a
# key/value pair separated by ": ". The record is returned as a hash.
def get_record
block = @ios.gets(SEPERATOR).chomp!(SEPERATOR)

record = {}
block.each_line do |line|
fields = line.split(": ")
raise ArgumentError, "Bad record line: #{line}" if fields.length
!= 2

record.add(fields[0],fields[1])
end

record
end

# Add a key value pair to this record.
def add(key, value)
raise ArgumentError, "Bad key contains ': ' or '\n' ->#{key}" if
key.match(": |\n")
raise ArgumentError, "Bad value contains ': ' or '\n' ->#{value}" if
value.match(": |\n")

self[key] = value
end

# Return the content of this record as a string.
def to_s
return nil if empty?
self.map { | key, val | "#{ key }: #{ val }" }.join( "\n" ) +
SEPERATOR
end
end
 
A

Aldric Giacomoni

Martin said:
Hello there,

I am struggling to get a grip on OO (and Ruby) and write a parser for
this particular data format described at the below get_record method.
The goal is to parse and emit records of this type and be able to
manipulate these with the methods in class Hash.

For now I am trying to get parsing to work, and the problem here is,
that the add methods is not defined for Hash. What is the best way of
doing that?

record = {}
block.each_line do |line|
fields = line.split(": ")
raise ArgumentError, "Bad record line: #{line}" if fields.length
!= 2

record.add(fields[0],fields[1])
end

irb(main):001:0> record = {}
=> {}
irb(main):002:0> record[:name] = "Doe, John"
=> "Doe, John"
irb(main):003:0> record[:eye_color] = "Blue"
=> "Blue"
irb(main):004:0> record[:age] = 5
=> 5
irb(main):005:0> record
=> {:age=>5, :name=>"Doe, John", :eye_color=>"Blue"}
 
M

Martin Hansen

Ah, yes, that is how a hash works :). Now I am interested in getting my
add method to work because it does some consistency checking on the key
and value. However, my add method is defined the wrong way/place.
 
A

Aldric Giacomoni

Martin said:
Ah, yes, that is how a hash works :). Now I am interested in getting my
add method to work because it does some consistency checking on the key
and value. However, my add method is defined the wrong way/place.

class Hash
def add
# your code here
end
end

But monkey-patching can be a Bad Thing(tm) so be careful. I would
recommend doing your checks in your code, or maybe in a separate
function.
Or, you could even create a real Record object instead of just a hash!

class Record < Hash
# studd
end
 
M

Martin Hansen

Ah, of cause I am creating record of class Hash and not class Record.
This does not Err (I wonder if it is sane though):

class Record < Hash
SEPERATOR = "\n---\n"

include Enumerable

def initialize(path=nil)
@path = path
@ios = File.open(@path, 'r') if @path
end

# Reads the next 'record' from the I/O stream. The records are
separated by
# lines of "---" and each record consists of a varying number of lines
with
# a key/value pair separated by ": ". The record is returned as a
hash.
def get_record
block = @ios.gets(SEPERATOR).chomp!(SEPERATOR)

record = Record.new

block.each_line do |line|
fields = line.split(": ")
raise ArgumentError, "Bad record line: #{line}" if fields.length
!= 2

record.add(fields[0],fields[1])
end

record
end

# Add a key value pair to this record.
def add(key, value)
raise ArgumentError, "Bad key contains ': ' or '\n' ->#{key}" if
key.match(": |\n")
raise ArgumentError, "Bad value contains ': ' or '\n' ->#{value}" if
value.match(": |\n")

self[key] = value
end

# Return the content of this record as a string.
def to_s
return nil if empty?
self.map { | key, val | "#{ key }: #{ val }" }.join( "\n" ) +
SEPERATOR
end
end
 
R

Robert Klemme

2010/4/16 Martin Hansen said:
Ah, of cause I am creating record of class Hash and not class Record.
This does not Err (I wonder if it is sane though):

class Record < Hash
=A0SEPERATOR =3D "\n---\n"

Inheriting from core classes is rarely a good idea. Basically your
records do not need to be more than Hashes - you just need a proper
implementation of fetching records.

class RecordSource
include Enumerable

def self.foreach(file, &b)
if block_given?
File.open(file) do |io|
rs =3D new(io)
rs.each_record(&b)
end
else
Enumerator.new(self, :foreach, file)
end
end

def self.open(file)
File.open(file) do |io|
yield new(io)
end
end

def initialize(io)
@io =3D io
end

def each_record
curr =3D {}

@io.each_line do |line|
case line
when /^([^:]+):(.*)$/
curr[$1] =3D $2
when /^---$/
yield curr unless curr.empty?
curr =3D {}
else
raise "Illegal format: #{line}"
end
end

yield curr unless curr.empty?

self # conventionally
end

alias :each :each_record
end

# usage

RecordSource.open "myfile" do |rs|
rs.each_record do |rec|
p rec
end
end

RecordSource.foreach("myfile") do |rec|
p rec
end

p RecordSource.foreach("myfile").to_a


Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

Robert,

This got pretty advanced fast :/. Multiple things are going on at the
same time - and I have grown up learning that a piece of code should
only do one thing at a time - only one! I think, I need to back up a
step or two to understand what is going on. Perhaps you could enlighten
me on how you addressed this problem strategically?

And how would the appropriate Unit Tests for this thing look like?

Cheers,


M
 
R

Robert Klemme

This got pretty advanced fast :/. Multiple things are going on at the
same time - and I have grown up learning that a piece of code should
only do one thing at a time - only one!

Hey, that's a good thing to learn early! :)
I think, I need to back up a
step or two to understand what is going on. Perhaps you could enlighten
me on how you addressed this problem strategically?

Well, basically I oriented myself on the way IO (and File) and similar
classes work (for example CSV): they iterate all elements which are
individually handed over to a block. That's the interface part.
Basically this is also the contract of Enumerable; #each hands off every
item to the block provided.

I separated the file opening from the iteration because you may want to
iterate other types of things that respond to #each_line (ARGF comes to
mind, but String and StringIO as well). This gives you the flexibility
to pull the data from wherever you like.

Parsing itself is pretty straightforward I'd say; at least I am using
the idiom frequently. You read line by line and decide what kind of
line you see and what you need to do with it. Regular expressions are
nice for this combined with a "case".

Finally, I aliased #each to #each_record so including Enumerable makes
sense (Enumerable relies on #each being present). That gives you all
the nice methods like #select, #find etc. for free. And I added the
convenience method #foreach to the class so you can use that IO#foreach
/ File#foreach.

Downside of this approach is of course that you cannot fetch records
individually like in your code where you have a method that fetches only
the next record.
And how would the appropriate Unit Tests for this thing look like?

You could create a IO mock that returns lines as that of valid (and
invalid!) records and verify that parsing returns exactly those records
that you expect. Of course you need various tests for valid and invalid
data etc.

Kind regards

robert
 
R

Ryan Davis

# Reads the next 'record' from the I/O stream. The records are
separated by
# lines of "---" and each record consists of a varying number of = lines
with a
# key/value pair separated by ": ". The record is returned as a hash.
def get_record

I'm not sure why nobody bothered to point this out, or maybe I'm missing =
something...
 
M

Martin Hansen

@Ryan,


This is not YAML. YAML allows arbitrarily complex data structures, and
the separator comes before the data (for some stupid reason). One could
possibly parse this data with a YAML parser, but my experience with the
YAML-syck parser (from Perl), which is written in C, is that it is much
slower than a simple parser. And I want speed.




@Robert,

I am still digesting :eek:) ...


M
 
R

Ryan Davis

This is not YAML. YAML allows arbitrarily complex data structures, and=20=
the separator comes before the data (for some stupid reason). One = could=20
possibly parse this data with a YAML parser, but my experience with = the=20
YAML-syck parser (from Perl), which is written in C, is that it is = much=20
slower than a simple parser. And I want speed.

And you chose ruby? Ruby isn't exactly known for speed. And I can =
practically guarantee you that the syck parser will be faster for hashes =
of strings than your parser (as it is currently written and several =
minor iterations out).

But no, what you described IS yaml, it is just a subset of yaml. You can =
enforce the subset-ness if you feel like it, but I'm not a big fan of =
static typing, so I don't recommend it.

At the very least, you should look at StringScanner and a full rewrite, =
drop the silly Record class entirely and build up proper hashes of =
strings. And don't leave those dangling File objects everywhere. Read =
the whole file or use the block form of file.open.
 
M

Martin Hansen

And you chose ruby? Ruby isn't exactly known for speed. And I can
practically guarantee you that the syck parser will be faster for hashes
of strings than your parser (as it is currently written and several
minor iterations out).

But no, what you described IS yaml, it is just a subset of yaml. You can
enforce the subset-ness if you feel like it, but I'm not a big fan of
static typing, so I don't recommend it.

Yes, for I want to use Ruby to juggle the records - and then let's see
about the speed :eek:). I know from Perl that there is a quite a difference
between a home rolled parser and YAML-syck (I never saw any subset-ness
features).

Also, I am learning here. This OO thinking is very exotic ...
At the very least, you should look at StringScanner and a full rewrite,
drop the silly Record class entirely and build up proper hashes of
strings. And don't leave those dangling File objects everywhere. Read
the whole file or use the block form of file.open.

I will gladly consider a rewrite - I am afraid that I find Robert's
contribution confusing (am I unfair?). I read about StringScanner in the
Cookbook IIRC, but I better explore further. Why is the Record class
silly?


M
 
R

Robert Klemme

2010/4/17 Martin Hansen said:
Yes, for I want to use Ruby to juggle the records - and then let's see
about the speed :eek:). I know from Perl that there is a quite a difference
between a home rolled parser and YAML-syck (I never saw any subset-ness
features).

Also, I am learning here. This OO thinking is very exotic ...

Yeah, it took me a while as well to grok it initially. I think it's
worthwhile, once simply because OO is mainstream nowadays but even
more importantly because it is a very powerful concept which lends
itself very well to modeling problems - especially in a language with
a clean and nice syntax as Ruby.
I will gladly consider a rewrite - I am afraid that I find Robert's
contribution confusing (am I unfair?).

You have every right to be confused. If you continue the thread at
the other end and point out which of my attempts to explain what's
going on caused you trouble, I can try to refine the explanation and
help clear things up.
Why is the Record class silly?

Not sure which remark you are referring to. I said that inheriting
core classes is rarely a good idea and if Hash functionality is all
you need for your records then you can simply use a Hash and just need
some other place to put the parsing / reading functionality.

Kind regards

robert
 
M

Martin Hansen

@Robert,

I was wondering about the whole approach, keeping in mind what Ryan
suggested (drop class Record which apparently is silly, and loose
dangling IO stuff). It strikes me that the simplest way to solve this is
to create an external iterator as a mixin (O'Reilly, Ruby Programming
Language, p 138). This is pretty much what you already suggested,
without the foreach method, which I don't need.

module Record
include Enumerable

def each
loop { yield self.next }
end
end
 
R

Robert Klemme

2010/4/19 Martin Hansen said:
@Robert,

I was wondering about the whole approach,

Erm, that's a rather broad statement. My approach had two parts:

a) Implementation of parsing which I changed from using gets() with a
delimiter because I found that approach too fragile (think of files on
different platforms with different line delimiters, in this case the
defined delimiter does not always work).

b) Embedding that in code that follows usual Ruby iterator patterns
and integrates seamless with Enumerable.
keeping in mind what Ryan
suggested (drop class Record which apparently is silly, and loose
dangling IO stuff). It strikes me that the simplest way to solve this is
to create an external iterator as a mixin (O'Reilly, Ruby Programming
Language, p 138). This is pretty much what you already suggested,
without the foreach method, which I don't need.

module Record
=A0include Enumerable

=A0def each
=A0 =A0loop { yield self.next }
=A0end
end

This implementation has flaws:

1. The loop does not terminate properly.

2. It does not return "self" which is convention for #each implementations.

3. Worst: there is a design issue here. An #each method of a Record
is expected to iterate through items _within_ (or contained in) the
Record. But what you are actually attempting here is to have
something that during iteration returns records (regardless of their
type). Since your method #each does not have arguments member
variables are needed to determine where to fetch records from. But
since you call this module "Record" you have a weird situation that
you need to create something Recordish (i.e. something with record
state) which also has non record state (for example an IO object) to
start returning other records from somewhere else. This is highly
confusing and hence something I would not let pass if I were doing the
code review on your code. :)

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

a) Implementation of parsing which I changed from using gets() with a
delimiter because I found that approach too fragile (think of files on
different platforms with different line delimiters, in this case the
defined delimiter does not always work).

I admit that you implementation will be more robust and more efficient.
However, this parser is meant to be run in a *NIX command line
environment so I can connect scripts with pipes (I should perhaps have
stated this in the "specs"). Actually, I have some more text on this
whole affair:

www.biopieces.org

and

http://code.google.com/p/biopieces/wiki/Introduction#Getting_started

b) Embedding that in code that follows usual Ruby iterator patterns
and integrates seamless with Enumerable.


This implementation has flaws:

1. The loop does not terminate properly.


But I think it does?!? From the Ruby book:

"External iterators are quite simple to use: just call next each time
you want another element. When there are no more elements left, next
will raise a StopIteration excep- tion. This may seem unusual—an
exception is raised for an expected termination condition rather than an
unexpected and exceptional event. (StopIteration is a de- scendant of
StandardError and IndexError; note that it is one of the only exception
classes that does not have the word “error†in its name.) Ruby follows
Python in this external iteration technique. By treating loop
termination as an exception, it makes your looping logic extremely
simple; there is no need to check the return value of next for a special
end-of-iteration value, and there is no need to call some kind of next?
predicate before calling next."

Not that I have been able to get the syntax right and make this work ...

2. It does not return "self" which is convention for #each
implementations.

I have heard of this convention before, but I seem to have missed it in
the Ruby book. I shall see if I can find it.
3. Worst: there is a design issue here. An #each method of a Record
is expected to iterate through items _within_ (or contained in) the
Record. But what you are actually attempting here is to have
something that during iteration returns records (regardless of their
type). Since your method #each does not have arguments member
variables are needed to determine where to fetch records from. But
since you call this module "Record" you have a weird situation that
you need to create something Recordish (i.e. something with record
state) which also has non record state (for example an IO object) to
start returning other records from somewhere else. This is highly
confusing and hence something I would not let pass if I were doing the
code review on your code. :)

I was going to forget about the "silly" class Record and stick to hashes
of strings. So modifying the behavior of each to something like this:

include 'my_record'

File.open('my_file', 'r').each do |record|
puts record.class # => Hash
end

Or is this completely nuts?




Thanks,


M
 
R

Robert Klemme

2010/4/20 Martin Hansen said:
I admit that you implementation will be more robust and more efficient.
However, this parser is meant to be run in a *NIX command line
environment so I can connect scripts with pipes (I should perhaps have
stated this in the "specs").

Well, there are also pipes on Windows. And even if your program is
run on a POSIX system the file might have been created on a Windows
box.

It does not matter that much whether you read from a pipe or file. If
you create something that only needs #each_line like I did you can
pass a wide range of sources - including ARGF which will happily read
from stdin if you do not provide file name arguments to your script.
Actually, I have some more text on this
whole affair:

www.biopieces.org

and

http://code.google.com/p/biopieces/wiki/Introduction#Getting_started




But I think it does?!?

Did you try it out?
From the Ruby book:

Which book?
"External iterators are quite simple to use: just call next each time
you want another element. When there are no more elements left, next
will raise a StopIteration excep- tion. This may seem unusual=E2=80=94an
exception is raised for an expected termination condition rather than an
unexpected and exceptional event. (StopIteration is a de- scendant of
StandardError and IndexError; note that it is one of the only exception
classes that does not have the word =E2=80=9Cerror=E2=80=9D in its name.)= Ruby follows
Python in this external iteration technique. By treating loop
termination as an exception, it makes your looping logic extremely
simple; there is no need to check the return value of next for a special
end-of-iteration value, and there is no need to call some kind of next?
predicate before calling next."

Frankly, I dissent that logic. First, it uses exceptions for
*regular* control flow. This is bad in itself (I am pretty sure
you'll find plenty of articles on the web about this topic if you look
for it). Then, it does not even follow conventions implemented in the
Ruby standard library (for example, IO#gets returns nil on EOF).
Not that I have been able to get the syntax right and make this work ...



I have heard of this convention before, but I seem to have missed it in
the Ruby book. I shall see if I can find it.


I was going to forget about the "silly" class Record and stick to hashes
of strings. So modifying the behavior of each to something like this:

include 'my_record'

File.open('my_file', 'r').each do |record|
=C2=A0puts record.class # =3D> Hash
end

Or is this completely nuts?

It does not work. File.open returns a File (or IO) instance.
Invoking #each on this will fill "record" with lines - not records.
Your line "record.class" will print "String" and not "Hash".

But I see that basically you want what I provided: a mechanism to
iterate records with #each.

At the moment I am not sure where we stand. You seem to insist that
something with my implementation is wrong and external iteration is
better - maybe because you have read about it in some book.
Realistically the need for external iteration rarely surfaces, mostly
it is if you need to iterate two or more collections in lock step. In
all other cases using #each and like mechanisms is superior because it
is less error prone.

Cheers

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

At the moment I am not sure where we stand. You seem to insist that
something with my implementation is wrong and external iteration is
better - maybe because you have read about it in some book.
Realistically the need for external iteration rarely surfaces, mostly
it is if you need to iterate two or more collections in lock step. In
all other cases using #each and like mechanisms is superior because it
is less error prone.

Robert, your help is much appreciated. I have been researching and
trying to get a grasp at the topics at hand: mixins, enumerable, etc -
and trying to keep an open mind to alternatives. Todays task is to get
your suggestion up and running and play with it by writing a bunch of
unit tests (a great way to learn btw). Me needs to become friends with
stubs (stringio) and mocks (flexmock) ...

Cheers,


Martin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,152
Members
46,697
Latest member
AugustNabo

Latest Threads

Top