[QUIZ] Parsing JSON (#155)

T

tho_mica_l

I liked this quiz because it made me look into treetop, ragel and
some
other libraries I wanted to examine a little bit closer for quite a
while now. Anyway, for my (official) solution I took the easy road
and
rely on ruby to do the actual work.

My solution is ruby19 only, since I used the opportunity to explore
some
of the new regexp features. Since it uses eval(), there is a
possibility
for ruby code injection like the ultimatively relieving "#{`sudo rm -
rf
/`}". I think my solution catches such attacks though.

BTW, what do you all think should be the canonic output of the
following
JSON snippet:

json1 = <<JSON
{"a":2,"b":3.141,"TIME":"2007-03-14T11:52:40","c":"c","d":[1,"b",
3.14],"COUNT":666,"e":{"foo":"bar"},"foo":"B\\u00e4r","g":"\\u677e\
\u672c\\u884c\\u5f18","h":1000.0,"bar":"\\u00a9 \\u2260 \\u20ac!","i":
0.001,"j":"\\ud840\\udc01"}
JSON

I get conflicting results between various versions of my solution and
the official ruby19 parser with respect to these utf characters. This
snippet is taken (IIRC) from the ruby-json parser.

Regards,
Thomas.



#!/usr/bin/env ruby19
# Author:: Thomas Link (micathom AT gmail com)
# Created:: 2008-02-01.

# The string (in JSON format) is tokenized and pre-validated. Minor
# replacements are made in order to transform the JSON into valid
ruby
# input. The transformed string is then evaluated by ruby, which will
# throw an exception on syntactic errors.
#
# PROBLEMS:
# - The "parser" doesn't per se detect something like {"foo": 1,} or
# [1,2,] since this is valid in ruby. I'm not sure about JSON. Anyway,
I
# included another "invalid" clause in order to catch these cases of
# which I'm not sure how they are handled properly. If you want the
# parser to be more permissive, remove the first "invalid" clause.
#
# REFERENCES:
# http://json.org
# http://www.ietf.org/rfc/rfc4627.txt
class JSONParser

RXE = /
\[|\]|
\{|\}|
(?<name_sep>:)|
(?<invalid>,\s*[}\]])|
,|
(?<string>"([^"\\]++|\\(u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
-?(0|[1-9]\d*+)(\.\d++)?([Ee][+-]?\d++)?(?=\D|$)|
true|
false|
(?<null>null)|
[[:space:][:cntrl:]]++|
(?<invalid>.++)
/xmu

def parse(json)
ruby = json.gsub(RXE) do |t|
m = $~
if m['invalid'] then invalid(m['invalid'])
elsif m['null'] then 'nil'
elsif m['name_sep'] then '=>'
elsif m['string'] then m['string'].gsub(/#/, '\\\\#')
else
t
end
end
begin
return eval(ruby)
rescue Exception => e
invalid(json)
end
end

def invalid(string)
raise RuntimeError, 'Invalid JSON: %s' % string
end

end


if __FILE__ == $0
a = ARGV.join
p a
p JSONParser.new.parse(a)
end
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

cfp2:~ > cat a.rb
require 'rubygems'
require 'json'

The benchmark below has no gem dependencies, generates tests with deep
structures, and has self-checking. It found a bug that the original unit
tested didn't. By default, this benchmark keeps increasing the depth until
the runtime is >= 1s.

Also, I made some wrappers for the json and fjson gems to see how they
perform. Here are the results on my machine:

fjson: 224823 chars/second
json(pure): 220289 chars/second
json(ext): 1522250 chars/second

require 'benchmark'

class RandomJSON
def initialize(value=0)
@number = -1
@string = -1
@boolean = -1
@constant = -1
@value = value-1
end
def number
case (@number=(@number+1)%3)
when 0 : 0
when 1 : 1234
when 2 : 3.75e+1
end
end
def string
case (@string=(@string+1)%3)
when 0 : ""
when 1 : "JSON"
when 2 : "\"\\\/\b\f\r\t"
end
end
def boolean
case (@boolean=(@boolean+1)%3)
when 0 : false
when 1 : true
when 2 : nil
end
end
def constant
case (@constant=(@constant+1)%3)
when 0 : number
when 1 : string
when 2 : boolean
end
end
def array(depth)
a = []
depth.times {
a << value(depth-1)
}
a
end
def object(depth)
o = {}
depth.times {
o[string] = value(depth-1)
}
o
end
def value(depth, v=nil)
case (v||(@value=(@value+1)%3))
when 0 : array(depth)
when 1 : object(depth)
else constant
end
end
end

generator = RandomJSON.new((ARGV[1]||0).to_i)

parser = JSONParser.new
Benchmark.bm { |b|
l = nil; t = nil
13.times { |depth|
tree = generator.value(depth, depth%2)
s = tree.inspect
#puts s
s.gsub!(/=>/, ':')
s.gsub!(/nil/, 'null')
tree2 = nil
#puts s
l = s.length
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
raise if tree2!=tree
break if (t.real>=(ARGV[0]||1).to_f)
}
puts "#{(l/t.real).to_i} chars/second"
}
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]



This doc is missing two things: 1) exactly what is allowed for the top-level
json, and 2) where can whitespace appear. This is more complete:

http://www.ietf.org/rfc/rfc4627.txt

def test_keyword_parsing
assert_equal(true, @parser.parse("true"))
assert_equal(false, @parser.parse("false"))
assert_equal(nil, @parser.parse("null"))
end

def test_number_parsing
assert_equal(42, @parser.parse("42"))
...

end

def test_string_parsing
assert_equal(String.new, @parser.parse(%Q{""}))
...

end

The above isn't legal JSON. An array or object (hash) should be at the
top-level. Surround these by brackets and it will be legal.
 
T

tho_mica_l

Here is a patch to make my solution check the return value for
conformance with the rfc:

--- quiz155ir0.rb 2008-02-03 18:07:19.186889600 +0100
+++ quiz155i.rb 2008-02-03 18:08:58.008988800 +0100
@@ -48,10 +48,16 @@
end
end
begin
- return eval(ruby)
+ value = eval(ruby)
+ case value
+ when Array
+ return value
+ when Hash
+ return value if value.all? {|k, v| k.kind_of?
(String)}
+ end
rescue Exception => e
- invalid(json)
end
+ invalid(json)
end

def invalid(string)
 
P

Pawe³ Radecki

And here's my solution. It passes all given tests but still there are
some tricky cases not handled well (see: code comments). I used
regular expressions extensively but it may not be the best idea in
terms of performance.

Thanks for the quiz!!

#!/usr/bin/env ruby

# Solution to Ruby Quiz #155 (see http://www.rubyquiz.com/quiz155.html)
# by Pawe³ Radecki ([email protected]).

$KCODE='UTF-8'
require 'jcode'

class JSONParser

def parse(input)
case input
# TODO: in every case we need to check if pattern matches the
input thoroughly and nothing is left;
# ex. "[3, 5] Pablos" still not handled well, it passes through
instead of giving exception

when '' : raise RuntimeError

# TODO: There needs to be some smart way of choosing whether we
found an object or an array;
# now object has priority and it may be found instead of an
array

#object
when /\{(".+"):(.+)\s*(,\s*(".+"):(.+))+\}/ then
h = Hash.new
$&[1...-1].split(/(.*:.*)?\s*,\s*(.*:.*)?/).each do |e|
a = e.split(/:\s*(\{.*\}\s*)?/);
h[parse(a.first)] = parse(a.last) unless (a.first.nil? &&
a.last.nil?)
end
h
when /\{\s*(".+")\s*:\s*(.+)\s*\}/ then { parse($1) => parse($2) }
when /\{\s*\}/ : Hash.new

#array
when /\[.+\]/ then $&[1...-1].split(/(\[.*\])?\s*,\s*(\[.*
\])?/).collect{|e| parse(e)}
when /\[\s*\]/ then []

#constants
when /true/ then
if ($`.strip.empty? && $'.strip.empty?) then true else raise
RuntimeError end
when /false/ then
if ($`.strip.empty? && $'.strip.empty?) then false else raise
RuntimeError end
when /null/ then nil
if ($`.strip.empty? && $'.strip.empty?) then nil else raise
RuntimeError end

#string
when /"([A-Za-z]|(\s)|(\\")|(\\\\)|(\\\/)|(\\b)|(\\f)|(\\n)|(\\r)|
(\\t)|(\\u[0-9a-fA-F]{4,4}))+"/ : $&[1...-1].gsub(/\\"/, '"').gsub(/\
\n/, "\n").gsub(/\\u([0-9a-fA-F]{4,4})/u){["#$1".hex ].pack('U*')}
when /""/ then ""

#number
when /-?(0|([1-9][0-9]*))(\.[0-9]+)?([e|E][+|-]?[0-9]+)?/ then
if ($`.strip.empty? && $'.strip.empty?) then eval($&) else raise
RuntimeError end
else
raise RuntimeError
end
end
end

#puts JSONParser.new.parse(ARGV.first)
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

Hey guys

This is my first parser. I used Nathan Sobo's Treetop parsing library (
http://treetop.rubyforge.org/, gem install treetop):

http://pastie.caboo.se/146906

require 'treetop'


File.open("json.treetop", "w") {|f| f.write GRAMMAR }


Treetop.load "json"

parser = JsonParser.new


pp parser.parse(STDIN.read).value if $0 == __FILE__



BEGIN {

GRAMMAR = %q{

grammar Json

...


Steve,

I tried this parser, but couldn't get it to work with the String interface
of the quize testbench. Using an IO/stream is how a parser should work, but
you should also be able to wrap a String in a StringIO. It complained about
StringIO#index not being defined, but IO#index isn't defined either.
Couldn't find anyplace where IO#index was monkey-patched.

Eric
 
C

Clifford Heath

Eric said:
I tried this parser, but couldn't get it to work with the String interface
of the quize testbench. Using an IO/stream is how a parser should work,

Not a packrat parser. If you have memory to memoize each grammar rule
attempted at each byte-offset of the input, you have memory to have all
the input in memory at once, and it's more efficient to do so.

Obviously a simple adapter can fix that, but steve's doesn't, and
neither does mine (emailed to JEG before the 24 hours - I'll post
it soon). The adapter is needed because the parse() method expected
is different from the one that Treetop's parsers provide.

Clifford Heath.
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

Not a packrat parser. If you have memory to memoize each grammar rule
attempted at each byte-offset of the input, you have memory to have all
the input in memory at once, and it's more efficient to do so.

Obviously a simple adapter can fix that, but steve's doesn't, and
neither does mine (emailed to JEG before the 24 hours - I'll post
it soon). The adapter is needed because the parse() method expected
is different from the one that Treetop's parsers provide.

I was expecting this to work as a proxy to the Treetop parser:

class JSONParser
def initialize
@parser = JsonParser.new
end
def parse(s)
@parser.parse(StringIO.new(s)).value
end
end
 
C

Clifford Heath

Eric said:
I was expecting this to work as a proxy to the Treetop parser:

class JSONParser
def initialize
@parser = JsonParser.new
end
def parse(s)
@parser.parse(StringIO.new(s)).value
end
end

You need to pass the string directly, not as a StringIO.
Treetop parses a String, not an IO.
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

In honor of that, this week's Ruby Quiz is to write a parser for JSON.


Here is another solution that uses my (somewhat) optimized 'grammar' package
directly. The JSON grammar is similar to before. The next release will use
an API closer to the simple Grammar0 package.

http://pastie.caboo.se/147078

Also, I ported this to my development code (which is not checked in to CVS
yet), to see the performance. I grabbed all of the current submissions
along with the fjson and json gems to see how they compare in terms of
performance. I used the previous benchmark that I posted. It also revealed
bugs in these submissions. Here is the performance I found on my machine
with ruby 1.8.6:

ch/s author/gem
---- ----------
- oksteev (TreeTop, couldn't get it to parse a string)
- Pawel Radecki (RE, mismatch)
- Justin Ethier (RE lexer + parser, 71: Invalid number 0)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
54586 Eric Mahurin (Grammar, no lexer, v0.5)
166041 Thomas Link (RE, ruby 1.9 results)
220289 json
223486 Eric Mahurin (Grammar, no lexer, unreleased)
224823 fjson (uses C extensions)
553081 Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250 json (w/ C extensions)

Note that the Grammar variants don't have the advantage of using RegExp
where you get some C performance. But, in my dev code, I'm using ruby2cext
to get a little more performance. You could integrate a RegExp lexer with a
Grammar parser, also.
 
C

Clifford Heath

Here's mine, done with Treetop. It also includes a
Readline-based interpretive checker. The generated
parser from Treetop has a slightly different interface,
so I've included JEG's test program with an adapter at
the top.

The nicest thing about using Treetop is how close the
grammar to the JSON spec :).

I prefer to convert hash keys to symbols, but the test
cases don't allow that so I stripped out my .to_sym's.

Note that the test cases are rather limited in things
like white-space handling (and in fact the JSON spec is
actually incorrect, in that it doesn't define which rules
constitute tokens that may be separated by whitespace!)
Whitespace in Treetop must be handled explicitly, and it's
easy to miss a spot where it should be skipped, so the
tests should cover that.

I welched on full Unicode support as commented in my code,
but there's another test case you should apply, to parse
the string "\\u1234", which should throw an exception.
You'll see that my code is missing that exception, and
will misbehave instead :).

It wasn't clear from the quiz or the JSON spec whether an
integer is valid JSON. I elected to accept any value, not
just an object or array.

Treetop now uses Polyglot, which loads the generated .rb
file if you've generated it, or the .treetop file if not.

Clifford Heath.

First, the interactive test program:

require 'treetop'
require 'json' # Note that we can require the Treetop file directly.
require 'readline'

parser = JsonParser.new
while line = Readline::readline("? ", [])
begin
tree = parser.parse(line)
if tree
p tree.obj
else
puts parser.failure_reason
end
rescue => e
puts e
p e.backtrace
p tree if tree
end
end
puts

Now, my test adapter:

class JSONParser
def parse(text)
parser = JsonParser.new
p = parser.parse(text)
raise parser.failure_reason unless p
p.obj
end
end

Finally, the grammar itself:

# Treetop grammar for JSON for Ruby Quiz #155 by Clifford Heath.
grammar Json
rule json
value
end

rule object
'{' s pairs:pairs? s '}' s
{ def obj
pairs.empty? ? {} : pairs.obj
end
}
end

rule pairs
member rest:(s ',' s member)*
{ def obj
rest.elements.inject({eval(member.k.text_value) => member.value.obj}) { |h, e|
h[eval(e.member.k.text_value)] = e.member.value.obj
h
}
end
}
end

rule member # key/value pair of an object
k:string s ':' s value
end

rule array
'[' s e:elements? s ']'
{ def obj
e.empty? ? [] : e.obj
end
}
end

rule elements # elements of an array
value rest:(s ',' s value)*
{ def obj
rest.elements.inject([value.obj]) { |a, e|
a << e.value.obj
}
end
}
end

rule value
s alt:(string / number / object / array
/ 'true' { def obj; true; end }
/ 'false' { def obj; false; end }
/ 'null' { def obj; nil; end }
)
{ def obj; alt.obj; end }
end

rule string
'"' char* '"' { def obj
eval(
# Strip Unicode characters down to the chr equivalent.
# Note that I'm cheating here: '"\\u4321"' should assert,
# and there are cases that will succeed but corrupt the data.
# This should be handled in the "char" rule.
text_value.gsub(/\\u..../) { |unicode|
eval("0x"+unicode[2..-1]).chr
}
)
end
}
end

rule char
'\\' [\"\\\/bfnrt]
/ '\\u' hex hex hex hex
/ (![\\"] .)
end

rule hex
[0-9A-Fa-f]
end

rule number
int frac? exp? { def obj; eval(text_value); end }
end

rule int # Any integer
'-'? ([1-9] [0-9]* / '0')
{ def obj; eval(text_value); end }
end

rule frac # The fractional part of a floating-point number
'.' [0-9]+
end

rule exp # An exponent
[eE] [-+]? [0-9]+
end

rule s # Any amount of whtespace
[ \t\n\t]*
end

end
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

You need to pass the string directly, not as a StringIO.
Treetop parses a String, not an IO.


I guess my eyes missed the ".read" after the STDIN . The "STDIN" jumped out
and I thought it took an IO-like object. I was able to just remove a couple
lines of Steve's code and rename the class to get it to work with the tests
(unit and random performance benchmark). It mismatches in both the unit
tests and the performance benchmark, but I if I remove the self-checking I
was able to run the performance benchmark. It was only about 6800
chars/second which is near the bottom of the heap. Not sure how valid that
is since the parser has problems though.

Eric
 
E

Eric I.

Here is my solution, which, like steve's, uses Treetop. This is my
first attempt at using Treetop, so I'm not certain how good the
grammar file that I created is. steve's grammar file is smaller than
mine is.

When I benchmarked my code, I'm getting a parsing performance of
around 10,000 chars/second, much lower than many of the other
solutions.

The two components of the solution are below -- a small Ruby class
(that acts as an interface b/w James' unit testing and what Treetop
produces) and a Treetop grammar file.

Eric

====

Here's the small Ruby class:

# A solution to RubyQuiz #155.
#
# Takes a JSON string and parses it into an equivalent Ruby value.
#
# See http://www.rubyquiz.com/quiz155.html for details.
#
# The latest version of this solution can also be found at
# http://learnruby.com/examples/ruby-quiz-155.shtml .

require 'rubygems'
require 'treetop'

Treetop.load 'json'

class JSONParser
def initialize
@parser = JSONHelperParser.new
end

def parse(input)
result = @parser.parse(input)
raise "could not parse" if result.nil?
result.resolve
end
end

====

And here's the Treetop grammar:

# Treetop grammar for JSON. Note: it doesn't handle Unicode very
well.

grammar JSONHelper
rule value
spaces v:(simple_literal / string / number / array / object)
spaces {
def resolve
v.resolve
end
}
end

rule spaces
[ \t\n\r]*
end

# SIMPLE LITERALS

rule simple_literal
("true" / "false" / "null") {
def resolve
case text_value
when "true" : true
when "false" : false
when "null" : nil
end
end
}
end

# NUMBERS

rule number
integer fractional:fractional? exponent:exponent? {
def resolve
if fractional.text_value.empty? && exponent.text_value.empty?
integer.text_value.to_i
else
text_value.to_f
end
end
}
end

rule integer
"-"? [1-9] digits
/
# single digit
"-"? [0-9]
end

rule fractional
"." digits
end

rule exponent
[eE] [-+]? digits
end

rule digits
[0-9]*
end

# STRINGS

rule string
"\"\"" {
def resolve
""
end
}
/
"\"" characters:character* "\"" {
def resolve
characters.elements.map { |c| c.resolve }.join
end
}
end

rule character
# regular characters
(!"\"" !"\\" .)+ {
def resolve
text_value
end
}
/
# escaped: \\, \", and \/
"\\" char:("\"" / "\\" / "/") {
def resolve
char.text_value
end
}
/
# escaped: \b, \f, \n, \r, and \t
"\\" char:[bfnrt] {
def resolve
case char.text_value
when 'b' : "\b"
when 'f' : "\f"
when 'n' : "\n"
when 'r' : "\r"
when 't' : "\t"
end
end
}
/
# for Unicode that overlay ASCII values, we just use the ASCII
value
"\\u00" digits:(hex_digit hex_digit) {
def resolve
str = " "
str[0] = digits.text_value.hex
str
end
}
/
# for all other Unicode values use the null character
"\\u" digits1:(hex_digit hex_digit) digits2:(hex_digit hex_digit)
{
def resolve
str = " "
str[0] = digits1.textvalue.hex
str[1] = digits2.textvalue.hex
str
end
}
end

rule hex_digit
[0-9a-fA-F]
end

# ARRAYS

rule array
"[" spaces "]" {
def resolve
Array.new
end
}
/
"[" spaces value_list spaces "]" {
def resolve
value_list.resolve
end
}
end

rule value_list
value !(spaces ",") {
def resolve
[ value.resolve ]
end
}
/
value spaces "," spaces value_list {
def resolve
value_list.resolve.unshift(value.resolve)
end
}
end

# OBJECTS

rule object
"{" spaces "}" {
def resolve
Hash.new
end
}
/
"{" spaces pair_list spaces "}" {
def resolve
pair_list.resolve
end
}
end

rule pair_list
pair !(spaces ",") {
def resolve
{ pair.resolve[0] => pair.resolve[1] }
end
}
/
pair spaces "," spaces pair_list {
def resolve
hash = pair_list.resolve
hash[pair.resolve[0]] = pair.resolve[1]
hash
end
}
end

rule pair
string spaces ":" spaces value {
def resolve
[ string.resolve, value.resolve ]
end
}
end
end
 
C

Clifford Heath

Eric said:
It was only about 6800
chars/second which is near the bottom of the heap. Not sure how valid that
is since the parser has problems though.

I'm not surprised. Treetop has been designed to be super-clean and pure,
not fast. It'll get faster. Myself, I think it should have real Regex's at
the leaves, and that'd make a *heap* of difference.

Also, MRI is glacially slow at creating objects, which Treetop does a lot
of (in this case, for every char in a string for example). If it used Structs,
or if it used Rubinius, things would be a lot quicker.

One other addition we're considering is to add skip rules, which don't build
nodes, for things like white-space and punctuation, which would basically
double the speed.

Consider also that the backtracking allows you to handle grammars that are
simply impossible with traditional techniques... like I'm doing with CQL
in ActiveFacts.

Clifford Heath.
 
S

steve

[Note: parts of this message were removed to make it a legal post.]
I guess my eyes missed the ".read" after the STDIN . The "STDIN" jumped
out
and I thought it took an IO-like object. I was able to just remove a
couple
lines of Steve's code and rename the class to get it to work with the
tests
(unit and random performance benchmark). It mismatches in both the unit
tests and the performance benchmark, but I if I remove the self-checking I
was able to run the performance benchmark. It was only about 6800
chars/second which is near the bottom of the heap. Not sure how valid
that
is since the parser has problems though.

Hey Eric,

What do you mean it mismatches? Also what do you mean by self-checking?
It passes all the unit tests for me.

I just found out why you probably had problems with my solution on the
benchmark. You need to call .value on the object returned by the parser to
get the actual parsed json value.

t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
-->
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s).value }

- steve
 
S

steve

[Note: parts of this message were removed to make it a legal post.]

Here are some extra unit tests i wrote. Let me know if you think any of
them are incorrect (per the spec):

def test_more_numbers
assert_equal(5, @parser.parse("5"))
assert_equal(-5, @parser.parse("-5"))
assert_equal 45.33, @parser.parse("45.33")
assert_equal 0.33, @parser.parse("0.33")
assert_equal 0.0, @parser.parse("0.0")
assert_equal 0, @parser.parse("0")
assert_raises(RuntimeError) { @parser.parse("-5.-4") }
assert_raises(RuntimeError) { @parser.parse("01234") }
assert_equal(0.2e1, @parser.parse("0.2E1"))
assert_equal(42e10, @parser.parse("42E10"))
end

def test_more_string
assert_equal("abc\befg", @parser.parse(%Q{"abc\\befg"}))
assert_equal("abc\nefg", @parser.parse(%Q{"abc\\nefg"}))
assert_equal("abc\refg", @parser.parse(%Q{"abc\\refg"}))
assert_equal("abc\fefg", @parser.parse(%Q{"abc\\fefg"}))
assert_equal("abc\tefg", @parser.parse(%Q{"abc\\tefg"}))
assert_equal("abc\\efg", @parser.parse(%Q{"abc\\\\efg"}))
assert_equal("abc/efg", @parser.parse(%Q{"abc\\/efg"}))
end

def test_more_object_parsing
assert_equal({'a'=>2,'b'=>4}, @parser.parse(%Q{{ "a" : 2 , "b":4 }}))
assert_raises(RuntimeError) { @parser.parse(%Q{{ "a" : 2, }}) }
assert_raises(RuntimeError) { @parser.parse(%Q{[ "a" , 2, ]}) }
end


The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby Talk follow the discussion. Please reply to the original quiz
message,
if you can.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

There has been a lot of talk recently about parsing with Ruby. We're
seeing
some parser generator libraries pop up that make the task that much easier
and
they've been stirring up interest.

In honor of that, this week's Ruby Quiz is to write a parser for JSON.

JSON turns out to turns out to be a great little example for writing
parsers for
two reasons. First, it's pretty easy stuff. You can hand-roll a JSON
parser in
under 100 lines of Ruby. The second advantage is that the data format is
wonderfully documented:

http://json.org/

Since JSON is just a data format and Ruby supports all of the data types,
I vote
we just use Ruby itself as the abstract syntax tree produced by the parse.

Feel free to show off your favorite parser generator, if you don't want to
roll
your own. Anything goes.

Here are a few tests to get you started:

require "test/unit"

class TestJSONParser < Test::Unit::TestCase
def setup
@parser = JSONParser.new
end

def test_keyword_parsing
assert_equal(true, @parser.parse("true"))
assert_equal(false, @parser.parse("false"))
assert_equal(nil, @parser.parse("null"))
end

def test_number_parsing
assert_equal(42, @parser.parse("42"))
assert_equal(-13, @parser.parse("-13"))
assert_equal(3.1415, @parser.parse("3.1415"))
assert_equal(-0.01, @parser.parse("-0.01"))

assert_equal(0.2e1, @parser.parse("0.2e1"))
assert_equal(0.2e+1, @parser.parse("0.2e+1"))
assert_equal(0.2e-1, @parser.parse("0.2e-1"))
assert_equal(0.2E1, @parser.parse("0.2e1"))
end

def test_string_parsing
assert_equal(String.new, @parser.parse(%Q{""}))
assert_equal("JSON", @parser.parse(%Q{"JSON"}))

assert_equal( %Q{nested "quotes"},
@parser.parse('"nested \"quotes\""') )
assert_equal("\n", @parser.parse(%Q{"\\n"}))
assert_equal( "a",
@parser.parse(%Q{"\\u#{"%04X" % ?a}"}) )
end

def test_array_parsing
assert_equal(Array.new, @parser.parse(%Q{[]}))
assert_equal( ["JSON", 3.1415, true],
@parser.parse(%Q{["JSON", 3.1415, true]}) )
assert_equal([1, [2, [3]]], @parser.parse(%Q{[1, [2, [3]]]}))
end

def test_object_parsing
assert_equal(Hash.new, @parser.parse(%Q{{}}))
assert_equal( {"JSON" => 3.1415, "data" => true},
@parser.parse(%Q{{"JSON": 3.1415, "data": true}})
)
assert_equal( { "Array" => [1, 2, 3],
"Object" => {"nested" => "objects"} },
@parser.parse(<<-END_OBJECT) )
{"Array": [1, 2, 3], "Object": {"nested": "objects"}}
END_OBJECT
end

def test_parse_errors
assert_raise(RuntimeError) { @parser.parse("{") }
assert_raise(RuntimeError) { @parser.parse(%q{{"key": true
false}}) }

assert_raise(RuntimeError) { @parser.parse("[") }
assert_raise(RuntimeError) { @parser.parse("[1,,2]") }

assert_raise(RuntimeError) { @parser.parse(%Q{"}) }
assert_raise(RuntimeError) { @parser.parse(%Q{"\\i"}) }

assert_raise(RuntimeError) { @parser.parse("$1,000") }
assert_raise(RuntimeError) { @parser.parse("1_000") }
assert_raise(RuntimeError) { @parser.parse("1K") }

assert_raise(RuntimeError) { @parser.parse("unknown") }
end
end
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

Hey Eric,

What do you mean it mismatches? Also what do you mean by self-checking?
It passes all the unit tests for me.

I just found out why you probably had problems with my solution on the
benchmark. You need to call .value on the object returned by the parser
to
get the actual parsed json value.

t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
-->
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s).value }

- steve


Thanks Steve,

That was it. I didn't make a wrapper class like I did for the other Treetop
parsers. Your parser works without problems now.

Also, I grabbed Eric I's and Clifford Heath's Treetop parsers and
benchmarked them. Here's what I found:

ch/s author/gem
---- ----------
- Pawel Radecki (RE, mismatch)
3214 Justin Ethier (RE lexer + parser, fixed number parsing)
4054 Eric Mahurin (Grammar0, no lexer, no parser generation)
4078 Eric I (Treetop, unicode broken)
6534 oksteev (Treetop, mismatches in benchmark)
8313 Clifford Heath (Treetop, had to remove handling of "\/")
54586 Eric Mahurin (Grammar, no lexer, v0.5)
166041 Thomas Link (RE, ruby 1.9 results)
220289 json
223486 Eric Mahurin (Grammar, no lexer, unreleased)
224823 fjson (uses C extensions)
553081 Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250 json (w/ C extensions)
 
E

Eric Mahurin

[Note: parts of this message were removed to make it a legal post.]

I'm not surprised. Treetop has been designed to be super-clean and pure,
not fast. It'll get faster. Myself, I think it should have real Regex's at
the leaves, and that'd make a *heap* of difference.


I'm biased, but I don't think Treetop is super-clean and pure. Take a look
at my rubyforge grammar package if you want something really clean and pure:

* specify language grammars in a BNF-like ruby DSL (no new grammar language
to deal with!)
- integration of lexer/parser with other code is seamless since everything
is ruby
- (sub)grammars are objects and you combine them using operators and
methods
* complete API can be described in only a little more than a hundred lines
of code (but the implementation is a lot more since it has lots of
optimizations)
* heavily duck-typed such that a lexer or parser (or preprocessor, or parser
w/o lexer) can specified by the same DSL
* infinite backtracking (not by default because of performance and
functionality, you specify what needs it)
* can make any kind of pipeline of lexers/parsers and multi-thread them.
* even with its pure ruby/no-regexp lexers/parsers, it can go head-to-head
against Regexp-based stuff. The flexibility of Regexp is limited, so I
don't see the point since 'grammar' gets enough performance.
* don't need to read the file into a string to parse it
* and more

Also, MRI is glacially slow at creating objects, which Treetop does a lot
of (in this case, for every char in a string for example). If it used
Structs,
or if it used Rubinius, things would be a lot quicker.

One other addition we're considering is to add skip rules, which don't
build
nodes, for things like white-space and punctuation, which would basically
double the speed.


With 'grammar' you have complete control of the parsing result. Without
actions, the output stream is a copy of the input stream. You can group
things and discard things efficiently. One extreme would be to interpret
the input on the fly and not build anything. I have an example tcl
interpreter done in a couple hundred lines of code using 'grammar', that
produces no AST - it just interprets while parsing.
 
C

Clifford Heath

Eric said:
I'm biased,

First up, it sounds like your package is a significant achievement,
and I don't want anything here to imply I'm not commending you on
it. But different approaches suit different goals...
but I don't think Treetop is super-clean and pure. Take a look
at my rubyforge grammar package if you want something really clean and pure:

:) To me, to use a separate grammar is *more pure* than to use
a Ruby DSL. I can see how you would argue the opposite however...
Your example parsers are much harder to read than Treetop however,
compare:
space = Grammar::Element[Set[?\ ,?\t].duck!:)==,:include?)].discard
with
skip space
[ \t]
end

(ok, so TT doesn't have a skip keyword yet, soon...)
That's what I mean by clean and pure anyhow. Not pure Ruby,
but pure grammar.

Nathan's concept is that "grammar" should become a Ruby keyword,
and should introduce the grammar meta-grammar, rather than using
existing Ruby syntax. I think that polyglot approach is much better
than the DSL approach, since the Ruby grammar isn't well-suited to
many important languages.

The point is, if/when Ruby uses a parser whose grammar is composable
with a DSL grammar you define, the result is truly seamless and
the syntax is unrestrained. Existing DSL syntax imposes too many
restraints. A language with a composable grammar would be truly
excellent!
* specify language grammars in a BNF-like ruby DSL (no new grammar language
to deal with!)

I think I've argued why that is a bad thing :), though it definitely
has good things about it also - like your filters etc...

Ruby DSL is far too awkward to express CQL - which is my primary
project at present.
- integration of lexer/parser with other code is seamless since everything
is ruby

Nice, but it won't help me when I need to target C# or Java.
Most important grammars need to be compilable to more than
one target language. Treetop's not there yet, but it's easy
to see how to do it, whereas it won't ever happen with yours
AFAICS.
- (sub)grammars are objects and you combine them using operators and
methods

Can't get much easier than just "include otherGrammar", which
works in Treetop (mix the generated modules together).
* complete API can be described in only a little more than a hundred lines

Similarly to Treetop, AFAICS.
* infinite backtracking (not by default because of performance and
functionality, you specify what needs it)

Does this memoize, or give exponential performance? Or is it optional?
* can make any kind of pipeline of lexers/parsers and multi-thread them.
* even with its pure ruby/no-regexp lexers/parsers, it can go head-to-head
against Regexp-based stuff. The flexibility of Regexp is limited, so I
don't see the point since 'grammar' gets enough performance.
* don't need to read the file into a string to parse it
* and more

Very cool stuff.
With 'grammar' you have complete control of the parsing result. Without
actions, the output stream is a copy of the input stream. You can group
things and discard things efficiently. One extreme would be to interpret
the input on the fly and not build anything.

Yes, this is excellent, and something I've always wanted from a parser
generator. You should add your json parsers to the examples in the gem!

Clifford Heath.
 
J

Joel VanderWerf

Clifford said:
Your example parsers are much harder to read than Treetop however,
compare:
space = Grammar::Element[Set[?\ ,?\t].duck!:)==,:include?)].discard

Surely you can sugar that up a bit, something like

space = Grammar.charset(" \t").discard
with
skip space
[ \t]
end
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,962
Messages
2,570,134
Members
46,692
Latest member
JenniferTi

Latest Threads

Top