> Here's a revised version of this parser that passes all of the tests:
>
> #!/usr/bin/env ruby -wKU
Hi James,
I applied these fixes to my optimization of your StringScanner recursive
descent parser and added a few more. I also got it working in ruby1.9;
the result is at the end of this message. I couldn't figure out why
yours wasn't working with 1.9.
Also, I did some more benchmarking. This time I ran 4 passes for each
data set and picked the best bandwidth (ch/s) among the 2 longest-running
data sets. I limited the depth to 10 because only the 2 fastest parsers
need deeper data sets (to take at least 1s per parse), and ruby seems to
do strange things with garbage collection at that point. All of the
fastest parsers are shown with results from the depth=10 data set.
I ran everything I could easily on 1.9 (didn't load any gems for 1.9).
Here are the results with this benchmarking approach, which I think is more reliable:
    1.8      1.9  F  E  gzip  author/gem
-------  -------  -  -  ----  ----------
      -        -  5  0   545  Pawel Radecki (RE, recursive descent)
     NL        -  6  0   796  James Koppel (RE+StringIO, recursive descent)
     NL       NL  5  1   683  Justin Ethier (RE lexer, ruby eval, fixes)
     NL        -  3  2  1074  James Edward Gray II (peggy)
   4662       SF  0  0   608  Eric Mahurin (Grammar0, no code-gen)
   5264        -  6  2   588  ghostwheel (ghostwheel, fixes)
   5764        -  2  0   706  Eric I (Treetop, unicode broken)
   8246        -  2  0   631  Steve (Treetop, mismatches in benchmark)
  11856        -  1  1   545  Clifford Heath (Treetop)
  25841        -  0  0   842  Alexander Stedile (RE, recursive descent)
  57217        -  0  0   723  Eric Mahurin (Grammar, v0.5)
 152205       NL  2  1   660  Paolo Bonzini (RE, recursive descent)
      -   191529  2  1   445  Thomas Link (RE lexer, ruby eval)
 202090        -  0  0   774  James Edward Gray II (RE, recursive descent)
 233646        -  0  0   653  Eric Mahurin (Grammar, unreleased)
 270508   211266  1  0     -  json
 299259        -  6  0     -  fjson (w/ C ext)
 351163   235726  0  0   607  James & Eric M (RE, recursive)
 359665   279539  3  0   405  Thomas & Paolo (RE + eval)
 430251   106285  0  0   837  Eric Mahurin (LL(1), recursive descent)
2079190        -  0  0   653  Eric Mahurin (Grammar, unreleased, ruby2cext)
8534416  3217453  0  0     -  json (w/ C ext)
NL: non-linear, SF: seg-fault
1.8/1.9: best bandwidth (chars/second) under that ruby version
F/E: test failures/errors, gzip: gzipped source size in bytes
Here are some things to note:
* on closer inspection, some of the slower solutions are nonlinear (usually
quadratic runtime). ch/s is not meaningful for these, since it just gets
worse with larger data sets.
* ruby1.9 killed my LL(1) parser performance. The reason for this is that
characters in 1.9 are full objects, whereas in 1.8 they are immediate
Fixnums (see the small snippet after this list). 4X slower is not good. I
need to go ask if we can get some immediate 1-character Strings in ruby 1.9
to solve this. Anything that worked with characters in 1.8 is going to see
the same problem. My Grammar classes generate similar LL(1) parsers and
will be affected in the same way. It works so well in 1.8 because there is
so little object creation (using Regexp's creates more objects).
* Dominick Baton's ruby2cext is giving my dev Grammar class a 10X
performance boost. Woohoo! Only 4X away from the best hand-crafted C (not
bad considering the ruby and then the C were autogenerated). I generate
low-level stuff that optimizes well. High-level Regexp's will have little
benefit from ruby2cext.
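To show the 1.8 vs 1.9 character difference I mean, here is a throwaway
snippet (just an illustration, not part of the benchmark):

s = "abc"
if RUBY_VERSION < "1.9"
  p s[0]   # 1.8: 97, an immediate Fixnum -- no object allocated per character
else
  p s[0]   # 1.9: "a", a brand new 1-character String object every time
end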
Here is the benchmark code I used (along with the previously posted
RandomJSON class):
require "benchmark"

# ARGV[0]: stop once the best parse time for a depth reaches this many
#          seconds (default 1). ARGV[1]: argument passed to RandomJSON.new.
parser    = JSONParser.new
generator = RandomJSON.new((ARGV[1] || 0).to_i)

bandwidth  = 0
bandwidth0 = 0
t0 = 0
l = nil
t = nil
max_depth = 10

(max_depth + 1).times { |depth|
  tree = generator.value(depth, depth % 2)
  s = tree.inspect
  s.gsub!(/=>/, ':')
  s.gsub!(/nil/, 'null')
  tree2 = nil
  l = s.length
  t = nil
  # 4 passes per data set, keeping the best (smallest) real time
  4.times {
    Benchmark.bm { |b|
      GC.start
      t1 = b.report("#{depth} #{l} ") { tree2 = parser.parse(s) }
      GC.start
      raise("#{tree2.inspect}!=#{tree.inspect}") if tree2 != tree
      GC.start
      t = t1 if !t or t1.real < t.real
    }
  }
  bandwidth = l / t.real
  puts "#{bandwidth.to_i} chars/second"
  break if t.real >= (ARGV[0] || 1).to_f or depth >= max_depth
  # remember the bandwidth of the longest-running data set seen so far
  if t.real > t0
    bandwidth0 = bandwidth
    t0 = t.real
  end
}
# report the better bandwidth of the 2 longest-running data sets
bandwidth = bandwidth0 if bandwidth0 > bandwidth
puts "\n#{bandwidth.to_i} chars/second"
Here is another optimization of James' solution:
require "strscan"
class JSONParser
def parse(input)
@input = StringScanner.new(input)
@input.scan(/\s*/)
parse_value(out=[])
@input.eos? or error("Unexpected data")
out[0]
end
private
def parse_value(out)
if @input.scan(/\{\s*/)
object = {}
kv = []
until @input.scan(/\}\s*/)
object.empty? or @input.scan(/,\s*/) or error("Expected ,")
kv.clear
@input.scan(/"/) or error("Expected string")
parse_string(kv)
@input.scan(/:\s*/) or error("Expecting object separator")
parse_value(kv)
object[kv[0]] = kv[1]
end
out << object
elsif @input.scan(/\[\s*/)
array = []
until @input.scan(/\]\s*/)
array.empty? or @input.scan(/,\s*/) or error("Expected ,")
parse_value(array)
end
out << array
elsif @input.scan(/"/)
parse_string(out)
elsif @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\s*/)
out << eval(@input.matched)
elsif @input.scan(/true\s*/)
out << true
elsif @input.scan(/false\s*/)
out << false
elsif @input.scan(/null\s*/)
out << nil
else
error("Illegal JSON value")
end
end
def parse_string(out)
string = ""
while true
if @input.scan(/[^\\"]+/)
string.concat(@input.matched)
elsif @input.scan(%r{\\["\\/]})
string << @input.matched[-1]
elsif @input.scan(/\\[bfnrt]/)
string << eval(%Q{"#{@input.matched}"})
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
string << @input.matched[2..-1].to_i(16)
else
break
end
end
@input.scan(/"\s*/) or error("Unclosed string")
out << string
end
def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end
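If you want to kick the tires without the benchmark, a quick sanity check
along these lines should work (my own throwaway snippet, the sample input
is made up):

parser = JSONParser.new
p parser.parse('{"name": "Ruby Quiz", "id": 155, "tags": ["json", null], "ok": true}')
# => {"name"=>"Ruby Quiz", "id"=>155, "tags"=>["json", nil], "ok"=>true}
#    (hash ordering may differ under 1.8)

begin
  parser.parse('{"broken": }')
rescue RuntimeError => e
  puts e.message   # => Illegal JSON value: }
end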