Faster CSV parsing

William James

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# Csv.fs=";"
#
# After a record has been read and parsed,
# Csv.string contains the record in raw string format.
#


class Csv

  def Csv.unescape( array )
    array.map{|x| x.gsub( /""/, '"' ) }
  end

  @@fs = ","

  # Set regexp for parse.
  # @@fs is the field-separator, which must be
  # a single character.
  def Csv.make_regexp
    fs = @@fs
    if "^" == fs
      fs = "\\^"
    end

    @@regexp =
      ## Assumes embedded quotes are escaped as "".
      %r{
        \G      ## Anchor at end of previous match.
        [ \t]*  ## Leading spaces or tabs are discarded.
        (?:
          ## For a field in quotes.
          " ( [^"]* (?: "" [^"]* )* ) " |
          ## For a field not in quotes.
          ( [^"\n#{fs}]*? )
        )
        [ \t]*
        [#{fs}]
      }mx

    ## When get_rec finds after reading a line that the record isn't
    ## complete, this regexp will be used to decide whether to read
    ## another line or to raise an exception.
    @@reading_regexp =
      %r{
        \A  # Anchor at beginning of string.
        (?:
          [ \t]*
          (?:
            " [^"]* (?: "" [^"]* )* " |
            [^"\n#{fs}]*?
          )
          [ \t]*
          [#{fs}]
        )*
        [ \t]*
        " [^"]* (?: "" [^"]* )*
        \Z  # Anchor at end of string.
      }mx

  end # def make_regexp

  Csv.make_regexp

  def Csv.parse( s )
    ary = (s + @@fs).scan( @@regexp )
    raise "\nBad csv record:\n#{s}\n" if $' != ""
    Csv.unescape( ary.flatten.compact )
  end

  @@string = nil

  def Csv.get_rec( file )
    return nil if file.eof?
    @@string = ""
    begin
      if @@string.size>0
        raise "\nBad record:\n#{@@string}\n" if
          @@string !~ @@reading_regexp
        raise "\nPremature end of csv file." if file.eof?
      end
      @@string += file.gets
    end until @@string.count( '"' ) % 2 == 0
    @@string.chomp!
    Csv.parse( @@string )
  end

  def Csv.string
    @@string
  end

  def Csv.fs=( s )
    raise "\nCsv.fs must be a single character.\n" if s.size != 1
    @@fs = s
    Csv.make_regexp
  end

  def Csv.fs
    @@fs
  end

  def Csv.to_csv( array )
    array.map { |item|
      str = item.to_s
      # Quote the string if it contains the field-separator or
      # a " or a newline or a carriage-return, or if it has leading or
      # trailing whitespace.
      if str.index(@@fs) or /^\s|["\r\n]|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join(@@fs)
  end

end # class Csv
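
The way get_rec decides that a record is complete (keep appending lines until the number of double quotes is even) can be tried in isolation. What follows is only a sketch: read_record is a hypothetical helper, not part of the class above, and it leaves out the @@reading_regexp sanity check.

```ruby
require "stringio"

# Hypothetical helper (not part of the posted class): read one logical
# CSV record from io, joining physical lines while a quoted field is
# still open, i.e. while the number of double quotes seen so far is odd.
def read_record(io)
  return nil if io.eof?
  record = +""
  loop do
    line = io.gets
    raise "Premature end of csv file." if line.nil?
    record << line
    break if record.count('"').even?
  end
  record.chomp
end

io = StringIO.new(%(a,"two\nlines",c\nd,e,f\n))
first  = read_record(io)
second = read_record(io)
p first   # the quoted newline keeps the first two physical lines together
p second
```

Counting quotes is cheap, but it cannot tell a well-formed open field from a stray quote, which is presumably why the posted code also checks the partial record against a second regexp before reading on.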
 
Gavin Kistner

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# Csv.fs=";"
#
# After a record has been read and parsed,
# Csv.string contains the record in raw string format.

Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
what, and by how much? Got any benchmark results to share?
 
William James

Gavin said:
Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
what,

Than Ruby's standard-lib csv.rb.

Gavin said:
and by how much? Got any benchmark results to share?

For my test file of 1,964,211 bytes, it's about 6.4 times as fast.
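
A ratio like this can be measured with Ruby's Benchmark module. The sketch below is not the measurement reported above: the sample data is generated, and simple_parse is a hypothetical, simplified regexp parser standing in for the posted class. Note also that on current Rubies require "csv" loads a different implementation than the csv.rb discussed in this thread, so the numbers will not match.

```ruby
require "benchmark"
require "csv"

# Hypothetical stand-in for a regexp-based parser: handles quoted fields
# with "" escapes, but none of the whitespace rules of the posted class.
FIELD = /\G(?:"((?:[^"]|"")*)"|([^",]*)),/

def simple_parse(line)
  # Append a separator so every field, including the last, ends with one.
  (line + ",").scan(FIELD).map do |quoted, plain|
    quoted ? quoted.gsub('""', '"') : plain
  end
end

# Generated sample data (an assumption; substitute a real file's lines).
lines = Array.new(5_000) { |i| %(#{i},"name ""#{i}""",plain text,3.14) }

# Sanity check before timing: both parsers should agree on the sample data.
raise "parsers disagree" unless lines.all? { |l| simple_parse(l) == CSV.parse_line(l) }

stdlib = Benchmark.realtime { lines.each { |l| CSV.parse_line(l) } }
regexp = Benchmark.realtime { lines.each { |l| simple_parse(l) } }

printf("csv stdlib: %.3fs  regexp: %.3fs  ratio: %.1fx\n", stdlib, regexp, stdlib / regexp)
```

Checking that both parsers return the same fields before timing them guards against comparing a fast parser that is simply doing less work.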
 
Robert Klemme

William James said:
Than Ruby's standard-lib csv.rb.


For my test file of 1,964,211 bytes, it's about 6.4 times as fast.

I didn't look at it too closely, nor did I test it, but your use of class
variables doesn't seem necessary here. I'd prefer to turn them into
instance variables. That makes your code much more robust...
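
A toy illustration of the concern, under the assumption that two parts of a program need different separators. SharedSplitter and OwnSplitter are hypothetical names, and String#split stands in for real CSV parsing.

```ruby
# Class-level state: one separator shared by every caller in the process.
class SharedSplitter
  @@fs = ","

  def self.fs=(s)
    @@fs = s
  end

  def self.split(line)
    line.split(@@fs)
  end
end

# Instance-level state: each object carries its own separator.
class OwnSplitter
  def initialize(fs = ",")
    @fs = fs
  end

  def split(line)
    line.split(@fs)
  end
end

SharedSplitter.fs = ";"                 # set for a semicolon-separated file...
shared = SharedSplitter.split("a,b,c")  # ...and comma-separated input elsewhere is affected

commas     = OwnSplitter.new
semicolons = OwnSplitter.new(";")

p shared                      # => ["a,b,c"]
p commas.split("a,b,c")       # => ["a", "b", "c"]
p semicolons.split("a;b;c")   # => ["a", "b", "c"]
```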

Kind regards

robert
 
gabriele renzi

William James wrote:
Than Ruby's standard-lib csv.rb.




For my test file of 1,964,211 bytes, it's about 6.4 times as fast.

What is this missing compared with the standard csv.rb?

Also, why did you choose to put everything (methods, variables)
at class level instead of instance level?
 
Stefan Lang

William James wrote: [...]
For my test file of 1,964,211 bytes, it's about 6.4 times as
fast.

what things is this missing wrt standard csv.rb?

Also, why you did choose to make all of the stuff (methods,
variables) at class level instead of instance ?

A more OO (and equally fast) version:

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# csv.fs=";" # csv is a FastCsv instance
#
# After a record has been read and parsed,
# csv.string contains the record in raw string format.
#

class FastCsv

  def self.foreach(filename)
    csv = self.new
    open filename do |file|
      while record = csv.get_rec(file)
        yield record
      end
    end
  end

  def initialize(fs = ",")
    self.fs = fs
    @string = nil
  end

  def fs=(s)
    raise "fs must be a single character." if s.size != 1
    @fs = s.dup
    make_regexp
  end

  def fs
    @fs.dup
  end

  def parse(s)
    ary = (s + @fs).scan(@regexp)
    raise "Bad csv record:\n#{s}\n" if $' != ""
    unescape(ary.flatten.compact)
  end

  def get_rec(file)
    return nil if file.eof?
    @string = ""
    begin
      if @string.size > 0
        raise "Bad record:\n#@string\n" if @string !~ @reading_regexp
        raise "Premature end of csv file." if file.eof?
      end
      @string += file.gets
    end until @string.count('"') % 2 == 0
    @string.chomp!
    parse(@string)
  end

  def string
    @string
  end

  def to_csv(array)
    array.map { |item|
      str = item.to_s
      # Quote the string if it contains the field-separator or
      # a " or a newline or a carriage-return, or if it has
      # leading or trailing whitespace.
      if str.index(@fs) or /^\s|["\r\n]|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join(@fs)
  end

  private

  def unescape(array)
    array.map { |x| x.gsub(/""/, '"') }
  end

  # Set regexp for parse.
  # @fs is the field-separator, which must be
  # a single character.
  def make_regexp
    fs = @fs
    if "^" == fs
      fs = "\\^"
    end

    @regexp =
      ## Assumes embedded quotes are escaped as "".
      %r{
        \G      ## Anchor at end of previous match.
        [ \t]*  ## Leading spaces or tabs are discarded.
        (?:
          ## For a field in quotes.
          " ( [^"]* (?: "" [^"]* )* ) " |
          ## For a field not in quotes.
          ( [^"\n#{fs}]*? )
        )
        [ \t]*
        [#{fs}]
      }mx

    ## When get_rec finds after reading a line that the record
    ## isn't complete, this regexp will be used to decide whether
    ## to read another line or to raise an exception.
    @reading_regexp =
      %r{
        \A  # Anchor at beginning of string.
        (?:
          [ \t]*
          (?:
            " [^"]* (?: "" [^"]* )* " |
            [^"\n#{fs}]*?
          )
          [ \t]*
          [#{fs}]
        )*
        [ \t]*
        " [^"]* (?: "" [^"]* )*
        \Z  # Anchor at end of string.
      }mx
  end # make_regexp

end # class FastCsv

if $0 == __FILE__
  FastCsv.foreach("example.csv") { |record| p record }
end
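
One detail of make_regexp: it escapes only "^" before placing the separator inside a character class, so a separator such as "]" or a backslash would likely yield a broken or wrong pattern. A possible alternative, which is not from the posted code, is Regexp.escape. The helper name separator_class below is hypothetical.

```ruby
# Hypothetical helper: build a character class that matches exactly one
# separator character, whatever that character is.
def separator_class(fs)
  raise ArgumentError, "fs must be a single character." if fs.size != 1
  /[#{Regexp.escape(fs)}]/
end

# Try a few separators that are special inside a character class.
[",", ";", "^", "]", "\\", "-", "|", "\t"].each do |fs|
  parts = "left#{fs}right".split(separator_class(fs))
  puts "#{fs.inspect}: #{parts.inspect}"
end
```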
 
William James

Robert said:
I didn't look too closely at it nor did I test it but your use of class
variables seems not necessary here. I'd prefer to transform them into
instance variables. Makes your code much more robust...

Kind regards

robert

If anyone wants to make it more robust, he is free to do so.
I have little need for csv parsing, and I don't want to spend
much more time on this.

The people on Ruby Core who are trying to speed up CSV parsing
could use this as a starting point.
 
James Edward Gray II

William James wrote:
[...]
For my test file of 1,964,211 bytes, it's about 6.4 times as
fast.

what things is this missing wrt standard csv.rb?

Also, why you did choose to make all of the stuff (methods,
variables) at class level instead of instance ?

A more OO (and equally fast) version:

Ara.T.Howard posted a test framework for a bunch of edge cases on
Ruby Core, based on the CSV RFC (http://www.ietf.org/rfc/rfc4180.txt).
Here's that framework modified to work with your library:

Neo:~/Desktop$ cat fast_csv_tests.rb
require 'pp'
require 'csv'
require 'fast_csv'

tests = [
[
%( a,b ),
["a", "b"]
],
[
%( a,"""b""" ),
["a", "\"b\""]
],
[
%( a,"""b" ),
["a", "\"b"]
],
[
%( a,"b""" ),
["a", "b\""]
],
[
%( a,"
b""" ),
["a", "\nb\""]
],
[
%( a,"""
b" ),
["a", "\"\nb"]
],
[
%( a,"""
b
""" ),
["a", "\"\nb\n\""]
],
[
%( a,"""
b
""",
c ),
["a", "\"\nb\n\"", nil]
],
[
%( a,,, ),
["a", nil, nil, nil]
],
[
%( , ),
[nil, nil]
],
[
%( "","" ),
["", ""]
],
[
%( """" ),
["\""]
],
[
%( """","" ),
["\"",""]
],
[
%( ,"" ),
[nil,""]
],
[
%( \r,"\r" ),
[nil,"\r"]
],
[
%( "\r\n," ),
["\r\n,"]
],
[
%( "\r\n,", ),
["\r\n,", nil]
],
]

impls = CSV, FastCsv.new

tests.each_with_index do |test, idx|
input, expected = test
csv = []
impls.each do |impl|
begin
if impl == CSV
csv = impl::parse_line input.strip
else
csv = impl.parse input.strip
end
raise "FAILED" unless csv == expected
rescue => e
puts "=" * 42
puts "#{ impl }[#{ idx }] => #{ e.message } (#{ e.class })"
puts "=" * 42
puts "input:\n#{ PP::pp input.strip, '' }"
puts "csv:\n#{ PP::pp csv, '' }"
puts "expected:\n#{ PP::pp expected, '' }"
puts "=" * 42
puts
end
end
end

__END__
Neo:~/Desktop$ ruby fast_csv_tests.rb
==========================================
#<FastCsv:0x33d130>[7] => Bad csv record:
a,"""
b
""",
c
(RuntimeError)
==========================================
input:
"a,\"\"\"\nb\n\"\"\",\nc"
csv:
["a", "\"\nb\n\"", nil]
expected:
["a", "\"\nb\n\"", nil]
==========================================

==========================================
#<FastCsv:0x33d130>[8] => FAILED (RuntimeError)
==========================================
input:
"a,,,"
csv:
["a", "", "", ""]
expected:
["a", nil, nil, nil]
==========================================

==========================================
#<FastCsv:0x33d130>[9] => FAILED (RuntimeError)
==========================================
input:
","
csv:
["", ""]
expected:
[nil, nil]
==========================================

==========================================
#<FastCsv:0x33d130>[13] => FAILED (RuntimeError)
==========================================
input:
",\"\""
csv:
["", ""]
expected:
[nil, ""]
==========================================

==========================================
#<FastCsv:0x33d130>[14] => FAILED (RuntimeError)
==========================================
input:
",\"\r\""
csv:
["", "\r"]
expected:
[nil, "\r"]
==========================================

==========================================
#<FastCsv:0x33d130>[16] => FAILED (RuntimeError)
==========================================
input:
"\"\r\n,\","
csv:
["\r\n,", ""]
expected:
["\r\n,", nil]
==========================================

James Edward Gray II
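
Most of the failures above come down to one distinction: the test framework expects an empty unquoted field to be nil and an empty quoted field to be "", while flatten.compact throws that information away. A minimal sketch of keeping it follows, using a simplified pattern (comma only, no whitespace trimming) and a hypothetical method name, parse_keeping_nil. It does not address the first failure (index 7), where a newline appears outside quotes.

```ruby
# Simplified field pattern: group 1 = body of a quoted field,
# group 2 = an unquoted field. Each field must end with a comma.
FIELD_RE = /\G(?:"((?:[^"]|"")*)"|([^",\n]*)),/

# Hypothetical variant of parse: empty unquoted fields become nil,
# quoted fields (even empty ones) stay strings.
def parse_keeping_nil(line)
  (line + ",").scan(FIELD_RE).map do |quoted, plain|
    if quoted
      quoted.gsub('""', '"')
    else
      plain.empty? ? nil : plain
    end
  end
end

p parse_keeping_nil('a,,,')     # => ["a", nil, nil, nil]
p parse_keeping_nil(',""')      # => [nil, ""]
p parse_keeping_nil('"""",""')  # => ["\"", ""]
```

Iterating over the capture pairs instead of flattening them is what preserves the difference: a quoted match leaves group 1 as a string (possibly empty), while an unquoted match leaves it nil.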
 
James Edward Gray II

The people on Ruby Core who are trying to speed up CSV parsing
could use this as a starting point.

My latest offering to Ruby Core has been:

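require 'stringio'

def parse_csv( data )
  io = if data.is_a?(IO) then data else StringIO.new(data) end
  line = ""

  loop do
    line += io.gets
    parse = line.dup
    parse.chomp!

    # Leading commas are empty fields; record them as nils up front.
    csv = if parse.sub!(/\A,+/, "") then [nil] * $&.length else Array.new end
    # Consume one field per match; whatever is left over did not parse.
    parse.gsub!(/\G(?:^|,)(?:"((?>[^"]*)(?>""[^"]*)*)"|([^",]*))/) do
      csv << if $1.nil?
        if $2 == "" then nil else $2 end
      else
        $1.gsub('""', '"')
      end
      ""
    end

    # If text remains, the record continues on the next line.
    break csv if parse.empty?
  end
end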

Which is passing all the edge cases they have thrown at it so far and
is very similar to the speed you achieved.

James Edward Gray II
 
Stefan Lang

William James wrote:
[...]

For my test file of 1,964,211 bytes, it's about 6.4 times as
fast.

what things is this missing wrt standard csv.rb?

Also, why you did choose to make all of the stuff (methods,
variables) at class level instead of instance ?

A more OO (and equally fast) version:

Ara.T.Howard posted a test framework for a bunch of edge cases on
Ruby Core, based on the CSV RFC (http://www.ietf.org/rfc/rfc4180.txt).
Here's that framework modified to work with your library:
...

Doesn't look good :(
I just took William James' code and converted the class
variables/methods to instance variables/methods.

Regards,
Stefan
 
