Faster CSV parsing

William James · Oct 30, 2005

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# Csv.fs=";"
#
# After a record has been read and parsed,
# Csv.string contains the record in raw string format.
#

class Csv

  def Csv.unescape( array )
    array.map{|x| x.gsub( /""/, '"' ) }
  end

  @@fs = ","

  # Set regexp for parse.
  # @@fs is the field-separator, which must be
  # a single character.
  def Csv.make_regexp
    fs = @@fs
    if "^" == fs
      fs = "\\^"
    end

    @@regexp =
      ## Assumes embedded quotes are escaped as "".
      %r{
        \G ## Anchor at end of previous match.
        [ \t]* ## Leading spaces or tabs are discarded.
        (?:
          ## For a field in quotes.
          " ( [^"]* (?: "" [^"]* )* ) " |
          ## For a field not in quotes.
          ( [^"\n#{fs}]*? )
        )
        [ \t]*
        [#{fs}]
      }mx

    ## When get_rec finds after reading a line that the record isn't
    ## complete, this regexp will be used to decide whether to read
    ## another line or to raise an exception.
    @@reading_regexp =
      %r{
        \A # Anchor at beginning of string.
        (?:
          [ \t]*
          (?:
            " [^"]* (?: "" [^"]* )* " |
            [^"\n#{fs}]*?
          )
          [ \t]*
          [#{fs}]
        )*
        [ \t]*
        " [^"]* (?: "" [^"]* )*
        \Z # Anchor at end of string.
      }mx

  end # def make_regexp

  Csv.make_regexp

  def Csv.parse( s )
    ary = (s + @@fs).scan( @@regexp )
    raise "\nBad csv record:\n#{s}\n" if $' != ""
    Csv.unescape( ary.flatten.compact )
  end

  @@string = nil

  def Csv.get_rec( file )
    return nil if file.eof?
    @@string = ""
    begin
      if @@string.size>0
        raise "\nBad record:\n#{@@string}\n" if @@string !~ @@reading_regexp
        raise "\nPremature end of csv file." if file.eof?
      end
      @@string += file.gets
    end until @@string.count( '"' ) % 2 == 0
    @@string.chomp!
    Csv.parse( @@string )
  end

  def Csv.string
    @@string
  end

  def Csv.fs=( s )
    raise "\nCsv.fs must be a single character.\n" if s.size != 1
    @@fs = s
    Csv.make_regexp
  end

  def Csv.fs
    @@fs
  end

  def Csv.to_csv( array )
    s = ''
    array.map { |item|
      str = item.to_s
      # Quote the string if it contains the field-separator or
      # a " or a newline or a carriage-return, or if it has leading or
      # trailing whitespace.
      if str.index(@@fs) or /^\s|["\r\n]|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join(@@fs)
  end

end # class Csv
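The quoting rule in Csv.to_csv can be tried in isolation. The following is only a minimal sketch of that rule: quote_field and to_csv_line are hypothetical names that do not appear in the posted class, and the anchors are \A and \z rather than ^ and $, which should behave the same here because any newline already forces quoting.

```ruby
# Hypothetical helper (not part of the posted Csv class): quote one field
# following the rule described in the Csv.to_csv comments.
def quote_field(item, fs = ",")
  str = item.to_s
  # Quote when the field contains the separator, a double quote, a CR or LF,
  # or when it starts or ends with whitespace.
  if str.include?(fs) || str =~ /\A\s|["\r\n]|\s\z/
    '"' + str.gsub('"', '""') + '"'   # embedded quotes are doubled
  else
    str
  end
end

# Hypothetical helper: join the quoted fields into one record.
def to_csv_line(fields, fs = ",")
  fields.map { |f| quote_field(f, fs) }.join(fs)
end

puts to_csv_line(["plain", "has,comma", 'say "hi"', " padded "])
# prints: plain,"has,comma","say ""hi"""," padded "
```

Passing the separator as an argument, instead of reading shared state, keeps the sketch independent of any class.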

Gavin Kistner · Oct 30, 2005

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# Csv.fs=";"
#
# After a record has been read and parsed,
# Csv.string contains the record in raw string format.

Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
what, and by how much? Got any benchmark results to share?

William James · Oct 30, 2005

Gavin said:
Thanks for sharing this. You claim 'faster' CSV parsing. Faster than
what,

Than Ruby's standard-lib csv.rb.

and by how much? Got any benchmark results to share?

For my test file of 1,964,211 bytes, it's about 6.4 times as fast.
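A ratio like this can be measured with a small timing harness. The sketch below is not the benchmark used in the thread: time_it and compare are hypothetical helpers, the input is synthetic, and the two lambdas are placeholders to be replaced by the parsers actually being compared.

```ruby
# Wall-clock seconds taken by one run of the given block.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Time each parser over the same lines; returns { name => seconds }.
def compare(parsers, lines)
  parsers.transform_values do |parser|
    time_it { lines.each { |line| parser.call(line) } }
  end
end

# Synthetic input standing in for a real test file (an assumption).
lines = Array.new(20_000) { |i| "#{i},alpha,beta,gamma" }

# Placeholder parsers; substitute the implementations you want to compare.
parsers = {
  "split" => ->(line) { line.split(",") },
  "scan"  => ->(line) { line.scan(/[^,]+/) },
}

timings = compare(parsers, lines)
baseline = timings["split"]
timings.each do |name, secs|
  printf("%-6s %.4fs  (%.2fx of split)\n", name, secs, secs / baseline)
end
```

Timings vary from run to run, so repeating the measurement a few times and on realistic data gives a more trustworthy ratio.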

Robert Klemme · Oct 30, 2005

William James said:
Than Ruby's standard-lib csv.rb.

For my test file of 1,964,211 bytes, it's about 6.4 times as fast.

I didn't look too closely at it, nor did I test it, but your use of class
variables seems unnecessary here. I'd prefer to transform them into
instance variables. That makes your code much more robust...

Kind regards

robert
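The robustness concern can be illustrated with two toy classes. This is only a sketch: SharedSep and OwnSep are hypothetical and far simpler than the posted parser, but they show how a class variable is shared by every caller while an instance variable is not.

```ruby
# Hypothetical class keeping the separator in a class variable.
class SharedSep
  @@fs = ","

  def self.fs=(s)
    @@fs = s
  end

  def self.split(line)
    line.split(@@fs)
  end
end

# Hypothetical class keeping the separator per instance.
class OwnSep
  def initialize(fs = ",")
    @fs = fs
  end

  def split(line)
    line.split(@fs)
  end
end

# One part of a program switches to semicolons...
SharedSep.fs = ";"
# ...and a caller that still expects commas is silently affected.
p SharedSep.split("a,b")   # prints ["a,b"]

# With instance state, each parser keeps its own separator.
comma = OwnSep.new
semi  = OwnSep.new(";")
p comma.split("a,b")       # prints ["a", "b"]
p semi.split("a;b")        # prints ["a", "b"]
```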

gabriele renzi · Oct 30, 2005

William James wrote:

Than Ruby's standard-lib csv.rb.

For my test file of 1,964,211 bytes, it's about 6.4 times as fast.

What is this missing wrt the standard csv.rb?

Also, why did you choose to make all of the stuff (methods, variables)
class-level instead of instance-level?

Stefan Lang · Oct 30, 2005

William James wrote: [...]

For my test file of 1,964,211 bytes, it's about 6.4 times as
fast.

What is this missing wrt the standard csv.rb?

Also, why did you choose to make all of the stuff (methods,
variables) class-level instead of instance-level?

A more OO (and equally fast) version:

## Read, parse, and create csv records.

# The program conforms to the csv specification at this site:
# http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# The only extra is that you can change the field-separator.
# For a field-separator other than a comma, for example
# a semicolon:
# csv.fs=";" # csv is a FastCsv instance
#
# After a record has been read and parsed,
# csv.string contains the record in raw string format.
#

class FastCsv

  def self.foreach(filename)
    csv = self.new
    open filename do |file|
      while record = csv.get_rec(file)
        yield record
      end
    end
  end

  def initialize(fs = ",")
    self.fs = fs
    @string = nil
  end

  def fs=(s)
    raise "fs must be a single character." if s.size != 1
    @fs = s.dup
    make_regexp
  end

  def fs
    @fs.dup
  end

  def parse(s)
    ary = (s + @fs).scan(@regexp)
    raise "Bad csv record:\n#{s}\n" if $' != ""
    unescape(ary.flatten.compact)
  end

  def get_rec(file)
    return nil if file.eof?
    @string = ""
    begin
      if @string.size > 0
        raise "Bad record:\n#@string\n" if @string !~ @reading_regexp
        raise "Premature end of csv file." if file.eof?
      end
      @string += file.gets
    end until @string.count('"') % 2 == 0
    @string.chomp!
    parse(@string)
  end

  def string
    @string
  end

  def to_csv(array)
    s = ''
    array.map { |item|
      str = item.to_s
      # Quote the string if it contains the field-separator or
      # a " or a newline or a carriage-return, or if it has
      # leading or trailing whitespace.
      if str.index(@fs) or /^\s|["\r\n]|\s$/.match(str)
        str = '"' + str.gsub( /"/, '""' ) + '"'
      end
      str
    }.join(@fs)
  end

  private

  def unescape(array)
    array.map { |x| x.gsub(/""/, '"') }
  end

  # Set regexp for parse.
  # @fs is the field-separator, which must be
  # a single character.
  def make_regexp
    fs = @fs
    if "^" == fs
      fs = "\\^"
    end

    @regexp =
      ## Assumes embedded quotes are escaped as "".
      %r{
        \G ## Anchor at end of previous match.
        [ \t]* ## Leading spaces or tabs are discarded.
        (?:
          ## For a field in quotes.
          " ( [^"]* (?: "" [^"]* )* ) " |
          ## For a field not in quotes.
          ( [^"\n#{fs}]*? )
        )
        [ \t]*
        [#{fs}]
      }mx

    ## When get_rec finds after reading a line that the record
    ## isn't complete, this regexp will be used to decide whether
    ## to read another line or to raise an exception.
    @reading_regexp =
      %r{
        \A # Anchor at beginning of string.
        (?:
          [ \t]*
          (?:
            " [^"]* (?: "" [^"]* )* " |
            [^"\n#{fs}]*?
          )
          [ \t]*
          [#{fs}]
        )*
        [ \t]*
        " [^"]* (?: "" [^"]* )*
        \Z # Anchor at end of string.
      }mx
  end # make_regexp

end # class FastCsv

if $0 == __FILE__
  FastCsv.foreach("example.csv") { |record| p record }
end

William James · Oct 30, 2005

Robert said:
I didn't look too closely at it, nor did I test it, but your use of class
variables seems unnecessary here. I'd prefer to transform them into
instance variables. That makes your code much more robust...

Kind regards

robert

If anyone wants to make it more robust, he is free to do so.
I have little need for csv parsing, and I don't want to spend
much more time on this.

The people on Ruby Core who are trying to speed up CSV parsing
could use this as a starting point.

Stefan Lang · Oct 30, 2005

I'm sending this message a second time. Seems that my client
has messed up the first. Sorry for the noise.

James Edward Gray II · Oct 30, 2005

William James wrote:
[...]

For my test file of 1,964,211 bytes, it's about 6.4 times as
fast.

What is this missing wrt the standard csv.rb?

Also, why did you choose to make all of the stuff (methods,
variables) class-level instead of instance-level?

A more OO (and equally fast) version:

Ara.T.Howard posted a test framework for a bunch of edge cases on
Ruby Core, based on the CSV RFC (http://www.ietf.org/rfc/rfc4180.txt).
Here's that framework modified to work with your library:

Neo:~/Desktop$ cat fast_csv_tests.rb
require 'pp'
require 'csv'
require 'fast_csv'

tests = [
[
%( a,b ),
["a", "b"]
],
[
%( a,"""b""" ),
["a", "\"b\""]
],
[
%( a,"""b" ),
["a", "\"b"]
],
[
%( a,"b""" ),
["a", "b\""]
],
[
%( a,"
b""" ),
["a", "\nb\""]
],
[
%( a,"""
b" ),
["a", "\"\nb"]
],
[
%( a,"""
b
""" ),
["a", "\"\nb\n\""]
],
[
%( a,"""
b
""",
c ),
["a", "\"\nb\n\"", nil]
],
[
%( a,,, ),
["a", nil, nil, nil]
],
[
%( , ),
[nil, nil]
],
[
%( "","" ),
["", ""]
],
[
%( """" ),
["\""]
],
[
%( """","" ),
["\"",""]
],
[
%( ,"" ),
[nil,""]
],
[
%( \r,"\r" ),
[nil,"\r"]
],
[
%( "\r\n," ),
["\r\n,"]
],
[
%( "\r\n,", ),
["\r\n,", nil]
],
]

impls = CSV, FastCsv.new

tests.each_with_index do |test, idx|
input, expected = test
csv = []
impls.each do |impl|
begin
if impl == CSV
csv = impl::parse_line input.strip
else
csv = impl.parse input.strip
end
raise "FAILED" unless csv == expected
rescue => e
puts "=" * 42
puts "#{ impl }[#{ idx }] => #{ e.message } (#{ e.class })"
puts "=" * 42
puts "input:\n#{ PP::pp input.strip, '' }"
puts "csv:\n#{ PP::pp csv, '' }"
puts "expected:\n#{ PP::pp expected, '' }"
puts "=" * 42
puts
end
end
end

__END__
Neo:~/Desktop$ ruby fast_csv_tests.rb
==========================================
#<FastCsv:0x33d130>[7] => Bad csv record:
a,"""
b
""",
c
(RuntimeError)
==========================================
input:
"a,\"\"\"\nb\n\"\"\",\nc"
csv:
["a", "\"\nb\n\"", nil]
expected:
["a", "\"\nb\n\"", nil]
==========================================

==========================================
#<FastCsv:0x33d130>[8] => FAILED (RuntimeError)
==========================================
input:
"a,,,"
csv:
["a", "", "", ""]
expected:
["a", nil, nil, nil]
==========================================

==========================================
#<FastCsv:0x33d130>[9] => FAILED (RuntimeError)
==========================================
input:
","
csv:
["", ""]
expected:
[nil, nil]
==========================================

==========================================
#<FastCsv:0x33d130>[13] => FAILED (RuntimeError)
==========================================
input:
",\"\""
csv:
["", ""]
expected:
[nil, ""]
==========================================

==========================================
#<FastCsv:0x33d130>[14] => FAILED (RuntimeError)
==========================================
input:
",\"\r\""
csv:
["", "\r"]
expected:
[nil, "\r"]
==========================================

==========================================
#<FastCsv:0x33d130>[16] => FAILED (RuntimeError)
==========================================
input:
"\"\r\n,\","
csv:
["\r\n,", ""]
expected:
["\r\n,", nil]
==========================================

James Edward Gray II
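Most of the failures above come down to one distinction: the expected results use nil for an empty unquoted field and "" for an empty quoted one. The sketch below shows one way to keep the two apart. It is a hypothetical single-line parser built on StringScanner, not the code from either library, and it does not handle records that span several lines or whitespace around fields.

```ruby
require 'strscan'

# Hypothetical single-line parser: nil for an empty unquoted field,
# "" for an empty quoted field. Comma is the only separator supported.
def parse_line(line)
  ss = StringScanner.new(line)
  fields = []
  loop do
    if ss.scan(/"((?:[^"]|"")*)"/)
      # Quoted field: undouble embedded quotes; an empty one stays "".
      fields << ss[1].gsub('""', '"')
    else
      # Unquoted field: an empty match becomes nil.
      raw = ss.scan(/[^",]*/)
      fields << (raw.empty? ? nil : raw)
    end
    break if ss.eos?
    # Anything other than a separator after a field is malformed input.
    ss.scan(/,/) or raise ArgumentError, "bad csv record: #{line.inspect}"
  end
  fields
end

p parse_line('a,,,')   # prints ["a", nil, nil, nil]
p parse_line(',""')    # prints [nil, ""]
```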

James Edward Gray II · Oct 30, 2005

The people on Ruby Core who are trying to speed up CSV parsing
could use this as a starting point.

My latest offering to Ruby Core has been:

require 'stringio'

def parse_csv( data )
  io = if data.is_a?(IO) then data else StringIO.new(data) end
  line = ""

  loop do
    line += io.gets
    parse = line.dup
    parse.chomp!

    csv = if parse.sub!(/\A,+/, "") then [nil] * $&.length else Array.new end
    parse.gsub!(/\G(?:^|,)(?:"((?>[^"]*)(?>""[^"]*)*)"|([^",]*))/) do
      csv << if $1.nil?
        if $2 == "" then nil else $2 end
      else
        $1.gsub('""', '"')
      end
      ""
    end

    break csv if parse.empty?
  end
end

Which is passing all the edge cases they have thrown at it so far and
is very similar to the speed you achieved.

James Edward Gray II

Stefan Lang · Oct 30, 2005

William James wrote:
[...]

For my test file of 1,964,211 bytes, it's about 6.4 times as
fast.

What is this missing wrt the standard csv.rb?

Also, why did you choose to make all of the stuff (methods,
variables) class-level instead of instance-level?

A more OO (and equally fast) version:

Ara.T.Howard posted a test framework for a bunch of edge cases on
Ruby Core, based on the CSV RFC (http://www.ietf.org/rfc/rfc4180.txt).
Here's that framework modified to work with your library:

...

Doesn't look good

I just took William James' code and converted the class
variables/methods to instance variables/methods.

Regards,
Stefan
