Why is csv file processing so slow?

mepython · Jan 28, 2005

I want to process a csv file. Here are small programs in Python and Ruby:

[root@taamportable GMS]# cat x.py
import csv
reader = csv.reader(file('x.csv'))
header = reader.next()
count = 0
for data in reader:
    count += 1
print count

[root@taamportable GMS]# cat x.rb
require 'csv'
reader = CSV.open('x.csv', 'r')
header = reader.shift
count = 0
reader.each {|data|
  count += 1
}
p count

*******************************************************
Here are the processing times. As you can see, Ruby is way too slow. Is there
anything I can do about the Ruby code?
*******************************************************
[root@taamportable GMS]# time python x.py
26907

real 0m0.311s
user 0m0.302s
sys 0m0.009s

[root@taamportable GMS]# time ruby x.rb
26907

real 1m48.296s
user 1m36.853s
sys 0m11.188s

Robert Klemme · Jan 28, 2005

First I'd try to figure out whether it's IO that's slow or CSV. Did you test
with something like this:

File.open('x.csv') do |reader|
  count = 0
  reader.each {|data| count += 1}
  p count
end

Does it make a huge difference?

Kind regards

robert

mepython · Jan 28, 2005

It is the csv module: plain file reading in Ruby runs at about half the speed
of Python, so the real slowness is coming from csv.

count = 0
File.open('x.csv') do |reader|
  reader.each {|data| count += 1}
end
p count

[root@taamportable GMS]# time ruby x1.rb
26908

real 0m0.077s
user 0m0.060s
sys 0m0.016s

[root@taamportable GMS]# time python x1.py
26908

real 0m0.041s
user 0m0.032s
sys 0m0.010s

Andrew Johnson · Jan 28, 2005

mepython said:
Here are the processing times. As you can see, Ruby is way too slow. Is there
anything I can do about the Ruby code?

Well, the python library csv.py uses the underlying _csv module which
is written in C ... Ruby's standard-lib csv.rb is all Ruby. I don't
know of any csv extensions for Ruby.

regards,
andrew
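
To see how much of the run time goes to parsing rather than to reading, the two loops can be timed side by side. This is only a minimal sketch: it assumes a current Ruby where `require 'csv'` loads today's csv library (not the 2005 csv.rb discussed in this thread), and it uses a small generated sample instead of x.csv. The names `count_lines`, `count_csv_rows` and `sample` are made up for the illustration.

```ruby
require 'csv'
require 'benchmark'

# Hypothetical sample data: 1000 identical rows, built in memory.
sample = "111,222,333,444,555\n" * 1000

# Count rows by walking over the lines only (no CSV parsing at all).
def count_lines(text)
  count = 0
  text.each_line { |_line| count += 1 }
  count
end

# Count rows with the csv library doing the field parsing.
def count_csv_rows(text)
  count = 0
  CSV.parse(text) { |_row| count += 1 }
  count
end

io_time  = Benchmark.realtime { count_lines(sample) }
csv_time = Benchmark.realtime { count_csv_rows(sample) }
puts "plain lines: #{io_time}s, csv: #{csv_time}s"
```

The absolute numbers depend on the machine and the Ruby version; only the ratio between the two timings is of interest.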

mepython · Jan 28, 2005

Thanks, andrew. I should have looked into the module before posting.

Robert Klemme · Jan 28, 2005

As a simple CSV replacement you could try this:

count = 0
File.open('x.csv') do |reader|
  reader.each {|line|
    count += 1
    data = line.split(/,/)
  }
end
p count

Depending on your data that might or might not be sufficient. Regexps can
be arbitrarily sophisticated. Here's another one:

data = []
line.scan( %r{
  "((?:[^\\"]|\\")*)" |
  '((?:[^\\']|\\')*)' |
  ([^,]+)
}x ){|m| data << m.find {|x|x}}

robert
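
Whether the plain split is sufficient comes down to quoted fields. The sketch below runs both approaches on one line with a quoted comma. `FIELD_RE` and `scan_fields` are names invented here to wrap the regexp from the post above; nothing else is added.

```ruby
# A quoted field that contains the separator is cut in two by a plain split.
line = 'a,b,"foo, bar",c'

naive = line.split(',')
# naive has 5 elements: the quoted field is split at its inner comma.

# The regexp from the post above, scanning field by field instead.
FIELD_RE = %r{
  "((?:[^\\"]|\\")*)" |
  '((?:[^\\']|\\')*)' |
  ([^,]+)
}x

def scan_fields(line)
  data = []
  # Each match yields three captures; exactly one of them is non-nil.
  line.scan(FIELD_RE) { |m| data << m.find { |x| x } }
  data
end

p naive
p scan_fields(line)
```

The scan version keeps `foo, bar` together as one field and strips the surrounding quotes.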

William James · Jan 28, 2005

Robert said:
Depending on your data that might or might not be sufficient. Regexps can
be arbitrarily sophisticated. Here's another one:

data = []
line.scan( %r{
"((?:[^\\"]|\\")*)" |
'((?:[^\\']|\\')*)' |
([^,]+)
}x ){|m| data << m.find {|x|x}}

I borrowed your regexp.

% class String
%   def parse_csv
%     a = self.scan(
%       %r{ "( (?: [^\\"] | \\")* )" |
%           '( (?: [^\\'] | \\')* )' |
%           ( [^,]+ )
%         }x ).flatten
%     a.delete(nil)
%     a
%   end
% end
%
% ARGF.each_line { | line |
%   p line.chomp.parse_csv
% }

With this input

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b

the output is

["a", "b", "foo, bar", "c"]
["foo isn't \\\"bar\\\"", "a", "b"]
["a", "\"just,my,luck\"", "b"]

William James · Jan 29, 2005

To test the method parse_csv, I created a 1 megabyte file consisting of
4228 copies of

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9

Processing it using parse_csv took about 7 seconds on my computer,
which has an 866 MHz Pentium processor.

Ruby's standard-lib csv.rb reported an error in the file's format.

So I made a file containing 26907 copies of

111,222,333,444,555,666,777,888,999

Ruby's standard-lib csv.rb took about 35 seconds to process it;
parse_csv, about 5 seconds.

mepython · Jan 29, 2005

I got similar results with your parse_csv. This brings another question to
mind: this method is also written in Ruby, so why is there such a huge
overhead when we use the csv module compared with this method?

How can we modify it so that we can pass the field separator and record
separator as arguments?

William James · Jan 29, 2005

mepython said:
How can we modify it so that we can pass the field separator and record
separator as arguments?

This should do it. I found that not rebuilding the regular-expression
every time parse_csv is called made it even faster.

% # Record separator.
% RS = "\n"
%
% # Set regexp for parse_csv.
% # fs is the field-separator
% def fs_is( fs )
%   $csv_re = \
%     %r{ "( (?: [^\\"] | \\")* )" |
%         '( (?: [^\\'] | \\')* )' |
%         ( [^#{fs}]+ )
%       }x
% end
%
% class String
%   def parse_csv
%     raise "Method fs_is() wasn't called." if $csv_re.nil?
%     a = self.scan( $csv_re ).flatten
%     a.delete(nil)
%     a
%   end
% end
%
% fs_is( ',' )
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
%   p line.chomp.parse_csv
% }
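
The same "build the regexp once" idea can be sketched without a global variable by keeping the compiled regexp in an object. This is an illustration only: the class name `FieldScanner` is hypothetical, and the use of `Regexp.escape` for the separator is an addition that is not in the code above.

```ruby
# Hypothetical helper: builds the field regexp once per separator and
# reuses it for every line, instead of rebuilding it on each call.
class FieldScanner
  def initialize(fs = ',')
    sep = Regexp.escape(fs) # keeps separators such as "^" or "|" literal
    @re = %r{
      "( (?: [^\\"] | \\" )* )" |
      '( (?: [^\\'] | \\' )* )' |
      ( [^#{sep}]+ )
    }x
  end

  # Returns the non-nil captures, like parse_csv above.
  def parse(line)
    line.scan(@re).flatten.compact
  end
end

scanner = FieldScanner.new(';')
p scanner.parse('a;b;"foo; bar";c')
```

Like the version it is modelled on, this sketch still drops empty fields.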

William James · Jan 29, 2005

Improved version:

% # Record separator.
% RS = "\n"
%
% class String
%   # Set regexp for parse_csv.
%   # self is the field-separator
%   def is_fs
%     $csv_re = \
%       %r{ "( (?: [^\\"] | \\")* )" |
%           '( (?: [^\\'] | \\')* )' |
%           ( [^#{self}]+ )
%         }x
%   end
%   def parse_csv
%     raise "Method #is_fs wasn't called." if $csv_re.nil?
%     self.scan( $csv_re ).flatten.compact
%   end
% end
%
% ','.is_fs
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
%   p line.chomp.parse_csv
% }

mepython · Jan 29, 2005

I found an error in parse_csv: if a field is empty, it is ignored. For
example:
x,y,z
1,,2

The second line should return [1,nil,2]; instead it returns [1,2].

How hard is it to do the reverse: create a csv string from a list?

Thanks. I just started Ruby a couple of days ago, so I am learning
instead of implementing. Sorry.
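
The dropped field follows directly from the `[^,]+` part of the regexp, which needs at least one character to match. A minimal sketch on unquoted data only; the comparison with `String#split` is added here for contrast and is not part of parse_csv.

```ruby
line = '1,,2'

# [^,]+ needs at least one character, so the empty field never matches.
p line.scan(/[^,]+/)

# String#split with a negative limit keeps empty fields,
# including trailing ones.
p line.split(',', -1)
p '1,2,,'.split(',', -1)
```

This says nothing about quoted fields; it only shows where the empty field is lost.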

William James · Jan 30, 2005

1,,2 now returns ["1", "", "2"]

Use arry.to_csv to create a csv string from an array.

% ## Record separator.
% RS = "\n"
%
% class String
%   # Set regexp for parse_csv.
%   # self is the field-separator, which must be
%   # a single character.
%   def is_fs
%     $csv_fs = self
%     if "^" == $csv_fs
%       fs = "\\^"
%     else
%       fs = $csv_fs
%     end
%     $csv_re = \
%       %r! (?:
%             "( [^"\\]* (?: \\.[^"\\]* )* )" |
%             ( [^#{fs}]* )
%           )
%           [#{fs}]
%         !x
%   end
%   def parse_csv
%     raise "Method #is_fs wasn't called." if $csv_re.nil?
%     (self+$csv_fs).scan( $csv_re ).flatten.compact
%   end
% end
%
% class Array
%   def to_csv
%     raise "Method #is_fs wasn't called." if $csv_fs.nil?
%     s = ''
%     self.each { |x|
%       x = '"'+x+'"' if x.index( $csv_fs ) or x.index( '"' )
%       s += x + $csv_fs
%     }
%     s[0 .. -2]
%   end
% end
%
%
% ",".is_fs
%
% ## Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
%   line.chomp!
%   puts "-------------------"
%   puts line
%   ary = line.parse_csv
%   p ary
%   puts ary.to_csv
% }

Tim Sutherland · Jan 30, 2005

[...]

How hard to do reverse: create csv string from list?

Thanks. I just started Ruby couple of days ago, so I am learning
instead of implementing, Sorry.

This assumes the input is an array of lines. (Where each line is an array.)

class Array
  def to_csv
    map { |line|
      line.map { |cell|
        '"' + cell.gsub(/"/, '""') + '"'
      }.join(',')
    }.join("\n")
  end
end

Note that literal quotes " are replaced with "".

Bertram Scharpf · Jan 30, 2005

Hi,

On Sunday, 30 Jan 2005, 17:35:49 +0900, Tim Sutherland wrote:

This assumes the input is an array of lines. (Where each line is an array.)

class Array
  def to_csv
    map { |line|
      line.map { |cell|
        '"' + cell.gsub(/"/, '""') + '"'
      }.join(',')
    }.join("\n")
  end
end

How about this (untested):

class Array
  def to_csv sep = ';'
    quo = '"'
    map { |line|
      line.map { |cell|
        c = cell.to_s
        if c.include? sep or c.include? quo
          quo + c.gsub( quo, quo*2) + quo
        else
          c
        end
      }.join sep
    }.join $/
  end
end

Bertram
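
The quoting rule in the untested code above can be tried out as a pair of plain functions, without reopening Array. A minimal sketch; `csv_cell` and `csv_line` are hypothetical names, and the rule itself (quote only when the cell contains the separator or a quote, and double embedded quotes) is taken from the post.

```ruby
# Hypothetical standalone version of the quoting rule: quote a cell only
# when it contains the separator or a quote, doubling embedded quotes.
def csv_cell(cell, sep = ';', quo = '"')
  c = cell.to_s
  if c.include?(sep) || c.include?(quo)
    quo + c.gsub(quo, quo * 2) + quo
  else
    c
  end
end

# Join the converted cells of one row with the separator.
def csv_line(cells, sep = ';')
  cells.map { |cell| csv_cell(cell, sep) }.join(sep)
end

puts csv_line(['plain', 'semi;colon', 'say "hi"', 42])
```

Note that this rule does not quote cells containing a newline, so such cells would still break a line-based reader.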

William James · Jan 31, 2005

Now assumes a quotation mark within a field is represented as ""
(previous versions assumed \" ).
Lacks one thing: cannot handle a newline within a field.

% # Record separator.
% RS = "\n"
%
% class Array
%   def to_csv
%     raise "Method #is_fs wasn't called." if $csv_fs.nil?
%     s = ''
%     self.map { |item|
%       str = item.to_s
%       if str.index( $csv_fs ) or /^\s|"|\s$/.match(str)
%         str = '"' + str.gsub( /"/, '""' ) + '"'
%       end
%       str
%     }.join($csv_fs)
%   end
%   def unescape
%     self.map!{|x| x.gsub( /""/, '"' ) }
%   end
% end
%
% class String
%   # Set regexp for parse_csv.
%   # self is the field-separator, which must be
%   # a single character.
%   def is_fs
%     $csv_fs = self
%     if "^" == $csv_fs
%       fs = "\\^"
%     else
%       fs = $csv_fs
%     end
%     $csv_re = \
%       ## Assumes embedded quotes are escaped as "".
%       %r! \s*
%           (?:
%             "( [^"]* (?: "" [^"]* )* )" |
%             ( .*? )
%           )
%           \s*
%           [#{fs}]
%         !x
%   end
%   def parse_csv
%     raise "Method #is_fs wasn't called." if $csv_re.nil?
%     (self+$csv_fs).scan( $csv_re ).flatten.compact.unescape
%   end
% end
%
% ",".is_fs
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
%   line.chomp!
%   puts line
%   ary = line.parse_csv
%   p ary
%   puts ary.to_csv
% }

William James · Feb 1, 2005

A small, fast, and (I think) complete csv parser.

Now handles newlines within fields.
A comma is now the default field-separator.

| class Array
|   def to_csv
|     ",".is_fs if $csv_fs.nil?
|     s = ''
|     self.map { |item|
|       str = item.to_s
|       # Quote the string if it contains the field-separator or
|       # a " or a newline, or if it has leading or trailing whitespace.
|       if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
|         str = '"' + str.gsub( /"/, '""' ) + '"'
|       end
|       str
|     }.join($csv_fs)
|   end
|   def unescape
|     self.map{|x| x.gsub( /""/, '"' ) }
|   end
| end
|
| class String
|   # Set regexp for parse_csv.
|   # self is the field-separator, which must be
|   # a single character.
|   def is_fs
|     $csv_fs = self
|     if "^" == $csv_fs
|       fs = "\\^"
|     else
|       fs = $csv_fs
|     end
|     $csv_re = \
|       ## Assumes embedded quotes are escaped as "".
|       %r{ \s*
|           (?:
|             "( [^"]* (?: "" [^"]* )* )" |
|             ( .*? )
|           )
|           \s*
|           [#{fs}]
|         }mx
|   end
|
|   def parse_string
|     (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
|   end
| end
|
| def get_rec( file )
|   ",".is_fs if $csv_re.nil?
|   $csv_s = ""
|   begin
|     if file.eof?
|       raise "The csv file is malformed." if $csv_s.size>0
|       return nil
|     end
|     $csv_s += file.gets
|   end until $csv_s.count( '"' ) % 2 == 0
|   $csv_s.chomp!
|   $csv_s.parse_string
| end
|
|
| while rec = get_rec( ARGF )
|   puts "----------------"
|   puts $csv_s
|   p rec
|   puts rec.to_csv
|
| end
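
The part of get_rec that copes with newlines inside fields is the quote count: physical lines are appended until the number of double quotes seen so far is even. A minimal sketch of just that step, without the global variables; `read_record` is a hypothetical name and StringIO stands in for the input file.

```ruby
require 'stringio'

# Hypothetical helper: read one logical CSV record from io, joining
# physical lines until the number of double quotes seen is even.
def read_record(io)
  buf = +''
  until io.eof?
    buf << io.gets
    return buf.chomp if buf.count('"').even?
  end
  # Input ended inside an open quoted field.
  raise 'The csv data is malformed.' unless buf.empty?
  nil
end

io = StringIO.new(%Q{a,"two\nlines",b\nc,d\n})
while (rec = read_record(io))
  p rec
end
```

The first record spans two physical lines because its quoted field is still open after the first line. The parity check relies on embedded quotes being written as "", so that every quoted field contributes an even number of quote characters.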

Ryan Davis · Feb 1, 2005

A small, fast, and (I think) complete csv parser.

There is test_csv.rb in the ruby tarball. Can you run your new code
against it to make sure it is complete? With good profile numbers I
doubt it'd be hard to get the slower code replaced.

Ralf Müller · Feb 1, 2005

Found a parser in a German Ruby book by Röhrl, Schmiedl and Wyss. With a little improvement, it supports unquoted, '-quoted and "-quoted cells in any order:

#!/usr/bin/env ruby
class CSVParser
  include Enumerable

  QUOTED = /('|"){1,1}(.*?)\1{1,1}(,|\r?\n)/m
  UNQUOTED = /()(.*?)(,|\r?\n)/m

  def initialize(string)
    @string = string
  end

  # datafields of a line are provided as an array
  def each
    while @string != ''
      tokens = []
      while @string != ''
        case @string[0..0]
        # empty cell
        when ","
          tokens << nil
          @string.slice!(0..0)
          next
        # last cell is empty
        when /\r?\n/
          tokens << nil
          # exclusive range: remove only the matched line ending
          @string.slice!(0...$&.size)
          break
        # complex cell
        when /('|")/
          pattern = QUOTED
          dequote = true
        # simple cell
        else
          pattern = UNQUOTED
          dequote = false
        end
        # match the content
        md = pattern.match(@string)
        token = md[2]
        # token.gsub('""','"') if dequote
        tokens << token
        @string.slice!(0...md[0].size)
        # last cell
        break if md[0][-1..-1] == "\n"
      end
      yield tokens
    end
  end
end

# =============================================================================
# MAIN ------------------------------------------------------------------------
cvs = CSVParser.new($stdin.read)
Start = "'"
End = "'\n"
Sep = "','"
cvs.each{|row|
  puts Start + row[0].to_s + Sep + row.join(Sep) + End if row[2].to_i <= 400000 and row.last != ''
}

regards
ralf

William James · Feb 1, 2005

Ryan said:
There is test_csv.rb in the ruby tarball. Can you run your new code
against it to make sure it is complete? With good profile numbers I
doubt it'd be hard to get the slower code replaced.

Wow. test_csv.rb is beyond my comprehension. I don't know how
to use it.

I did lift a very complex test string from it to use in testing
my program. One of the fields in that csv string is defective;
I don't know whether that was intentional or not:

"\r\n"\r\nNaHi,

The " in the field isn't doubled, and the field doesn't end
with a quote.

Incidentally, when my program converts that string to an array
and then back to a csv string, it's not the same as
the original string because ,"", is shortened to ,, .
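
For comparison, a parser can keep this round trip exact by treating an unquoted empty field and a quoted empty field as different values. The sketch below assumes a current Ruby with today's csv library (not the 2005 csv.rb), where the first parses as nil and the second as an empty string.

```ruby
require 'csv'

# An unquoted empty field parses as nil; a quoted one as "".
p CSV.parse_line('a,,b')
p CSV.parse_line('a,"",b')

# Writing the values back keeps the distinction.
p CSV.generate_line(['a', nil, 'b'])
p CSV.generate_line(['a', '', 'b'])
```

A parser that returns "" in both cases, like the one in this post, has no way to tell on output which form the input used.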

I corrected a minor bug in my code by moving
",".is_fs if $csv_fs.nil?
to its proper location.

The program conforms to the csv specification at this site:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
and it handles the sample csv records shown there.

All my program can do is read a text file containing csv records,
convert those records (strings) into arrays of strings, and
convert the arrays back into csv strings. I suppose that the
csv library that comes with Ruby may do more than that.

% ## Read, parse, and create csv records.
% ## Has a minor bug fix; discard previous versions.
% ## 2005-02-01.
%
% class Array
%   def to_csv
%     ",".is_fs if $csv_fs.nil?
%     s = ''
%     self.map { |item|
%       str = item.to_s
%       # Quote the string if it contains the field-separator or
%       # a " or a newline, or if it has leading or trailing whitespace.
%       if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
%         str = '"' + str.gsub( /"/, '""' ) + '"'
%       end
%       str
%     }.join($csv_fs)
%   end
%   def unescape
%     self.map{|x| x.gsub( /""/, '"' ) }
%   end
% end
%
% class String
%   # Set regexp for parse_csv.
%   # self is the field-separator, which must be
%   # a single character.
%   def is_fs
%     $csv_fs = self
%     if "^" == $csv_fs
%       fs = "\\^"
%     else
%       fs = $csv_fs
%     end
%     $csv_re = \
%       ## Assumes embedded quotes are escaped as "".
%       %r{ \s*
%           (?:
%             "( [^"]* (?: "" [^"]* )* )" |
%             ( .*? )
%           )
%           \s*
%           [#{fs}]
%         }mx
%   end
%
%   def parse_string
%     ",".is_fs if $csv_fs.nil?
%     (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
%   end
% end
%
% def get_rec( file )
%   $csv_s = ""
%   begin
%     if file.eof?
%       raise "The csv file is malformed." if $csv_s.size>0
%       return nil
%     end
%     $csv_s += file.gets
%   end until $csv_s.count( '"' ) % 2 == 0
%   $csv_s.chomp!
%   $csv_s.parse_string
% end
%
%
% # while rec = get_rec( ARGF )
% #   puts "----------------"
% #   puts $csv_s
% #   p rec
% #   puts rec.to_csv
% # end
%
% ## Here is my breakdown of the test string from test-csv.rb.
% # foo,
% # """foo""",
% # "foo,bar",
% # """""",
% # "",
% # ,
% # "\r",
% # "\r\n""\r\nNaHi", <---<< Corrected.
% # """Na""",
% # "Na,Hi",
% # "\r.\n",
% # "\r\n\n",
% # """",
% # "\n",
% # "\r\n"
%
% # Original.
% csvStr = ("foo,!!!foo!!!,!foo,bar!,!!!!!!,!!,," +
%   "!\r!,!\r\n!\r\nNaHi,!!!Na!!!,!Na,Hi!," +
%   "!\r.\n!,!\r\n\n!,!!!!,!\n!,!\r\n!").gsub('!', '"')
%
% # Corrected?
% csvStr = ("foo,!!!foo!!!,!foo,bar!,!!!!!!,!!,," +
%   "!\r!,!\r\n!!\r\nNaHi!,!!!Na!!!,!Na,Hi!," +
%   "!\r.\n!,!\r\n\n!,!!!!,!\n!,!\r\n!").gsub('!', '"')
%
% p csvStr
% arry = csvStr.parse_string
% p arry
% newCsvStr = arry.to_csv
% p newCsvStr
% arry2 = newCsvStr.parse_string
% puts "Arrays match." if arry == arry2
