Why is CSV file processing so slow?

mepython

I want to process a CSV file. Here is a small program in Python and in Ruby:

[root@taamportable GMS]# cat x.py
import csv
reader = csv.reader(file('x.csv'))
header = reader.next()
count = 0
for data in reader:
    count += 1
print count



[root@taamportable GMS]# cat x.rb
require 'csv'
reader = CSV.open('x.csv', 'r')
header = reader.shift
count = 0
reader.each {|data|
  count += 1
}
p count

*******************************************************
Here are the processing times. As you can see, Ruby is way too slow. Is there
anything that can be done about the Ruby code?
*******************************************************
[root@taamportable GMS]# time python x.py
26907

real 0m0.311s
user 0m0.302s
sys 0m0.009s


[root@taamportable GMS]# time ruby x.rb
26907

real 1m48.296s
user 1m36.853s
sys 0m11.188s
 
Robert Klemme

First I'd try to figure out whether it's the I/O that's slow or CSV. Did you test
with something like this:

File.open('x.csv') do |reader|
  count = 0
  reader.each {|data| count += 1}
  p count
end

Does it make a huge difference?

Kind regards

robert
 
mepython

It is the csv module. Reading the file line by line in Ruby takes about twice
as long as in Python, so the real slowness is coming from csv.

count = 0
File.open('x.csv') do |reader|
  reader.each {|data| count += 1}
end
p count


[root@taamportable GMS]# time ruby x1.rb
26908

real 0m0.077s
user 0m0.060s
sys 0m0.016s


[root@taamportable GMS]# time python x1.py
26908

real 0m0.041s
user 0m0.032s
sys 0m0.010s
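
To separate raw I/O cost from parsing cost, as in the measurements above, both loops can be timed in one script. This is only a minimal sketch: `time_it` is a hypothetical helper, the sample file is generated rather than being the original x.csv, and the naive split merely stands in for a parser (it is not equivalent to csv.rb).

```ruby
require 'tempfile'

# Hypothetical helper: run the block once, return [elapsed_seconds, result].
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [Process.clock_gettime(Process::CLOCK_MONOTONIC) - start, result]
end

# Build a sample file: one header line plus 26907 data rows (assumed layout).
sample = Tempfile.new(['x', '.csv'])
sample.puts 'a,b,c,d,e,f,g,h,i'
26_907.times { sample.puts '111,222,333,444,555,666,777,888,999' }
sample.close

# Pass 1: only read lines (I/O cost).
io_time, line_count = time_it do
  count = 0
  File.foreach(sample.path) { count += 1 }
  count
end

# Pass 2: read lines and split each one on commas (I/O plus naive parsing).
split_time, field_count = time_it do
  fields = 0
  File.foreach(sample.path) { |line| fields += line.chomp.split(',').size }
  fields
end

puts "lines: #{line_count}, read only: #{format('%.3f', io_time)}s"
puts "fields: #{field_count}, read + split: #{format('%.3f', split_time)}s"
sample.unlink
```

If the second number is only slightly larger than the first, the parsing step itself is cheap and any large slowdown has to come from the CSV library.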
 
Andrew Johnson

Here are the processing times. As you can see, Ruby is way too slow. Is there
anything that can be done about the Ruby code?

Well, the python library csv.py uses the underlying _csv module which
is written in C ... Ruby's standard-lib csv.rb is all Ruby. I don't
know of any csv extensions for Ruby.

regards,
andrew
 
Robert Klemme

As a simple CSV replacement you could try this:

count = 0
File.open('x.csv') do |reader|
  reader.each {|line|
    count += 1
    data = line.split(/,/)
  }
end
p count

Depending on your data that might or might not be sufficient. Regexps can
be arbitrarily sophisticated. Here's another one:

data = []
line.scan( %r{
"((?:[^\\"]|\\")*)" |
'((?:[^\\']|\\')*)' |
([^,]+)
}x ){|m| data << m.find {|x|x}}

:))

robert
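
The difference between the two approaches shows up as soon as a quoted field contains a comma. A small sketch applying the regexp above to one sample line (the constant and variable names are just for illustration):

```ruby
line = 'a,b,"foo, bar",c'

# Plain split breaks the quoted field at the embedded comma.
naive = line.split(/,/)

# The scan-based regexp keeps quoted fields together.
FIELD_RE = %r{
  "((?:[^\\"]|\\")*)" |
  '((?:[^\\']|\\')*)' |
  ([^,]+)
}x

scanned = []
line.scan(FIELD_RE) { |m| scanned << m.find { |x| x } }

p naive    # ["a", "b", "\"foo", " bar\"", "c"]
p scanned  # ["a", "b", "foo, bar", "c"]
```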
 
William James

Robert said:
Depending on your data that might or might not be sufficient. Regexps can
be arbitrarily sophisticated. Here's another one:

data = []
line.scan( %r{
"((?:[^\\"]|\\")*)" |
'((?:[^\\']|\\')*)' |
([^,]+)
}x ){|m| data << m.find {|x|x}}


I borrowed your regexp.

% class String
% def parse_csv
% a = self.scan(
% %r{ "( (?: [^\\"] | \\")* )" |
% '( (?: [^\\'] | \\')* )' |
% ( [^,]+ )
% }x ).flatten
% a.delete(nil)
% a
% end
% end
%
% ARGF.each_line { | line |
% p line.chomp.parse_csv
% }

With this input

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b

the output is

["a", "b", "foo, bar", "c"]
["foo isn't \\\"bar\\\"", "a", "b"]
["a", "\"just,my,luck\"", "b"]
 
William James

William said:
% class String
% def parse_csv
% a = self.scan(
% %r{ "( (?: [^\\"] | \\")* )" |
% '( (?: [^\\'] | \\')* )' |
% ( [^,]+ )
% }x ).flatten
% a.delete(nil)
% a
% end
% end

To test the method parse_csv, I created a 1 megabyte file consisting of
4228 copies of

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9

Processing it using parse_csv took about 7 seconds on my computer,
which has an 866 MHz Pentium processor.

Ruby's standard-lib csv.rb reported an error in the file's format.

So I made a file containing 26907 copies of

111,222,333,444,555,666,777,888,999

Ruby's standard-lib csv.rb took about 35 seconds to process it;
parse_csv, about 5 seconds.
 
mepython

I got similar results with your parse_csv. This raises another question:
this method is also written in Ruby, so why is there such a huge overhead
when we use the csv module compared with this method?

How can we modify it so that we can pass the field separator and record
separator as arguments?

 
William James

mepython said:
How can we modify it so that we can pass the field separator and record
separator as arguments?

This should do it. I found that not rebuilding the regular-expression
every time parse_csv is called made it even faster.


% # Record separator.
% RS = "\n"
%
% # Set regexp for parse_csv.
% # fs is the field-separator
% def fs_is( fs )
% $csv_re = \
% %r{ "( (?: [^\\"] | \\")* )" |
% '( (?: [^\\'] | \\')* )' |
% ( [^#{fs}]+ )
% }x
% end
%
% class String
% def parse_csv
% raise "Method fs_is() wasn't called." if $csv_re.nil?
% a = self.scan( $csv_re ).flatten
% a.delete(nil)
% a
% end
% end
%
% fs_is( ',' )
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
% p line.chomp.parse_csv
% }
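
One way to avoid both the global variable and the rebuild on every call is to cache one compiled regexp per separator. A rough sketch, assuming the hypothetical names `CSV_RE_CACHE`, `csv_regexp_for` and `parse_csv_line`; `Regexp.escape` is used so that separators such as `^` or `|` are safe inside the character class. Like the version above, it still skips empty fields.

```ruby
# Cache of compiled regexps, keyed by field separator (hypothetical name).
CSV_RE_CACHE = {}

# Build the regexp for a separator once and reuse it afterwards.
def csv_regexp_for(fs)
  CSV_RE_CACHE[fs] ||= begin
    sep = Regexp.escape(fs)
    %r{ "( (?: [^\\"] | \\")* )" |
        '( (?: [^\\'] | \\')* )' |
        ( [^#{sep}]+ )
      }x
  end
end

# Split one line into fields using the cached regexp for the separator.
def parse_csv_line(line, fs = ',')
  line.scan(csv_regexp_for(fs)).flatten.compact
end

p parse_csv_line('a,b,"foo, bar",c')   # ["a", "b", "foo, bar", "c"]
p parse_csv_line('1;2;3', ';')         # ["1", "2", "3"]
```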
 
William James

Improved version:

% # Record separator.
% RS = "\n"
%
% class String
% # Set regexp for parse_csv.
% # self is the field-separator
% def is_fs
% $csv_re = \
% %r{ "( (?: [^\\"] | \\")* )" |
% '( (?: [^\\'] | \\')* )' |
% ( [^#{self}]+ )
% }x
% end
% def parse_csv
% raise "Method #is_fs wasn't called." if $csv_re.nil?
% self.scan( $csv_re ).flatten.compact
% end
% end
%
% ','.is_fs
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
% p line.chomp.parse_csv
% }
 
mepython

I found an error in parse_csv: if a field is empty, it is ignored. For
example:
x,y,z
1,,2

The second line should return [1,nil,2]; instead it returns [1,2].

How hard is it to do the reverse: create a CSV string from a list?

Thanks. I just started Ruby a couple of days ago, so I am learning
instead of implementing, sorry.
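
The dropped field comes from the `[^,]+` branch of the regexp, which needs at least one character, so the empty slot between two commas never produces a match. A minimal illustration on unquoted data only (no quote handling here):

```ruby
line = '1,,2'

# One-or-more matching silently skips the empty field.
dropped = line.scan(/[^,]+/)

# split with a negative limit keeps empty fields, including trailing ones.
kept = line.split(',', -1)

p dropped  # ["1", "2"]
p kept     # ["1", "", "2"]
```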
 
William James

1,,2 now returns ["1", "", "2"].

Use ary.to_csv to create a CSV string from an array.

% ## Record separator.
% RS = "\n"
%
% class String
% # Set regexp for parse_csv.
% # self is the field-separator, which must be
% # a single character.
% def is_fs
% $csv_fs = self
% if "^" == $csv_fs
% fs = "\\^"
% else
% fs = $csv_fs
% end
% $csv_re = \
% %r! (?:
% "( [^"\\]* (?: \\.[^"\\]* )* )" |
% ( [^#{fs}]* )
% )
% [#{fs}]
% !x
% end
% def parse_csv
% raise "Method #is_fs wasn't called." if $csv_re.nil?
% (self+$csv_fs).scan( $csv_re ).flatten.compact
% end
% end
%
% class Array
% def to_csv
% raise "Method #is_fs wasn't called." if $csv_fs.nil?
% s = ''
% self.each { |x|
% x = '"'+x+'"' if x.index( $csv_fs ) or x.index( '"' )
% s += x + $csv_fs
% }
% s[0 .. -2]
% end
% end
%
%
% ",".is_fs
%
% ## Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
% line.chomp!
% puts "-------------------"
% puts line
% ary = line.parse_csv
% p ary
% puts ary.to_csv
% }
 
Tim Sutherland

[...]
How hard is it to do the reverse: create a CSV string from a list?

Thanks. I just started Ruby a couple of days ago, so I am learning
instead of implementing, sorry.

This assumes the input is an array of lines. (Where each line is an array.)

class Array
  def to_csv
    map { |line|
      line.map { |cell|
        '"' + cell.gsub(/"/, '""') + '"'
      }.join(',')
    }.join("\n")
  end
end

Note that literal quotes " are replaced with "".
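
For example, here is the same quoting rule as a standalone method. This is a sketch; `rows_to_csv` is a made-up name, used to avoid reopening Array.

```ruby
# Quote every cell and double any embedded quotes, as in the method above.
def rows_to_csv(rows)
  rows.map { |line|
    line.map { |cell| '"' + cell.gsub(/"/, '""') + '"' }.join(',')
  }.join("\n")
end

rows = [['a', 'say "hi"'], ['b', 'c,d']]
puts rows_to_csv(rows)
# "a","say ""hi"""
# "b","c,d"
```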
 
Bertram Scharpf

Hi,

On Sunday, 30 Jan 2005, 17:35:49 +0900, Tim Sutherland wrote:
[...]
How hard is it to do the reverse: create a CSV string from a list?

Thanks. I just started Ruby a couple of days ago, so I am learning
instead of implementing, sorry.

This assumes the input is an array of lines. (Where each line is an array.)

class Array
  def to_csv
    map { |line|
      line.map { |cell|
        '"' + cell.gsub(/"/, '""') + '"'
      }.join(',')
    }.join("\n")
  end
end

How about this (untested):

class Array
  def to_csv sep = ';'
    quo = '"'
    map { |line|
      line.map { |cell|
        c = cell.to_s
        if c.include? sep or c.include? quo
          quo + c.gsub( quo, quo*2) + quo
        else
          c
        end
      }.join sep
    }.join $/
  end
end

Bertram
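
Since the version above is marked untested, here is a quick way to check the conditional quoting. It is a sketch under assumptions: the logic is moved into a standalone method with the made-up name `rows_to_csv_sep`, and rows are joined with "\n" instead of `$/`. Cells containing a newline are not quoted by this rule, which may or may not matter for a given data set.

```ruby
# Quote a cell only when it contains the separator or a quote character.
def rows_to_csv_sep(rows, sep = ';')
  quo = '"'
  rows.map { |line|
    line.map { |cell|
      c = cell.to_s
      if c.include?(sep) || c.include?(quo)
        quo + c.gsub(quo, quo * 2) + quo
      else
        c
      end
    }.join(sep)
  }.join("\n")
end

puts rows_to_csv_sep([[1, 'a;b', 'say "hi"'], [nil, 'plain', 2.5]])
# 1;"a;b";"say ""hi"""
# ;plain;2.5
```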
 
William James

Now assumes a quotation mark within a field is represented as ""
(previous versions assumed \" ).
Lacks one thing: cannot handle a newline within a field.


% # Record separator.
% RS = "\n"
%
% class Array
% def to_csv
% raise "Method #is_fs wasn't called." if $csv_fs.nil?
% s = ''
% self.map { |item|
% str = item.to_s
% if str.index( $csv_fs ) or /^\s|"|\s$/.match(str)
% str = '"' + str.gsub( /"/, '""' ) + '"'
% end
% str
% }.join($csv_fs)
% end
% def unescape
% self.map!{|x| x.gsub( /""/, '"' ) }
% end
% end
%
% class String
% # Set regexp for parse_csv.
% # self is the field-separator, which must be
% # a single character.
% def is_fs
% $csv_fs = self
% if "^" == $csv_fs
% fs = "\\^"
% else
% fs = $csv_fs
% end
% $csv_re = \
% ## Assumes embedded quotes are escaped as "".
% %r! \s*
% (?:
% "( [^"]* (?: "" [^"]* )* )" |
% ( .*? )
% )
% \s*
% [#{fs}]
% !x
% end
% def parse_csv
% raise "Method #is_fs wasn't called." if $csv_re.nil?
% (self+$csv_fs).scan( $csv_re ).flatten.compact.unescape
% end
% end
%
% ",".is_fs
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
% line.chomp!
% puts line
% ary = line.parse_csv
% p ary
% puts ary.to_csv
% }
 
William James

A small, fast, and (I think) complete csv parser.

Now handles newlines within fields.
A comma is now the default field-separator.


| class Array
| def to_csv
| ",".is_fs if $csv_fs.nil?
| s = ''
| self.map { |item|
| str = item.to_s
| # Quote the string if it contains the field-separator or
| # a " or a newline, or if it has leading or trailing whitespace.
| if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
| str = '"' + str.gsub( /"/, '""' ) + '"'
| end
| str
| }.join($csv_fs)
| end
| def unescape
| self.map{|x| x.gsub( /""/, '"' ) }
| end
| end
|
| class String
| # Set regexp for parse_csv.
| # self is the field-separator, which must be
| # a single character.
| def is_fs
| $csv_fs = self
| if "^" == $csv_fs
| fs = "\\^"
| else
| fs = $csv_fs
| end
| $csv_re = \
| ## Assumes embedded quotes are escaped as "".
| %r{ \s*
| (?:
| "( [^"]* (?: "" [^"]* )* )" |
| ( .*? )
| )
| \s*
| [#{fs}]
| }mx
| end
|
| def parse_string
| (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
| end
| end
|
| def get_rec( file )
| ",".is_fs if $csv_re.nil?
| $csv_s = ""
| begin
| if file.eof?
| raise "The csv file is malformed." if $csv_s.size>0
| return nil
| end
| $csv_s += file.gets
| end until $csv_s.count( '"' ) % 2 == 0
| $csv_s.chomp!
| $csv_s.parse_string
| end
|
|
| while rec = get_rec( ARGF )
| puts "----------------"
| puts $csv_s
| p rec
| puts rec.to_csv
|
| end
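
The core idea in get_rec is that a record is complete only when it contains an even number of double quotes. A stripped-down sketch of just that step, assuming a hypothetical `read_record` method and a StringIO standing in for the file (no globals, no field parsing):

```ruby
require 'stringio'

# Read physical lines until the accumulated text has balanced quotes.
# Returns the record without its trailing newline, or nil at end of input.
def read_record(io)
  record = ''
  loop do
    line = io.gets
    if line.nil?
      raise 'The csv input is malformed.' unless record.empty?
      return nil
    end
    record += line
    break if record.count('"').even?
  end
  record.chomp
end

input = StringIO.new(%Q(a,"two\nlines",b\nc,d\n))
records = []
while (rec = read_record(input))
  records << rec
end
p records  # ["a,\"two\nlines\",b", "c,d"]
```

Counting quotes works because an embedded quote is written as "", which keeps the total even inside a well-formed record.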
 
Ryan Davis

A small, fast, and (I think) complete csv parser.

There is test_csv.rb in the ruby tarball. Can you run your new code
against it to make sure it is complete? With good profile numbers I
doubt it'd be hard to get the slower code replaced.
 
Ralf Müller

I found a parser in a German Ruby book by Röhrl, Schmiedl and Wyss. With a little improvement, it supports unquoted, '-quoted and "-quoted cells in any order:

#!/usr/bin/env ruby
class CSVParser
  include Enumerable

  QUOTED = /('|"){1,1}(.*?)\1{1,1}(,|\r?\n)/m
  UNQUOTED = /()(.*?)(,|\r?\n)/m

  def initialize(string)
    @string = string
  end

  # datafields of a line are provided as an array
  def each
    while @string != ''
      tokens = []
      while @string != ''
        case @string[0..0]
        # empty cell
        when ","
          tokens << nil
          @string.slice!(0..0)
          next
        # last cell is empty
        when /\r?\n/
          tokens << nil
          # remove only the matched newline (exclusive range)
          @string.slice!(0...$&.size)
          break
        # complex cell
        when /('|")/
          pattern = QUOTED
          dequote = true
        # simple cell
        else
          pattern = UNQUOTED
          dequote = false
        end
        # match the content
        md = pattern.match(@string)
        token = md[2]
        # token.gsub('""','"') if dequote
        tokens << token
        @string.slice!(0...md[0].size)
        # last cell
        break if md[0][-1..-1] == "\n"
      end
      yield tokens
    end
  end
end


# =============================================================================
# MAIN ------------------------------------------------------------------------
cvs =CSVParser.new($stdin.read)
Start = "'"
End = "'\n"
Sep = "','"
cvs.each{|row|
  puts Start + row[0].to_s + Sep + row.join(Sep) + End if row[2].to_i <= 400000 and row.last != ''
}


regards
ralf
 
William James

Ryan said:
There is test_csv.rb in the ruby tarball. Can you run your new code
against it to make sure it is complete? With good profile numbers I
doubt it'd be hard to get the slower code replaced.

Wow. test_csv.rb is beyond my comprehension. I don't know how
to use it.

I did lift a very complex test string from it to use in testing
my program. One of the fields in that csv string is defective;
I don't know whether that was intentional or not:

"\r\n"\r\nNaHi,

The " in the field isn't doubled, and the field doesn't end
with a quote.

Incidentally, when my program converts that string to an array
and then back to a csv string, it's not the same as
the original string because ,"", is shortened to ,, .

I corrected a minor bug in my code by moving
",".is_fs if $csv_fs.nil?
to its proper location.

The program conforms to the csv specification at this site:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
and it handles the sample csv records shown there.

All my program can do is read a text file containing csv records,
convert those records (strings) into arrays of strings, and
convert the arrays back into csv strings. I suppose that the
csv library that comes with Ruby may do more than that.


% ## Read, parse, and create csv records.
% ## Has a minor bug fix; discard previous versions.
% ## 2005-02-01.
%
% class Array
% def to_csv
% ",".is_fs if $csv_fs.nil?
% s = ''
% self.map { |item|
% str = item.to_s
% # Quote the string if it contains the field-separator or
% # a " or a newline, or if it has leading or trailing whitespace.
% if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
% str = '"' + str.gsub( /"/, '""' ) + '"'
% end
% str
% }.join($csv_fs)
% end
% def unescape
% self.map{|x| x.gsub( /""/, '"' ) }
% end
% end
%
% class String
% # Set regexp for parse_csv.
% # self is the field-separator, which must be
% # a single character.
% def is_fs
% $csv_fs = self
% if "^" == $csv_fs
% fs = "\\^"
% else
% fs = $csv_fs
% end
% $csv_re = \
% ## Assumes embedded quotes are escaped as "".
% %r{ \s*
% (?:
% "( [^"]* (?: "" [^"]* )* )" |
% ( .*? )
% )
% \s*
% [#{fs}]
% }mx
% end
%
% def parse_string
% ",".is_fs if $csv_fs.nil?
% (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
% end
% end
%
% def get_rec( file )
% $csv_s = ""
% begin
% if file.eof?
% raise "The csv file is malformed." if $csv_s.size>0
% return nil
% end
% $csv_s += file.gets
% end until $csv_s.count( '"' ) % 2 == 0
% $csv_s.chomp!
% $csv_s.parse_string
% end
%
%
% # while rec = get_rec( ARGF )
% # puts "----------------"
% # puts $csv_s
% # p rec
% # puts rec.to_csv
% # end
%
% ## Here is my breakdown of the test string from test_csv.rb.
% # foo,
% # """foo""",
% # "foo,bar",
% # """""",
% # "",
% # ,
% # "\r",
% # "\r\n""\r\nNaHi", <---<< Corrected.
% # """Na""",
% # "Na,Hi",
% # "\r.\n",
% # "\r\n\n",
% # """",
% # "\n",
% # "\r\n"
%
% # Original.
% csvStr = ("foo,!!!foo!!!,!foo,bar!,!!!!!!,!!,," +
% "!\r!,!\r\n!\r\nNaHi,!!!Na!!!,!Na,Hi!," +
% "!\r.\n!,!\r\n\n!,!!!!,!\n!,!\r\n!").gsub('!', '"')
%
% # Corrected?
% csvStr = ("foo,!!!foo!!!,!foo,bar!,!!!!!!,!!,," +
% "!\r!,!\r\n!!\r\nNaHi!,!!!Na!!!,!Na,Hi!," +
% "!\r.\n!,!\r\n\n!,!!!!,!\n!,!\r\n!").gsub('!', '"')
%
% p csvStr
% arry = csvStr.parse_string
% p arry
% newCsvStr = arry.to_csv
% p newCsvStr
% arry2 = newCsvStr.parse_string
% puts "Arrays match." if arry == arry2
 
