Why is CSV file processing so slow?

NAKAMURA, Hiroshi

Hi,

William said:
> I did lift a very complex test string from it to use in testing
> my program. One of the fields in that csv string is defective;
> I don't know whether that was intentional or not:
>
> "\r\n"\r\nNaHi,
>
> The " in the field isn't doubled, and the field doesn't end
> with a quote.

The second \r\n is a record separator. Here's the sample data from test_csv.rb.

# sample data
#
#    1       2        3        4       5       6     7     8
# +------+-------+---------+-------+--------+------+----+------+
# | foo  | "foo" | foo,bar |  ""   |(empty) |(null)| \r | \r\n |
# +------+-------+---------+-------+--------+------+----+------+
# | NaHi | "Na"  |  Na,Hi  | \r.\n | \r\n\n |  "   | \n | \r\n |
# +------+-------+---------+-------+--------+------+----+------+
#

The table contains 2 records and each record has 8 fields.

> Incidentally, when my program converts that string to an array
> and then back to a csv string, it's not the same as
> the original string because ,"", is shortened to ,, .

In csv.rb, the string "" (0x22 0x22) means an empty string, and an empty
field means NULL. I needed to distinguish the two when I first wrote it.
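
For comparison, the csv library bundled with current Ruby versions (a later rewrite, not the 2005 csv.rb discussed in this thread) appears to draw the same distinction when parsing: a quoted empty field comes back as an empty string, and an unquoted empty field comes back as nil. A minimal sketch, assuming that library is installed:

```ruby
require "csv"

# Field 2 is a quoted empty string; field 4 is an unquoted, empty field.
row = CSV.parse_line('foo,"",bar,,baz')

row.each_with_index do |field, i|
  kind = field.nil? ? "NULL" : "string #{field.inspect}"
  puts "field #{i + 1}: #{kind}"
end
```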

And here are some scenarios you may be interested in.

irb(main):001:0> require 'csv'
=> true
irb(main):002:0> CSV.parse('"abc"def')
CSV::IllegalFormatError: CSV::IllegalFormatError
from /usr/local/lib/ruby/1.9/csv.rb:587:in `get_row'
from /usr/local/lib/ruby/1.9/csv.rb:536:in `each'
from /usr/local/lib/ruby/1.9/csv.rb:107:in `collect'
from /usr/local/lib/ruby/1.9/csv.rb:107:in `parse'
from (irb):2
irb(main):003:0> CSV.parse('"abc"def', 'def')
=> [["abc", nil]]
irb(main):004:0> CSV.parse('"abc"def"ghi"', 'd', 'f')
=> [["abc", "e"], ["ghi"]]
irb(main):005:0> CSV.parse('aaabaaacaaabaa', 'ab', 'ac')
=> [["aa", "aa"], ["aa", "aa"]]
irb(main):006:0> quit
% echo foo,bar | ruby -rcsv -e 'CSV.parse(STDIN) { |row| p row }'
["foo", "bar"]

Of course I don't think everyone needs this "complexity" (and slowness).
A regexp-based approach is very useful, too. (I often do that.)

Back to the original post of this thread: spending 110 sec. to parse
27000 CSV records seems too slow, even for a parser written in pure Ruby.
Can I have the csv file? I want to do profiling with the data...
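
Until the real file is available, a rough timing harness can at least show what a regexp-based parse of 27,000 records costs on a given machine. The sketch below is only an illustration: the data is synthetic, and `parse_line` is a hypothetical helper, not code from csv.rb or from the original poster.

```ruby
# Synthetic stand-in for the poster's file: 27,000 records, one quoted field each.
lines = Array.new(27_000) { |i| %(#{i},name#{i},"quoted, field",#{i * 2}) }

def parse_line(line)
  # One field per match: either a quoted field (with "" as an escaped quote)
  # or a run of non-comma characters, each followed by a comma.
  field_re = /"((?:[^"]|"")*)",|([^,]*),/
  # Append a separator so that the last field also ends in a comma.
  (line + ",").scan(field_re).map { |quoted, plain| (quoted || plain).gsub('""', '"') }
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
rows = lines.map { |line| parse_line(line) }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts "parsed #{rows.size} records in #{format('%.3f', elapsed)} s"
```

The monotonic clock is used so that the measurement is not affected by system clock adjustments; the absolute number will of course differ from the timings reported in this thread.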

Regards,
// NaHi

 
mepython

Here are the timings for parsing the csv file with regexCSV and with the standard CSV library:

time ruby regexCSV.rb
26908

real 0m10.072s
user 0m9.414s
sys 0m0.660s


[root@taamportable GMS]# time ruby standardCSV.rb
26907

real 1m48.296s
user 1m36.853s
sys 0m11.188s


Both are significantly slower than Python's csv module (which is written in C):

[root@taamportable GMS]# time python x.py
26907


real 0m0.311s
user 0m0.302s
sys 0m0.009s
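
To put the three "real" times side by side, the ratios can be worked out directly from the figures above. Note that the regexCSV run reports 26908 records while the other two report 26907, so the parsers may not agree on every line. A small sketch of the arithmetic:

```ruby
# Wall-clock ("real") times from the runs above, converted to seconds.
real_seconds = {
  "regexCSV.rb"    => 10.072,
  "standardCSV.rb" => 60 + 48.296, # 1m48.296s
  "x.py"           => 0.311,
}

baseline = real_seconds["standardCSV.rb"]
real_seconds.each do |script, secs|
  puts format("%-15s %8.3f s  (standardCSV.rb takes %.1fx as long)",
              script, secs, baseline / secs)
end
```

So on this data the standard library is roughly 11 times slower than the regexp version and roughly 350 times slower than the Python run.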
 
William James

> Of course I don't think everyone needs this "complexity" (and slowness).
> A regexp-based approach is very useful, too. (I often do that.)


Added a bit of complexity in the form of error-checking.
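
The idea behind the check can be shown on its own. Once the surrounding quotes have been stripped, every literal " inside a field must appear doubled, so after deleting all "" pairs no lone " may remain. A minimal sketch (the helper name `stray_quote?` is made up for this illustration and does not appear in the program):

```ruby
# field: the text of one csv field with any surrounding quotes already removed.
# Returns true if the field contains a double-quote that is not part of a "" pair.
def stray_quote?(field)
  field.gsub('""', '').include?('"')
end

['say ""hi""', 'plain', 'a"b'].each do |field|
  puts "#{field.inspect}: #{stray_quote?(field) ? 'bad' : 'ok'}"
end
```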


z ## Read, parse, and create csv records.
z ## 2005-02-01, v. 2.
z ## Added a pinch of optional error-checking.
z
z # The program conforms to the csv specification at this site:
z # http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
z # The only extra is that you can change the field-separator.
z # For a field-separator other than a comma, for example
z # a semicolon:
z # ";".is_fs
z #
z # After a record has been read and parsed,
z # $csv_s contains the record in raw string format.
z #
z # If $csv_error_check == true, fields will be checked
z # for improperly escaped double-quotes.
z
z class Array
z def to_csv
z ",".is_fs if $csv_fs.nil?
z s = ''
z self.map { |item|
z str = item.to_s
z # Quote the string if it contains the field-separator or
z # a " or a newline, or if it has leading or
z # trailing whitespace.
z if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
z str = '"' + str.gsub( /"/, '""' ) + '"'
z end
z str
z }.join($csv_fs)
z end
z def unescape
z if $csv_error_check
z self.map{ |x|
z # Check for improperly escaped double-quotes.
z raise "Bad field: #{x}" if x.gsub(/""/,'').index('"')
z x.gsub( /""/, '"' ) }
z else
z self.map{|x| x.gsub( /""/, '"' ) }
z end
z end
z end
z
z class String
z # Set regexp for parse_string.
z # self is the field-separator, which must be
z # a single character.
z def is_fs
z $csv_fs = self
z if "^" == $csv_fs
z fs = "\\^"
z else
z fs = $csv_fs
z end
z $csv_re = \
z ## Assumes embedded quotes are escaped as "".
z %r{ \s*
z (?:
z "( [^"]* (?: "" [^"]* )* )" |
z ( .*? )
z )
z \s*
z [#{fs}]
z }mx
z end
z
z def parse_string
z ",".is_fs if $csv_fs.nil?
z (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
z end
z
z end
z
z def get_rec( file )
z $csv_s = ""
z begin
z if file.eof?
z raise "The csv file is malformed." if $csv_s.size>0
z return nil
z end
z $csv_s += file.gets
z end until $csv_s.count( '"' ) % 2 == 0
z $csv_s.chomp!
z $csv_s.parse_string
z end
z
z $csv_error_check = true
z
z while rec = get_rec( ARGF )
z puts "----------------"
z puts $csv_s
z p rec
z puts rec.to_csv
z end
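
The record-reading step in get_rec relies on one observation: a record is complete once it contains an even number of double-quotes, because any newline seen while the count is odd must be inside a quoted field. A minimal sketch of that technique in isolation, using a StringIO in place of a file (the name `next_record` is hypothetical):

```ruby
require "stringio"

# Return the next logical record (without its line terminator), or nil at EOF.
def next_record(io)
  buffer = +""
  until io.eof?
    buffer << io.gets
    # An even quote count means every quoted field has been closed.
    return buffer.chomp if buffer.count('"').even?
  end
  # Input ended in the middle of a quoted field.
  raise "The csv file is malformed." unless buffer.empty?
  nil
end

io = StringIO.new(%(a,"line one\nline two",b\nc,d\n))
while (record = next_record(io))
  p record
end
```

Like the original, this counts quotes without looking at where they occur, so it detects an unterminated quoted field but not every kind of malformed record.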
 
William James

NAKAMURA, Hiroshi wrote:
#    1       2        3        4       5       6     7     8
# +------+-------+---------+-------+--------+------+----+------+
# | foo  | "foo" | foo,bar |  ""   |(empty) |(null)| \r | \r\n |
# +------+-------+---------+-------+--------+------+----+------+
# | NaHi | "Na"  |  Na,Hi  | \r.\n | \r\n\n |  "   | \n | \r\n |
# +------+-------+---------+-------+--------+------+----+------+
#

foo  | "foo" | foo,bar | ""    |        |   | \r | \r\n |
NaHi | "Na"  | Na,Hi   | \r.\n | \r\n\n | " | \n | \r\n |

is what my program produces. The only difference is that
I recognize fields that are empty strings, but not NULL fields;
i.e., these two csv records are equivalent:

foo,"",bar
foo,,bar

To produce the above output, I wrote the test string to a file
and ran this:

z recs = []
z while rec = get_rec( ARGF )
z recs << rec.map { |field|
z field.gsub!( /\r|\n/ ) {|s| ("\r"==s) ? '\r' : '\n' }
z field
z }
z end
z recs.each_with_index { |rec,ri|
z rec.each_with_index { |field,fi|
z width = [field.size, recs[(ri-1).abs][fi].size ].max
z printf "%-#{width}s | ", field
z }
z puts ""
z }
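
The equivalence of foo,"",bar and foo,,bar mentioned earlier in this post follows from the shape of the field regexp: a quoted empty field and an unquoted empty field both end up as an empty capture. A minimal sketch using the same pattern with a fixed comma separator (the wrapper `parse_fields` is a hypothetical stand-in for parse_string):

```ruby
def parse_fields(line)
  # Same structure as $csv_re: a quoted field with "" escapes, or a lazy
  # unquoted field, surrounded by optional whitespace and ended by a comma.
  field_re = %r{ \s*
    (?: "( [^"]* (?: "" [^"]* )* )" | ( .*? ) )
    \s* ,
  }mx
  # Append a separator so the last field is terminated too, then unescape "".
  (line + ",").scan(field_re).flatten.compact.map { |f| f.gsub('""', '"') }
end

p parse_fields('foo,"",bar')
p parse_fields('foo,,bar')
```

Both calls yield the same three-element array, which is why a NULL field cannot be told apart from an empty string with this approach.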
 
William James

For speed fanatics only.

I racked my brain trying to come up with a way to make
this faster. The best method I could find produced
only a modest 14% speedup with my test data.


z ## Read, parse, and create csv records.
z ## 2005-02-03
z ## Added a faster mode.
z
z # The program conforms to the csv specification at this site:
z # http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
z # The only extra is that you can change the field-separator.
z # For a field-separator other than a comma, for example
z # a semicolon:
z # ";".is_fs
z #
z # After a record has been read and parsed,
z # $csv_s contains the record in raw string format.
z #
z # If $csv_error_check == true, fields will be checked
z # for improperly escaped double-quotes.
z #
z # If $csv_fast == true, a slightly faster parser will be used.
z # Differences: the csv file cannot contain 1.chr;
z # an empty line will be parsed to [] instead of [""].
z
z class Array
z def to_csv
z ",".is_fs if $csv_fs.nil?
z s = ''
z self.map { |item|
z str = item.to_s
z # Quote the string if it contains the field-separator or
z # a " or a newline, or if it has leading or
z # trailing whitespace.
z if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
z str = '"' + str.gsub( /"/, '""' ) + '"'
z end
z str
z }.join($csv_fs)
z end
z def unescape
z self.map{|x| x.gsub( /""/, '"' ) }
z end
z end
z
z class String
z # Set regexp for parse_string.
z # self is the field-separator, which must be
z # a single character.
z def is_fs
z $csv_fs = self
z if "^" == $csv_fs
z fs = "\\^"
z else
z fs = $csv_fs
z end
z $csv_re = \
z ## Assumes embedded quotes are escaped as "".
z %r{ \s*
z (?:
z "( [^"]* (?: "" [^"]* )* )" |
z ( .*? )
z )
z \s*
z [#{fs}]
z }mx
z end
z
z def parse_string
z ",".is_fs if $csv_fs.nil?
z
z if $csv_fast
z
z # Place 1.chr after each field;
z # unescape quotes;
z # make the array.
z (self + $csv_fs).gsub( $csv_re, '\1\2'+"\1" )\
z .gsub( /""/, '"' )\
z [0..-2].split( "\1", -1 )
z
z else
z (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
z end
z end
z
z end
z
z
z def get_rec( file )
z $csv_s = ""
z begin
z if file.eof?
z raise "The csv file is malformed." if $csv_s.size>0
z return nil
z end
z $csv_s += file.gets
z end until $csv_s.count( '"' ) % 2 == 0
z $csv_s.chomp!
z $csv_s.parse_string
z end
z
z
z $csv_fast = true
z
z while rec = get_rec( ARGF )
z puts "----------------"
z puts $csv_s
z p rec
z puts rec.to_csv
z end
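
The quoting rule used by to_csv in the program above can be looked at in isolation: a field is quoted only if it contains the separator, a double-quote, or a newline, or if it starts or ends with whitespace. A minimal sketch with hypothetical helper names `csv_field` and `csv_line`, which are not part of the posted program:

```ruby
# Quote a single value if needed, doubling any embedded double-quotes.
def csv_field(value, sep = ",")
  str = value.to_s
  needs_quotes = str.include?(sep) || str.match?(/\A\s|"|\n|\s\z/)
  needs_quotes ? %("#{str.gsub('"', '""')}") : str
end

# Join the fields of one record with the separator.
def csv_line(fields, sep = ",")
  fields.map { |f| csv_field(f, sep) }.join(sep)
end

puts csv_line(["foo", "foo,bar", 'say "hi"', " pad", ""])
```

Because an empty string is written without quotes, the output does not preserve the difference between "" and an empty field, which matches the behaviour described in this thread.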
 
William James

The fast mode now parses an empty line exactly as the
slow mode does.

z ## Read, parse, and create csv records.
z ## 2005-02-04
z ## Added a slightly faster mode.
z
z # The program conforms to the csv specification at this site:
z # http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
z # The only extra is that you can change the field-separator.
z # For a field-separator other than a comma, for example
z # a semicolon:
z # ";".is_fs
z #
z # After a record has been read and parsed,
z # $csv_s contains the record in raw string format.
z #
z # If $csv_error_check == true, fields will be checked
z # for improperly escaped double-quotes.
z #
z # If $csv_fast == true, a slightly faster parser will be used.
z # Limitation: the csv file cannot contain 1.chr.
z
z class Array
z def to_csv
z ",".is_fs if $csv_fs.nil?
z s = ''
z self.map { |item|
z str = item.to_s
z # Quote the string if it contains the field-separator or
z # a " or a newline, or if it has leading or
z # trailing whitespace.
z if str.index($csv_fs) or /^\s|"|\n|\s$/.match(str)
z str = '"' + str.gsub( /"/, '""' ) + '"'
z end
z str
z }.join($csv_fs)
z end
z def unescape
z if $csv_error_check
z self.map{ |x|
z # Check for improperly escaped double-quotes.
z if x.gsub(/""/,'').index('"')
z raise "Bad quotes in field: #{x}"
z end
z x.gsub( /""/, '"' ) }
z else
z self.map{|x| x.gsub( /""/, '"' ) }
z end
z end
z end
z
z class String
z # Set regexp for parse_string.
z # self is the field-separator, which must be
z # a single character.
z def is_fs
z $csv_fs = self
z if "^" == $csv_fs
z fs = "\\^"
z else
z fs = $csv_fs
z end
z $csv_re = \
z ## Assumes embedded quotes are escaped as "".
z %r{ \s*
z (?:
z "( [^"]* (?: "" [^"]* )* )" |
z ( .*? )
z )
z \s*
z [#{fs}]
z }mx
z end
z
z def parse_string
z ",".is_fs if $csv_fs.nil?
z if $csv_fast
z
z # Unquote fields, remove field-separators, and
z # place 1.chr after each field.
z str = (self + $csv_fs).gsub($csv_re, '\1\2' + 1.chr )
z # Check for improperly escaped double-quotes.
z if $csv_error_check and str.gsub(/""/,'').index('"')
z raise "Bad quotes in csv record."
z end
z # Unescape quotes;
z # make the array.
z str.gsub( /""/, '"' )\
z .split( 1.chr, -1 )[0..-2]
z
z else
z (self + $csv_fs).scan( $csv_re ).flatten.compact.unescape
z end
z end
z end
z
z def get_rec( file )
z $csv_s = ""
z begin
z if file.eof?
z raise "The csv file is malformed." if $csv_s.size>0
z return nil
z end
z $csv_s += file.gets
z end until $csv_s.count( '"' ) % 2 == 0
z $csv_s.chomp!
z $csv_s.parse_string
z end
z
z # $csv_error_check = true
z # $csv_fast = true
z
z while rec = get_rec( ARGF )
z puts "----------------"
z puts $csv_s
z p rec
z puts rec.to_csv
z end
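
The fast mode replaces scan/flatten/compact with one gsub and one split: every field match is rewritten as its content followed by a marker character, and the resulting string is split on that marker. A minimal sketch of the same idea with a fixed comma separator (`fast_parse` is a hypothetical name, and the input is assumed never to contain 1.chr):

```ruby
def fast_parse(line)
  marker = 1.chr # sentinel; assumed never to occur in the data
  field_re = %r{ \s* (?: "( [^"]* (?: "" [^"]* )* )" | ( .*? ) ) \s* , }mx
  # Rewrite each "field," as "field<marker>"; an unmatched group expands to "".
  marked = (line + ",").gsub(field_re, '\1\2' + marker)
  # Unescape doubled quotes, split on the marker, and drop the empty string
  # that follows the final marker.
  marked.gsub('""', '"').split(marker, -1)[0..-2]
end

p fast_parse('a,"b,c",d')
p fast_parse('x,"say ""hi""",')
p fast_parse('')
```

Passing -1 to split keeps trailing empty fields, and dropping the last element afterwards is what makes an empty line parse to [""] rather than [].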
 
