NAKAMURA, Hiroshi
Hi,
William said:
> I did lift a very complex test string from it to use in testing
> my program. One of the fields in that csv string is defective;
> I don't know whether that was intentional or not:
>
>   "\r\n"\r\nNaHi,
>
> The " in the field isn't doubled, and the field doesn't end
> with a quote.
The second \r\n is a record separator. Here is the sample data from test_csv.rb:
# sample data
#
#   1      2       3         4       5        6      7    8
# +------+-------+---------+-------+--------+------+----+------+
# | foo  | "foo" | foo,bar | ""    |(empty) |(null)| \r | \r\n |
# +------+-------+---------+-------+--------+------+----+------+
# | NaHi | "Na"  | Na,Hi   | \r.\n | \r\n\n | "    | \n | \r\n |
# +------+-------+---------+-------+--------+------+----+------+
#
The table contains 2 records and each record has 8 fields.
> Incidentally, when my program converts that string to an array
> and then back to a csv string, it's not the same as
> the original string because ,"", is shortened to ,, .
In csv.rb, the string "" (0x22 0x22) means an empty string, and an empty
field (nothing between the separators) means NULL. I needed to distinguish
the two when I first wrote it.
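
The same distinction can be kept on the writing side, so that ,"", is not shortened to ,, in a round trip. A minimal sketch, assuming nil stands for NULL; to_csv_field and to_csv_line are hypothetical helpers written for this illustration, not csv.rb's own code:

```ruby
# Hypothetical writer, not part of csv.rb.
# nil (NULL) becomes nothing at all; an empty string becomes "".
def to_csv_field(value)
  return '' if value.nil?                     # NULL: nothing between separators
  text = value.to_s
  return '""' if text.empty?                  # empty string: explicit pair of quotes
  return text unless text.match?(/[",\r\n]/)  # plain text needs no quoting
  '"' + text.gsub('"', '""') + '"'            # quote, doubling embedded quotes
end

def to_csv_line(values)
  values.map { |value| to_csv_field(value) }.join(',')
end

puts to_csv_line(['foo', '', nil, 'foo,bar'])
# prints: foo,"",,"foo,bar"
```

With this rule the empty string and NULL produce different output, so a reader that follows the same convention can tell them apart again.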
And here are some scenarios you may be interested in.
irb(main):001:0> require 'csv'
=> true
irb(main):002:0> CSV.parse('"abc"def')
CSV::IllegalFormatError: CSV::IllegalFormatError
        from /usr/local/lib/ruby/1.9/csv.rb:587:in `get_row'
        from /usr/local/lib/ruby/1.9/csv.rb:536:in `each'
        from /usr/local/lib/ruby/1.9/csv.rb:107:in `collect'
        from /usr/local/lib/ruby/1.9/csv.rb:107:in `parse'
        from (irb):2
irb(main):003:0> CSV.parse('"abc"def', 'def')
=> [["abc", nil]]
irb(main):004:0> CSV.parse('"abc"def"ghi"', 'd', 'f')
=> [["abc", "e"], ["ghi"]]
irb(main):005:0> CSV.parse('aaabaaacaaabaa', 'ab', 'ac')
=> [["aa", "aa"], ["aa", "aa"]]
irb(main):006:0> quit
% echo foo,bar | ruby -rcsv -e 'CSV.parse(STDIN) { |row| p row }'
["foo", "bar"]
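
For data without quoted fields, the multi-character separators in irb line 005 above can be approximated with plain String#split. A minimal sketch; split_records is a hypothetical helper, not part of csv.rb, and it does not understand quoting at all:

```ruby
# Hypothetical helper: split on the record separator first, then split each
# record on the field separator. Handles unquoted data only.
def split_records(text, fs = ',', rs = "\n")
  # chomp(rs) drops one trailing record separator so it does not produce
  # an extra empty record; limit -1 keeps trailing empty fields.
  text.chomp(rs).split(rs, -1).map { |record| record.split(fs, -1) }
end

p split_records('aaabaaacaaabaa', 'ab', 'ac')
# prints: [["aa", "aa"], ["aa", "aa"]]
```

Unlike csv.rb, this returns "" rather than nil for an empty field and cannot handle a separator that appears inside a quoted field.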
Of course, I don't think everyone needs this "complexity" (and slowness).
A regexp-based approach is very useful, too. (I often do that.)
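
A regexp-based approach could look roughly like the following. This is only a sketch under stated assumptions: one record per call, comma as the field separator, "" as the escape for a quote inside a quoted field. FIELD and parse_line are hypothetical names chosen for this example. It keeps the csv.rb convention that "" is an empty string and an empty field is nil.

```ruby
# One field, anchored at the current position (\G), followed by a comma
# or the end of the string. Group 1: quoted body, group 2: unquoted body.
FIELD = /\G(?:"((?:[^"]|"")*)"|([^",]*))(,|\z)/

# Hypothetical helper: parse a single CSV record into an array of fields.
def parse_line(line)
  line = line.chomp
  fields = []
  pos = 0
  loop do
    m = FIELD.match(line, pos)
    raise ArgumentError, "malformed CSV near offset #{pos}" unless m
    fields << if m[1]
                m[1].gsub('""', '"')   # quoted field: undouble the quotes
              elsif m[2].empty?
                nil                    # unquoted and empty: NULL
              else
                m[2]
              end
    pos = m.end(0)
    break if m[3].empty?               # matched \z, so the record is finished
  end
  fields
end

p parse_line('foo,"foo,bar","",,"say ""hi"""')
# prints: ["foo", "foo,bar", "", nil, "say \"hi\""]
```

Input such as '"abc"def' raises ArgumentError here, because text directly after a closing quote matches neither alternative.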
Back to the original post of this thread: spending 110 sec. to parse
27000 CSV records seems too slow, even for a parser written in pure Ruby.
Can I have the CSV file? I would like to profile the parser with that data.
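
Until the real file is available, a rough baseline can be measured on synthetic data. The sketch below is an assumption-heavy stand-in: the 27000 generated lines are invented for illustration and have nothing to do with the original poster's file, and a plain split is used in place of a real CSV parser.

```ruby
# Synthetic stand-in for the real file, which is not available here.
lines = Array.new(27_000) { |i| "#{i},name#{i},value#{i}" }

# Time a naive split-based parse using the monotonic clock.
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
rows = lines.map { |line| line.split(',', -1) }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format('%d rows in %.3f s', rows.size, elapsed)
```

Swapping the block passed to map for another parser gives a like-for-like comparison on the same input.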
Regards,
// NaHi