FasterCSV parsing issues

  • Thread starter Jeremy Woertink

Jeremy Woertink

I'm using FasterCSV to do an import into my DB, and the CSV file
contains European words. I have French, Italian, and German words which
contain accents and such. When I try the import it throws a
FasterCSV::MalformedCSV error, but if I remove just the letters with
accents on them, it will upload just fine.

Here is a sample row:

Universal,ID,Kir,"Commonly, white wine with Cassis. Traditionally, the
cocktail kir (also known as vin blanc cassis in French) is made with
Aligoté. Kir Royal is made with Champagne instead of Aligoté."

Notice the 2 "e" with accents on them. I can remove these and it's fine.
I'm assuming this is an encoding issue. The CSV file could be created by
any number of people in any number of different locations using any
number of programs. Do I need to do something like use Iconv to convert
to a standard encoding first, then upload?


Thanks

~Jeremy


Nathaniel Smith

I've had similar issues recently, and they are due to character
encodings. Something like Iconv will probably be necessary to convert
the files to a standard encoding.
 

Jeremy Woertink

Nathaniel Smith wrote in post #965441:
I've had similar issues recently, and they are due to character
encodings. Something like Iconv will probably be necessary to convert
the files to a standard encoding.

I've never actually used Iconv before, but I was just reading
http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
and I did a test. I converted from ISO8859-1 to UTF8, and that actually
changes the characters, so it changes the meaning of the words. Now,
this is assuming that the CSV files I'm getting are all ISO8859-1
encoded (which I think they are).

I tried a test to just tell FasterCSV to read it as 'ISO8859-1' using the
first 3 lines of this CSV file:


Universal,ID,Kir,"Commonly, white wine with Cassis. Traditionally, the
cocktail kir (also known as vin blanc cassis in French) is made with
AligotÈ. Kir Royal is made with Champagne instead of AligotÈ."
Universal,GRAPE,MourvËdre / Monastrell / Mataro,"Grape: MourvËdre,
MatarÛ, or Monastrell is variety of grape used to make both strong, dark
red wines and rosÈs. It is grown in many regions around the world.
Universal,Tasting,Leafy,Specific aroma/taste descriptor: Having the
smell or taste sensation of Leaves.


ruby-1.8.7-p302 > file = File.open(File.join(Rails.root, 'public', 'sample.csv'))
=> #<File:/Users/jeremywoertink/Sites/winovations/public/sample.csv>
ruby-1.8.7-p302 > csv = FasterCSV.new(file, :encoding => 'ISO8859-1')
=> <#FasterCSV io_type:File io_path:"/Users/jeremywoertink/Sites/winovations/public/sample.csv" lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" encoding:"ISO8859-1">
ruby-1.8.7-p302 > csv.each { |row| puts row }
Universal
ID
Kir
Commonly, white wine with Cassis. Traditionally, the cocktail kir (also known as vin blanc cassis in French) is made with AligotÈ. Kir Royal is made with Champagne instead of AligotÈ.
FasterCSV::MalformedCSVError: Unclosed quoted field on line 2.
from /Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1663:in `shift'
from /Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1581:in `loop'
from /Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1581:in `shift'
from /Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1526:in `each'
from (irb):28

I'm not seeing any unclosed quotes... Also, I thought that when you
iterate through the returned csv file, it gives you rows, but this one
seems to be giving me columns on the first row, then dies when it hits
the second row.


James Edward Gray II

I'm using FasterCSV to do an import into my DB, and the CSV file
contains European words. I have French, Italian, and German words which
contain accents and such. When I try the import it throws a
FasterCSV::MalformedCSV error, but if I remove just the letters with
accents on them, it will upload just fine.
The CSV file could be created by
any number of people in any number of different locations using any
number of programs. Do I need to do something like use Iconv to convert
to a standard encoding first, then upload?

Yes, that's exactly the strategy you need to adopt.

James Edward Gray II
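
A minimal sketch of the "convert to a standard encoding first" strategy. It uses `String#force_encoding` and `String#encode`, which exist on Ruby 1.9 and later, rather than the Iconv library discussed in the thread. The helper name `to_utf8` is hypothetical, and the fallback rule (anything that is not valid UTF-8 is treated as ISO-8859-1) is an assumption that will not suit every file.

```ruby
# Hypothetical helper: normalise raw bytes of unknown encoding to UTF-8.
# Assumption: input that is not valid UTF-8 is treated as ISO-8859-1.
def to_utf8(bytes)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Relabel the bytes as Latin-1, then transcode them to UTF-8.
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1 = "Aligot\xE9".b      # "Aligoté" as ISO-8859-1 bytes
utf8   = "Aligot\xC3\xA9".b  # the same word already encoded as UTF-8

puts to_utf8(latin1)
puts to_utf8(utf8)
```

Both inputs should come out as the same UTF-8 string, which can then be handed to the CSV parser.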
 

James Edward Gray II

I've never actually used Iconv before, but I was just reading
http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
and I did a test. I converted from ISO8859-1 to UTF8, and that actually
changes the characters, so it changes the meaning of the words. Now,
this is assuming that the CSV files I'm getting are all ISO8859-1
encoded (which I think they are).

You probably want to hit the files with some encoding guessing script to
be sure.

I tried a test to just tell FasterCSV to read it as 'ISO8859-1' using the
first 3 lines of this CSV file:
ruby-1.8.7-p302 >

On Ruby 1.8.7, FasterCSV supports only four encodings (the same four
Ruby does) and Latin-1 (ISO-8859-1) isn't one of them. You need to
transcode the data to UTF-8 on the way in or use the standard CSV
library in Ruby 1.9 (which can parse Latin-1 directly).

James Edward Gray II
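
As a rough illustration of the Ruby 1.9+ route, the sketch below lets the standard CSV library read a Latin-1 file and hand back UTF-8 fields through the `encoding:` option of `CSV.foreach`. The temporary file and the shortened row are made up for the example; it assumes the real file is in fact ISO-8859-1.

```ruby
require "csv"
require "tempfile"

# Build a small ISO-8859-1 file so the sketch is self-contained.
# The row is shortened from the sample in the thread.
sample = Tempfile.new(["sample", ".csv"])
sample.binmode
sample.write(%(Universal,ID,Kir,"Made with Aligot\xE9."\n).b)
sample.close

rows = []
# "ISO-8859-1:UTF-8" means: the file is Latin-1, give me UTF-8 strings.
CSV.foreach(sample.path, encoding: "ISO-8859-1:UTF-8") do |row|
  rows << row
end

puts rows.first.last
sample.unlink
```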
 

Jeremy Woertink

Thanks for the info, James.

I've upgraded to Ruby 1.9.2 now, but I'm still running into weird
issues. How come I can only parse a file once?

ruby-1.9.2-p0 > file = File.open(File.join(Rails.root, 'public',
'sample.csv'))
=> #<File:/Users/jeremywoertink/Sites/winovations/public/sample.csv>
ruby-1.9.2-p0 > csv = CSV.new(file)
=> <#CSV io_type:File
io_path:"/Users/jeremywoertink/Sites/winovations/public/sample.csv"
encoding:ISO-8859-1 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"">
ruby-1.9.2-p0 > csv.each { |row| puts row[1] }
...
...
ruby-1.9.2-p0 > csv.each { |row| puts row[1] }
=> nil


Thanks,
~Jeremy
 

James Edward Gray II

I've upgraded to Ruby 1.9.2 now, but I'm still running into weird
issues. How come I can only parse a file once?

For the same reason you could only read from an IO object once: it's
tracking your position. You're now at the end. However, you could
"rewind" it:

csv = CSV.open(File.join(Rails.root, 'public', 'sample.csv'))
csv.each { |row| … }
csv.rewind
csv.each { |row| … }

Hope that helps.

James Edward Gray II
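
The position tracking described above can be seen in a small self-contained sketch. It parses an in-memory string instead of the `sample.csv` from the thread; the two data rows are made up from the thread's sample.

```ruby
require "csv"

data = "Universal,ID,Kir\nUniversal,Tasting,Leafy\n"
csv  = CSV.new(data)

first_pass  = csv.map { |row| row[1] }  # reads every row
second_pass = csv.map { |row| row[1] }  # reader is already at the end
csv.rewind                              # back to the start of the data
third_pass  = csv.map { |row| row[1] }  # reads every row again

p first_pass
p second_pass
p third_pass
```

The second pass yields nothing because the underlying IO is exhausted; only after `rewind` does iteration start over.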
 

Jeremy Woertink

Oh. I guess I don't spend enough time with IO stuff :p I wasn't aware of
that. Makes sense though!

Ok, sorry to throw all these out here, but I'm trying to understand this
whole thing :p

Ok, so In my sample.csv, I have 1481 lines (according to textmate). When
I print out the rows and line numbers in the console, it gets to line
1409 then stops and returns nil. There's no error or anything. Is there
a limitation, or would this be caused from a malformed csv file?
 

Jeremy Woertink

ok, actually... I think I get that last one. It's saying there's 1409
rows, not technically line numbers because there seems to be some
breaks.

duh.. Ok, now if I can just figure out this "Unclosed quoted field"
error and how to avoid it, I'll be good!

Thanks!
 

James Edward Gray II

Oh. I guess I don't spend enough time with IO stuff :p I wasn't aware of
that. Makes sense though!

Ok, sorry to throw all these out here, but I'm trying to understand this
whole thing :p

No worries.

Ok, so In my sample.csv, I have 1481 lines (according to textmate). When
I print out the rows and line numbers in the console, it gets to line
1409 then stops and returns nil. There's no error or anything. Is there
a limitation, or would this be caused from a malformed csv file?

It would probably be due to CSV content like:

one,"multi-line
two",three

TextMate would count that as two lines (it is) but it's only one row of
CSV data.

James Edward Gray II
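
The difference between physical lines and CSV rows can be checked directly. This sketch reuses the multi-line example above and adds one ordinary row (made up for the illustration).

```ruby
require "csv"

# The first record contains a quoted field with an embedded newline.
data = %(one,"multi-line\ntwo",three\nfour,five,six\n)

line_count = data.lines.count       # what a text editor would report
rows       = CSV.parse(data)
row_count  = rows.length            # what the CSV parser reports

puts "lines: #{line_count}, rows: #{row_count}"
p rows.first
```

A gap like 1481 lines versus 1409 rows would be consistent with a number of fields containing embedded line breaks.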
 

James Edward Gray II

Ok, now if I can just figure out this "Unclosed quoted field"
error and how to avoid it, I'll be good!

That most likely extends from some invalid CSV data.

James Edward Gray II
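
One way to narrow down invalid data is to parse row by row and rescue the parser's error, as sketched below with Ruby's standard CSV library (`CSV::MalformedCSVError`). The data is shortened from the sample posted earlier in the thread; note that in that sample, as posted, the quoted field starting with `"Grape:` on the second row has no closing quote, which on its own would trigger this error if the real file looks the same.

```ruby
require "csv"

# Shortened from the thread's sample: the quoted field on the
# second row is opened but never closed.
data = <<~CSV_DATA
  Universal,ID,Kir,"Commonly, white wine with Cassis."
  Universal,GRAPE,Mataro,"Grape: used to make strong red wines.
  Universal,Tasting,Leafy,Having the smell of leaves.
CSV_DATA

rows  = []
error = nil
begin
  # With a block, rows are yielded one at a time, so the rows
  # parsed before the failure are still collected.
  CSV.parse(data) { |row| rows << row }
rescue CSV::MalformedCSVError => e
  error = e
end

puts "parsed #{rows.length} row(s) before failing"
puts "parser said: #{error.message}" if error
```

The number of rows parsed before the failure points at the first record worth inspecting by hand.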
 

Brian Candler

James Edward Gray II wrote in post #965490:
On Ruby 1.8.7, FasterCSV supports only four encodings (the same four
Ruby does) and Latin-1 (ISO-8859-1) isn't one of them.

But binary (-Kn) is one of them, and that should be fine for ISO-8859-1,
shouldn't it?

OP, are you running on a Mac by any chance? Apple built Ruby for OSX
with a non-standard configuration so that $KCODE="UTF8" by default. Try
using:

ruby -e 'puts $KCODE'

If it says UTF8, then try running your script again with ruby -Kn
 

James Edward Gray II

James Edward Gray II wrote in post #965490:

But binary (-Kn) is one of them, and that should be fine for ISO-8859-1,
shouldn't it?

Ah, yes. Excellent point.

James Edward Gray II
 
