grep a csv?

William James

Michael said:
M. Edward (Ed) Borasky said:
Michael Linfield wrote:
### this sadly only returned an output of => []
any ideas?
Thanks!
OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)
Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database's
column count limit isn't smaller than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)
lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, no more than 20 columns.
and I wanna avoid using the command line to do this, because yes, in fact
this will be used to process more than one datafile, which I hope to
set up in optparse to have a command line arg that directs the prog to
the file. also, for the meantime I wanted to not have to throw it on any
database...avoiding DBI for the meanwhile. But an idea flew through my
head a few minutes ago....what if I did this --
res = []
# concat keeps res flat; << would push the whole result in as one nested array
res.concat(File.readlines('filename.csv').grep(/Blah1/)) #thanks chris
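
A minimal sketch of the plan described above: optparse supplies the file name and the lines are filtered in plain Ruby. The helper names `parse_args` and `grep_lines`, the `-f`/`-p` flags, and the script name in the banner are made up for illustration; only `OptionParser` and `File.foreach` come from Ruby itself. This is line-based, so it shares the limits of the `File.readlines` idea.

```ruby
require 'optparse'
require 'tempfile'

# Hypothetical helper: read "-f FILE" and "-p PATTERN" from an argv-style array.
def parse_args(argv)
  options = { pattern: 'Blah1' }
  OptionParser.new do |opts|
    opts.banner = 'Usage: csvgrep.rb -f FILE [-p PATTERN]'
    opts.on('-f', '--file FILE', 'CSV file to search') { |v| options[:file] = v }
    opts.on('-p', '--pattern PATTERN', 'Regexp to look for') { |v| options[:pattern] = v }
  end.parse!(argv)
  options
end

# Hypothetical helper: return every line of the file that matches the regexp.
# File.foreach reads line by line instead of slurping the whole file first.
def grep_lines(path, regexp)
  File.foreach(path).grep(regexp)
end

# Demo on a throwaway file so the sketch runs on its own.
demo = Tempfile.new(['demo', '.csv'])
demo.write("1,Blah1,x\n2,other,y\n3,Blah1,z\n")
demo.close

opts = parse_args(['-f', demo.path, '-p', 'Blah1'])
res  = grep_lines(opts[:file], Regexp.new(opts[:pattern]))
puts res
```

In a real script the array passed to `parse_args` would be `ARGV`.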

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

I think that this can be handled easily by this approach:
to extract a record from the csv file, continue reading lines
until the number of double quotes in the record is even.
Something like

record = ""
begin
  record << gets.chomp
end until record.count( '"' ) % 2 == 0
 
William James

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it's a CSV-relevant question:
Another danger with not using CSV packages (as I've learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I'm a fairly inexperienced programmer, and not just in
Ruby, and I haven't yet figured out an elegant way to break this down.
For example:
foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo","bar","\"foo","bar\""],
an array of size four. What's an efficient, elegant way of
gathering quoted columns?

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That breaks at least for empty fields, fields with newlines, and fields
with '"' in them.

I think that this will work correctly with any complete csv record.


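class String
  # Split one complete CSV record into an array of field strings.
  def csv
    if include? '"'
      # Append a comma so every field, including the last, ends with one.
      # First alternative: quoted field ("" is an escaped quote);
      # second alternative: unquoted field.
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      # Leftover text after the last match means the record is malformed.
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not [], just as
      ## ",".csv is ["",""].
      if [] == ary
        [""]
      else
        ary
      end
    end
  end
end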
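
For comparison, here is a small sketch of the same job done with Ruby's csv library (part of the standard library; shipped as a bundled gem in newer Rubies). `CSV.parse_line` is a real call; the wrapper name `csv_fields` is made up for illustration. The wrapper turns the `nil` that the library returns for an empty unquoted field into `""`, so the output is an array of strings.

```ruby
require 'csv'

# Hypothetical wrapper around CSV.parse_line that always yields strings.
# CSV.parse_line returns nil for an empty unquoted field (and for an empty
# line), so both cases are normalised here.
def csv_fields(line)
  (CSV.parse_line(line) || []).map(&:to_s)
end

# A comma inside a quoted field stays inside one column.
p csv_fields('foo,bar,"foo,bar"')

# Doubled quotes inside a quoted field collapse to a single quote,
# and empty fields are kept.
p csv_fields('a,"say ""hi""",,""')
```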
 
William James

I think that this can be handled easily by this approach:
to extract a record from the csv file, continue reading lines
until the number of double quotes in the record is even.
Something like

record = ""
begin
  record << gets.chomp
end until record.count( '"' ) % 2 == 0

The "chomp" is a mistake.

record = ""
begin
  record << gets
end until record.count( '"' ) % 2 == 0
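
The even-quote-count idea can be sketched as a small runnable helper. The name `csv_records` is made up for illustration, and a `StringIO` stands in for the real file. Unlike the loop above, this version stops cleanly at end of input instead of appending `nil`. It only groups physical lines into records; it does not split fields.

```ruby
require 'stringio'

# Hypothetical helper: group physical lines into logical CSV records.
# A record is complete once it contains an even number of double quotes.
def csv_records(io)
  records = []
  record = String.new
  io.each_line do |line|
    record << line                        # keep the newline, no chomp
    next unless record.count('"').even?   # still inside a quoted field
    records << record
    record = String.new
  end
  records << record unless record.empty?  # unterminated quote at end of input
  records
end

data = "1,2,\"foo\nBlah1\",bar\n3,4,plain,baz\n"
records = csv_records(StringIO.new(data))
hits = records.grep(/Blah1/)
p hits
```

Here the first record spans two physical lines, and grep returns it whole.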
 
Michael Linfield

From: rio4ruby [mailto:[email protected]]
# require 'rio'
# rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line,m|
# rio(m) + '.csv' << line + $/
# end

simply amazing. btw, how does rio handle big files, does it load them
whole in memory?

thanks for rio.
kind regards -botp

it seems things have been amped up a few levels of complication since my
first few posts lol. The quote above might seem like the cleanest way to
do this, however if I use this method...I'll still need the commas,
because when you take a csv and put it in simple text, the commas are what
separate the columns. so maybe it should look something like this?

require 'rio'

rio('filename.csv').chomp.lines(/Blah1/) do |line,m|
rio(m) + '.csv' << line + $/
end
###
 
Peña, Botp

From: rio4ruby [mailto:[email protected]]
# > simply amazing. btw, how does rio handle big files, does it
# load them whole in memory?
# Never. Examples that assume I have a file small enough to load into
# memory irritate me. :)

now that is cool indeed.
thanks for rio -botp
 
rio4ruby

it seems things have been amped up a few levels of complication since my
first few posts lol. The quote above might seem like the cleanest way to
do this, however if I use this method...I'll still need the commas,
because when you take a csv and put it in simple text, the commas are what
separate the columns. so maybe it should look something like this?

require 'rio'

rio('filename.csv').chomp.lines(/Blah1/) do |line,m|
rio(m) + '.csv' << line + $/
end
###

You are right, that was a poorly thought out regular expression.
One could also use Rio's csv mode (which uses the stdlib csv):

rio('filename.csv').chomp.csv.records(/Blah\w*/) do |rec,m|
rio(m.to_s + '.csv').csv << rec
end

But this also is definitely NOT a robust solution.
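
For readers without Rio, roughly the same grouping can be sketched with the csv library alone. This is an approximation of the idea, not Rio's behaviour: the helper name `group_by_match` is made up, the data is an inline string rather than a file, and the output is built as strings with `CSV.generate` instead of being written to disk.

```ruby
require 'csv'

# Hypothetical helper: bucket parsed rows by the first field text that
# matches the pattern (for example "Blah1", "Blah2").
def group_by_match(csv_text, pattern)
  groups = Hash.new { |hash, key| hash[key] = [] }
  CSV.parse(csv_text) do |row|
    # String#[] with a regexp returns the matched text or nil.
    hit = row.compact.map { |field| field[pattern] }.compact.first
    groups[hit] << row if hit
  end
  groups
end

data = "1,Blah1,x\n2,other,y\n3,\"Blah2, quoted\",z\n4,Blah1,w\n"
groups = group_by_match(data, /Blah\w*/)

# Re-serialise each group; fields containing commas are quoted again.
groups.each do |name, rows|
  text = CSV.generate { |csv| rows.each { |row| csv << row } }
  puts "#{name}.csv:"
  puts text
end
```

Swapping `CSV.generate` for `CSV.open("#{name}.csv", 'w')` would write one file per match, and `CSV.foreach` would avoid holding the whole input in memory.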
 