grep a csv?

William James · Aug 16, 2007

Michael said:

M. Edward (Ed) Borasky said:

Michael Linfield wrote:
### this sadly only returned an output of => []
any ideas?
Thanks!
OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows.
Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limit smaller than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in.
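
For what it's worth, the grep one-liner has a fairly direct Ruby counterpart that streams the file rather than reading it all at once. This is only a minimal sketch: it assumes every record sits on a single physical line, and the helper name `grep_file` and the sample rows are made up for illustration.

```ruby
require 'tempfile'

# Copy every line of in_path that matches pattern into out_path,
# roughly what `grep blah1 huge-file.csv > temp-file.csv` does.
# File.foreach yields one line at a time, so the whole file is
# never held in memory.
def grep_file(in_path, out_path, pattern)
  matched = 0
  File.open(out_path, 'w') do |out|
    File.foreach(in_path) do |line|
      next unless pattern.match?(line)
      out << line
      matched += 1
    end
  end
  matched
end

# Hypothetical sample data, written to temp files for the demo.
src = Tempfile.new(['huge-file', '.csv'])
src.write("1,blah1,x\n2,other,y\n3,blah1,z\n")
src.close
dst = Tempfile.new(['temp-file', '.csv'])
dst.close

count  = grep_file(src.path, dst.path, /blah1/)
result = File.read(dst.path)
```

Like grep itself, this works on physical lines, so it says nothing about quoted fields that span lines.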

lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, with no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in fact
this will be used to process more than one data file, which I hope to
set up with optparse so that a command-line arg directs the program to
the file. Also, for the meantime I don't want to throw it on any
database... avoiding DBI for now. But an idea flew through my
head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) #thanks chris

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar
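
To make the failure concrete, here is a small sketch comparing line-based grepping with Ruby's stdlib CSV parser on that exact row. It assumes the `csv` library is available (it ships with Ruby, as a bundled gem in recent versions); the variable names are illustrative only.

```ruby
require 'csv'

# One logical CSV record whose third field contains a newline.
data = "1,2,\"foo\nBlah1\",bar\n"

# Line-based reading sees two physical lines, so grep returns only
# the second half of the record.
line_hits = data.lines.grep(/Blah1/)

# The CSV parser sees a single record with four fields.
rows = CSV.parse(data)
row_hits = rows.select { |row| row.any? { |field| field.to_s =~ /Blah1/ } }
```

Here `line_hits` holds a fragment of the row, while `row_hits` holds the complete record.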

I think that this can be handled easily by this approach:
to extract a record from the csv file, continue reading lines
until the number of double quotes in the record is even.
Something like

record = ""
begin
  record << gets.chomp
end until record.count( '"' ) % 2 == 0

William James · Aug 16, 2007

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it's a CSV-relevant question:

Another danger with not using CSV packages (as I've learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I'm a fairly inexperienced programmer, and not just in
Ruby, and I haven't yet figured out an elegant way to break this down.

For example:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return
["foo","bar","\"foo","bar\""], an array of size four. What's an
efficient, elegant way of gathering quoted columns?
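
One option, if using a CSV package is acceptable after all, is the stdlib `CSV.parse_line`. A short sketch contrasting it with the naive split, assuming the `csv` library is installed; the second sample line is made up to show empty fields and doubled quotes.

```ruby
require 'csv'

line = 'foo,bar,"foo,bar"'

# Naive split cuts inside the quoted field and leaves the quotes in place.
naive = line.split(/,/)

# CSV.parse_line honours the quoting and strips the quote characters.
fields = CSV.parse_line(line)

# It also copes with empty fields (returned as nil) and doubled quotes.
tricky = CSV.parse_line('a,,"say ""hi"""')
```

With these inputs `naive` has four elements and `fields` has three.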

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That breaks at least for empty fields, fields with newlines, and fields
with '"' in them.

I think that this will work correctly with any complete csv record.

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not [], just as
      ## ",".csv is ["",""].
      if [] == ary
        [""]
      else
        ary
      end
    end
  end
end

William James · Aug 16, 2007

I think that this can be handled easily by this approach:
to extract a record from the csv file, continue reading lines
until the number of double quotes in the record is even.
Something like

record = ""
begin
  record << gets.chomp
end until record.count( '"' ) % 2 == 0

The "chomp" is a mistake.

record = ""
begin
  record << gets
end until record.count( '"' ) % 2 == 0
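
As a rough illustration of how that loop could be wrapped up for grepping whole records, here is a sketch that reads from any IO object. The method name `each_record` and the sample data are hypothetical, and it relies on the same even-quote-count rule, so it assumes well-formed CSV where embedded quotes are doubled.

```ruby
require 'stringio'

# Yield one logical CSV record at a time: keep appending physical lines
# (newlines included, no chomp) until the accumulated text contains an
# even number of double quotes.
def each_record(io)
  return enum_for(:each_record, io) unless block_given?

  record = String.new
  io.each_line do |line|
    record << line
    if record.count('"').even?
      yield record
      record = String.new
    end
  end
  # A leftover here means an unbalanced quote at end of input.
  yield record unless record.empty?
end

# Hypothetical sample: the middle record spans two physical lines.
io = StringIO.new("a,b,c\n1,2,\"foo\nBlah1\",bar\nx,y,z\n")
records = each_record(io).to_a
hits = records.grep(/Blah1/)
```

Because the newline is kept, the multi-line record comes back intact, and `hits` contains the full row rather than just its second line.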

Michael Linfield · Aug 16, 2007

From: rio4ruby [mailto:[email protected]]
# require 'rio'
# rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line,m|
# rio(m) + '.csv' << line + $/
# end

simply amazing. btw, how does rio handle big files, does it load them
whole in memory?

thanks for rio.
kind regards -botp

It seems things have been amped up a few levels of complication since my
first few posts, lol. The quote above might seem like the cleanest way to
do this; however, if I use this method... I'll still need the commas,
because when you take a CSV and put it in simple text, the commas are what
separate the columns. So maybe it should look something like this?

require 'rio'

rio('filename.csv').chomp.lines(/Blah1/) do |line,m|
  rio(m) + '.csv' << line + $/
end
###

Peña, Botp · Aug 17, 2007

From: rio4ruby [mailto:[email protected]]
# > simply amazing. btw, how does rio handle big files, does it
# load them whole in memory?
# Never. Examples that assume I have a file small enough to load into
# memory irritate me.

now that is cool indeed.
thanks for rio -botp

rio4ruby · Aug 17, 2007

You are right, that was a poorly thought out regular expression.
One could also use Rio's csv mode (which uses the stdlib csv):

rio('filename.csv').chomp.csv.records(/Blah\w*/) do |rec,m|
  rio(m.to_s + '.csv').csv << rec
end

But this also is definitely NOT a robust solution.
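
For comparison, a similar split-by-match can be sketched with only the stdlib CSV, without Rio. This is not the Rio implementation, just an approximation under stated assumptions: the helper name `split_by_match` and the sample data are invented, matching is done per field rather than per record, and the matched text is used directly as a file name, which is only safe with a pattern like `\w*`.

```ruby
require 'csv'
require 'tmpdir'

# For each record that has a field matching pattern, append the record to
# "<matched text>.csv" in out_dir. CSV.foreach reads one record at a time,
# so the input is never loaded whole.
def split_by_match(in_path, out_dir, pattern)
  CSV.foreach(in_path) do |row|
    # First matched substring among the non-nil fields, if any.
    m = row.compact.map { |field| field[pattern] }.compact.first
    next unless m
    CSV.open(File.join(out_dir, "#{m}.csv"), 'a') { |csv| csv << row }
  end
end

# Hypothetical sample input in a temporary directory.
dir = Dir.mktmpdir
input = File.join(dir, 'filename.csv')
File.write(input, "1,Blah1,\"x,y\"\n2,Other,z\n3,Blah2,w\n4,Blah1,v\n")

split_by_match(input, dir, /Blah\w*/)
blah1 = File.read(File.join(dir, 'Blah1.csv'))
blah2 = File.read(File.join(dir, 'Blah2.csv'))
```

Writing through `CSV.open` re-quotes fields that contain commas, so the output files stay valid CSV.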
