IO#Foreach -- Max line length

Tristin Davis

I'm trying to emulate, in Ruby 1.8.6, the new feature in 1.9 that lets you
specify the maximum length of a line read. Can anyone help?
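
For readers on Ruby 1.9 or later, the behaviour being emulated looks roughly like this. It is only a sketch: the helper name limited_lines and the sample file contents are made up for illustration, and it assumes single-byte characters.

```ruby
require 'tempfile'

# Hypothetical helper: collect the pieces File.foreach yields when it is
# given a byte limit instead of (or in addition to) a separator.
def limited_lines(path, limit)
  pieces = []
  # In Ruby 1.9+, an Integer second argument is treated as the limit:
  # each piece ends at the line separator or after `limit` bytes,
  # whichever comes first.
  File.foreach(path, limit) { |piece| pieces << piece }
  pieces
end

Tempfile.create('limit_demo') do |f|
  f.write("abcdefgh\nxy\n")
  f.flush
  p limited_lines(f.path, 4)
end
```

No piece is longer than the limit, so a very long line never has to fit in memory at once.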
 
7stud --

Tristin said:
I'm trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end
 
Tristin Davis

But by the time you actually get count, isn't the line already read into
memory? So if the line is 7 gigabytes, it'll probably crash the system.
 
Arlen Cuss

Hi,

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end


Not quite the solution. This reads a number of lines, as opposed to
limiting the length of a single line read.

Arlen
 
7stud --

Tristin said:
But by the time you actually get count, isn't the line already read into
memory? So if the line is 7 gigabytes, it'll probably crash the system.

Is this what you are looking for:

max_bytes = 30
text = IO.read('data.txt', max_bytes)
puts text
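
The snippet above only ever returns the first max_bytes bytes. File.read also accepts an offset after the length, so the rest of the file can be visited in bounded windows. A rough sketch; the helper name byte_windows and the sample data are assumptions for illustration:

```ruby
require 'tempfile'

# Hypothetical helper: read a file as a series of windows of at most
# max_bytes bytes each, using the length and offset arguments of File.read.
def byte_windows(path, max_bytes)
  windows = []
  offset = 0
  # The loop ends as soon as nothing is left to read at the given offset.
  while (chunk = File.read(path, max_bytes, offset))
    break if chunk.empty?
    windows << chunk
    offset += chunk.bytesize
  end
  windows
end

Tempfile.create('windows_demo') do |f|
  f.write("one\ntwo\nthree\n")
  f.flush
  p byte_windows(f.path, 5)
end
```

This reopens the file for every window, so keeping one handle open and calling read(max_bytes) on it repeatedly is likely cheaper. Like the original snippet, it pays no attention to line breaks.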
 
Peña, Botp

On Behalf Of Tristin Davis:
# But by the time you actually get count, isn't the line
# already read in
# memory.  So if the line is 7 gigabytes, it'll probably crash
# the system.

read will accept arg on how many bytes to read.

so how about,

irb(main):040:0> File.open "test.rb" do |f| f.read end
=> "a=(1..2)\n\na\nputs a\n\nputs a.each{|x| puts x}"

irb(main):041:0> File.open "test.rb" do |f| f.read 2 end
=> "a="

irb(main):042:0> File.open "test.rb" do |f| f.read 2; f.read 2 end
=> "(1"

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end
"a="
"(1"
".."
"2)"
"\n\n"
"a\n"
"pu"
"ts"
" a"
"\n\n"
"pu"
"ts"
" a"
".e"
"ac"
"h{"
"|x"
"| "
"pu"
"ts"
" x"
"}"
=> nil

kind regards -botp
 
Adam Shelly

On Behalf Of Tristin Davis:
# But by the time you actually get count, isn't the line
# already read in
# memory. So if the line is 7 gigabytes, it'll probably crash
# the system.

read will accept arg on how many bytes to read.

so how about,
...
irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end

That solution essentially ignores linebreaks.
If you want to read up to a linebreak or N characters, whichever comes
first, you could use one of these:

------
class IO
  # Read by characters: yield up to linelen characters, or up to and
  # including the line separator, whichever comes first.
  def for_eachA(linelen)
    c = 0
    while (c)
      buf = ''
      linelen.times {
        break unless c = getc
        buf << c
        break if c.chr == $/
      }
      yield buf
    end
  end

  # Read by lines: read up to linelen characters at a time and yield
  # each complete line found in the buffer.
  def for_eachB(linelen)
    re = Regexp.new(".*?#{Regexp.escape($/)}")
    buf = ''
    while (line = read(linelen - buf.length))
      buf = (buf + line).gsub(re) { |l| yield l; '' }
      if buf.length == linelen
        yield buf
        buf = ''
      end
    end
    yield buf
  end
end

File.open("foreach.rb") do |f|
f.for_eachA(10){|l| p l}
end

File.open("foreach.rb") do |f|
f.for_eachB(10){|l| p l}
end
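
The code above targets Ruby 1.8, where getc returns an integer. On 1.9 and later getc returns a one-character String, so the character-by-character idea could be sketched as below. The method name each_capped_line is made up for illustration, and it takes the IO as an argument instead of reopening class IO:

```ruby
require 'stringio'

# Hypothetical port of the character-by-character approach to Ruby 1.9+,
# where getc returns a one-character String (or nil at end of input).
def each_capped_line(io, max_chars, sep = "\n")
  return enum_for(:each_capped_line, io, max_chars, sep) unless block_given?
  buf = ''.dup
  while (c = io.getc)
    buf << c
    # Hand over the buffer at a separator or once it reaches the cap.
    if c == sep || buf.length >= max_chars
      yield buf
      buf = ''.dup
    end
  end
  # Whatever is left is a final piece with no trailing separator.
  yield buf unless buf.empty?
end

p each_capped_line(StringIO.new("abcdefg\nhi\n"), 5).to_a
# prints ["abcde", "fg\n", "hi\n"]
```

Unlike for_eachA, this version never yields an empty string at end of input. It assumes a single-character separator.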
 
Tristin Davis

Thanks for the ideas, Adam. I thought someone might be able to use this, so
I figured I'd post it. It processed about 675,000 records of 1100+ bytes each
in an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance then have at it. :)


module Util

  def too_large?(buffer, max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf = ''
record = 1
frequency = 100

f = File.open(file, 'r')

# Ruby 1.8: getc returns the next byte as an integer, or nil at end of file.
while c = f.getc
  buf << c

  if too_large?(buf, max=102400)
    p "record #{record} is too long, skipping to end"
    while (x = f.getc)
      if x.chr == $/
        buf = ''
        record += 1
        p "At record #{record}" if( (record % frequency ) == 0 )
        break
      end
    end
  end

  if c.chr == $/
    record += 1
    print "At record #{record}" if( (record % frequency ) == 0 )
    buf = ''
  end
end

# If we still have something in the buffer, then it is probably the last record.
unless buf.empty?
  #record += 1
  p "Last record is:" + buf
end

f.close
p record
 
7stud --

Tristin said:
Thanks for the ideas Adam. I thought someone might be able to use it so
I figured i'd post it. It processed about 675,000 1100+ byte records in
an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance then have at it. :)


module Util

def too_large?(buffer,max=10)
return true if buffer.length >= max
false
end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf=''
record = 1
frequency = 100

f = File.open(file,'r')

while c=f.getc
if buf.length < max #(but what if you find a '\n' before max?)
buf << c
else
buf = ''
f.gets
end
 
Tristin Davis

That's what the 2nd if statement is; for catching the delimiter if the
buffer isn't too large. I can't use gets b/c I may expend all the
memory before the actual line is read. I'm reading variable length
records, but some of them are bad data and exceed a max length of 100k.
That's what the script is scanning for. :)
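
On Ruby 1.9 or later, gets accepts both a separator and a byte limit, which allows the same scan without ever holding more than the limit in memory. A minimal sketch under stated assumptions: the helper name oversized_records is hypothetical, and here max counts the delimiter as part of the record, which differs slightly from too_large? above.

```ruby
require 'stringio'

# Hypothetical helper: return the 1-based numbers of records whose length,
# delimiter included, exceeds max bytes. At most max bytes are held at once.
def oversized_records(io, max, sep = "\n")
  bad = []
  record = 0
  while (piece = io.gets(sep, max))
    record += 1
    # A piece ending in the delimiter, or the final unterminated piece,
    # fits within the limit.
    next if piece.end_with?(sep) || io.eof?
    # Otherwise the limit was hit before the delimiter: the record is too long.
    bad << record
    # Discard the rest of the record in bounded pieces.
    while (rest = io.gets(sep, max))
      break if rest.end_with?(sep)
    end
  end
  bad
end

data = "short\n" + ("x" * 50) + "\n" + "ok\n"
p oversized_records(StringIO.new(data), 10)
# prints [2]
```

The same method should work on an open File, since it only relies on gets and eof?.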
 
7stud --

Tristin said:
That's what the 2nd if statement is; for catching the delimiter if the
buffer isn't too large. I can't use gets b/c I may expend all the
memory before the actual line is read.

Look. A string and a file are really no different--except reading from
a file is slow. Therefore, to speed things up read in the maximum every
time you read from the file, and store it in a string. Process the
string just like you would the file. Then read from the file again.
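
One way to read that advice, sketched here with hypothetical names (scan_records, chunk_size) and assuming a single-character delimiter: read a large chunk, then let String#index find the delimiters inside it instead of inspecting one character at a time. Only counters are kept between chunks, so an oversized record never has to be buffered.

```ruby
require 'stringio'

# Hypothetical helper: read the input in large chunks and scan each chunk
# in memory. Returns [number_of_records, numbers_of_records_over_max].
# Assumes sep is a single character, so it cannot straddle two chunks.
def scan_records(io, max, chunk_size = 1 << 20, sep = "\n")
  record = 1     # number of the record currently being read
  length = 0     # bytes seen so far in the current record
  too_long = []
  while (chunk = io.read(chunk_size))
    pos = 0
    while (hit = chunk.index(sep, pos))
      length += hit - pos
      too_long << record if length > max
      record += 1
      length = 0
      pos = hit + sep.length
    end
    # The tail of the chunk is a partial record carried into the next read.
    length += chunk.length - pos
  end
  too_long << record if length > max
  # If the input ended right after a delimiter, no record is pending.
  [length.zero? ? record - 1 : record, too_long]
end

data = "aa\n" + ("b" * 20) + "\n" + "cc"
p scan_records(StringIO.new(data), 5, 4)
# prints [3, [2]]
```

The tiny chunk size in the example is only there to exercise records that span several chunks.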
 
Tristin Davis

Gotcha, I'll post the code once I revamp ;)
 
Tristin Davis

Here's the benchmarks for the old and new code:
Old: 5.484000 0.031000 5.515000 ( 5.782000)
New: 5.094000 0.047000 5.141000 ( 5.407000)



module DataVerifier
  require 'strscan'

  def too_large?(buffer, max=1024)
    return true if buffer.length >= max
    false
  end

  # cache_size is the number of bytes read from the file per chunk.
  def verify_vbl(file, frequency, max, delimiter, out, cache_size=1048576)
    $/ = delimiter

    buffer = ''
    buf = ''
    record = 1
    o = File.new(out, "w")
    f = File.open(file, 'r')

    while (buffer = f.read(cache_size))
      cache = StringScanner.new(buffer)

      while (c = cache.getch)
        buf << c

        if too_large?(buf, max)
          o.print "record #{record} is too long, skipping to end\n"
          while (x = cache.getch)
            if x == $/
              buf = ''
              record += 1
              print "At record #{record}\n" if( (record % frequency ) == 0 ) unless frequency.nil?
              break
            end
          end
        end

        if c == $/
          record += 1
          print "At record #{record}\n" if( (record % frequency ) == 0 ) unless frequency.nil?
          buf = ''
        end
      end
    end
    f.close
    o.close
    record
  end
end
 
