Actually, the idiom I most use is
File.read( fname).scan( %r{ juicy stuff}x) do |match|
# do something with juicy stuff
end
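To make that idiom concrete, here is a small sketch of it against a throwaway file. The log format, the file contents and the `errors` array are made up purely for illustration. It shows two things that are easy to forget: with the x flag the pattern can be spread over several commented lines, and when the pattern has groups, scan yields the captures rather than the whole match.

```ruby
require "tempfile"

# Hypothetical sample data, written to a temporary file for the demonstration.
log = Tempfile.new("scan_idiom_demo")
log.write("2006-01-02 ERROR disk full\n2006-01-03 INFO ok\n2006-01-04 ERROR timeout\n")
log.close

errors = []
File.read(log.path).scan(%r{
  ^(\d{4}-\d{2}-\d{2})   # date at the start of a line
  \s+ERROR\s+            # only ERROR entries
  (.*)$                  # the rest of the line is the message
}x) do |date, message|
  # With two groups in the pattern, the block receives the two captures.
  errors << "#{date}: #{message}"
end

puts errors
```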
Just remembered, I have an old RCR lying around on this.
Don't forget to vote for...
http://rcrchive.net/rcr/show/332
Currently there exist two very useful methods in Ruby.
IO.read( file_name) reads the entire file into a string.
string.scan( regexp){|match| } scans the entire string for regexp, yielding matches.
The limit on doing...
IO.read(file_name).scan( regexp)
is the size of your machine's unused physical memory.
Unix has a very handy facility called mmap that lets one memory-map
an entire file, so that the contents of that file appear mapped into
your virtual address space.
The operating system handles all the fuss and bother of reading (and
forgetting) pages of that file into memory.
Thus it would be very easy to create a mmap'd version, semantically the
same as the following function...
def IO.scan( file_name, regexp, &block)
IO.read(file_name).scan( regexp, &block)
end
But being mmap'd, it could handle files (almost) up to 4GB in size on a 32-bit machine.
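For reference, here is the slurp-then-scan version made runnable, so the semantics an mmap'd implementation would have to preserve are easy to poke at. This is only the plain in-memory behaviour, not mmap. File.read stands in for IO.read (for an ordinary path they do the same thing), and the sample file and the `pairs` array are assumptions for the demo.

```ruby
require "tempfile"

# Reference semantics from the proposal: read the whole file, then scan it.
# This still holds the entire file in memory; it is NOT the mmap'd version.
def IO.scan(file_name, regexp, &block)
  File.read(file_name).scan(regexp, &block)
end

# Hypothetical sample data for the demonstration.
sample = Tempfile.new("io_scan_demo")
sample.write("alpha=1\nbeta=22\ngamma=333\n")
sample.close

pairs = []
IO.scan(sample.path, /(\w+)=(\d+)/) { |key, value| pairs << [key, value.to_i] }

p pairs
```

Whatever replaces the body of IO.scan should yield exactly the same sequence of matches for the same file and regexp.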
Problem
IO.read(file_name).scan(regexp) is limited to the available physical
memory on your system.
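As a point of comparison, the usual workaround today is to scan line by line, sketched below. `scan_by_line` is a hypothetical helper name, not part of Ruby or of this RCR. It keeps only one line in memory at a time, but it is not semantically the same as scanning the whole file: any match that spans a newline is silently missed, which is exactly the gap an mmap'd scan would close.

```ruby
require "tempfile"

# Hypothetical helper: scan a file one line at a time so that only the
# current line is held in memory.
# Caveat: a match that crosses a line boundary will never be found.
def scan_by_line(file_name, regexp, &block)
  File.foreach(file_name) { |line| line.scan(regexp, &block) }
end

# Hypothetical sample data for the demonstration.
sample = Tempfile.new("scan_by_line_demo")
sample.write("id=7 id=8\nno match here\nid=9\n")
sample.close

ids = []
# One group in the pattern, so each yield is a one-element array of captures.
scan_by_line(sample.path, /id=(\d+)/) { |captures| ids << captures.first.to_i }

p ids
```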
Proposal
Reimplement...
def IO.scan( file_name, regexp, &block)
IO.read(file_name).scan( regexp, &block)
end
to use unix mmap.
Analysis
No language-level change, merely an extension to the existing io.c
Implementation
Here is some example code.
http://www.cs.purdue.edu/homes/fahmy/cs503/mmap.txt
Where they do the second mmap and the memcpy, we would do the regexp scan.
So that would have to be mashed together with io_read in io.c and
rb_str_scan in string.c.
Hmm. Just thinking. Before STL existed I did my own template library in
C++. One of the most useful features was I could mmap a string to a file
and thereafter the entire file behaved as an ordinary string.
The alternative to this RCR would be something that hacked the internal
representation of a ruby string so that the data pointed to was mmap'd.
Now I can think of _many_ uses for that.
However, that would be a far harsher change on the string class and GC
system. Thinking on that a bit more.
One of the Grand Unifying Principles of Unix is...
"Everything (graphics card, directories, sockets, network cards, ....)
is a file, and a File is just a stream of Bytes."
Repeat that until it's firmly stuck in your head.
Now take one small step further.
A stream of bytes is just a (possibly mmap'd) String.
Doesn't that make life really really simple?
Existing implementations!
Similar idea discussed here...
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/7673
Implementation for Unix here...
http://moulon.inra.fr/ruby/mmap.html
Implementation for Win32 here...
http://rubyforge.org/projects/win32utils/
John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : (e-mail address removed)
New Zealand
"We have more to fear from
The Bungling of the Incompetent
Than from the Machinations of the Wicked." (source unknown)