Fast searching of large files

Stuart Clarke

Hey all,

Could anyone advise me on a fast way to search a single, very large
file (1 GB) for a string of text? Also, is there a library to
identify the file offset at which the string was found?

Thanks
 
Michael Fellinger

Hey all,

Could anyone advise me on a fast way to search a single, very large
file (1 GB) for a string of text? Also, is there a library to
identify the file offset at which the string was found?

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
io.grep(/apiKey/){|m| p io.pos => m } }

The pos is the position where the match ended, so just subtract the string length.
The example above ran against a 700 MB file; it took around 40 s the first
time and 2 s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).
Oh, and don't use binary encoding if your file actually uses another one ;)
 
Robert Klemme

2010/7/1 Michael Fellinger said:
You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
io.grep(/apiKey/){|m| p io.pos => m } }

The pos is the position where the match ended, so just subtract the string length.
The example above ran against a 700 MB file; it took around 40 s the first
time and 2 s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).

If you only need to know whether the string occurs in the file you can do

found = File.foreach("foo").any? {|line| /apiKey/ =~ line}

This will stop searching as soon as the sequence is found.
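A sketch of the same early-exit idea that also reports where the first hit was, in case that helps. (The helper name first_match_line is made up for illustration; it stops reading as soon as a line matches.)

```ruby
# Return the 1-based line number of the first line matching +pattern+,
# or nil if the pattern never occurs. Reads lazily, line by line.
def first_match_line(path, pattern)
  File.foreach(path).with_index(1) do |line, lineno|
    return lineno if pattern =~ line
  end
  nil
end
```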

"fgrep -l foo" is likely faster.

Kind regards

robert
 
Stuart Clarke

Thanks.

This seems to be pretty much the best approach for me; however, it takes a
good 20 minutes to scan a 2 GB file.

Any ideas?

Thanks
 
Joel VanderWerf

Michael said:
You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
io.grep(/apiKey/){|m| p io.pos => m } }

The pos is the position the match ended

Actually, pos will be the position of the end of the line on which the
match was found, because #grep works line by line.
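Right. So if the byte offset of the match itself is what's needed, one option is to keep a running byte count per line instead of relying on io.pos. A sketch (each_offset is a made-up helper; it assumes the needle is a plain byte string and opens the file in binary mode):

```ruby
# Yield the byte offset of every occurrence of +needle+ (a plain
# byte string), scanning line by line with a running byte count.
def each_offset(path, needle)
  offset = 0
  File.open(path, "rb") do |io|
    io.each_line do |line|
      idx = 0
      while (idx = line.index(needle, idx))
        yield offset + idx
        idx += 1
      end
      offset += line.bytesize
    end
  end
end
```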
 
brabuhr

If you only need to know whether the string occurs in the file you can do
found = File.foreach("foo").any? {|line| /apiKey/ =~ line}
This will stop searching as soon as the sequence is found.

"fgrep -l foo" is likely faster.

irb> `fgrep -l waters /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershed /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershedz /usr/share/dict/words`.size > 0
=> false

irb> `fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["153088", "153102", "204143", "234643", "472357", "856441",
"913606", "913613", "913623", "913635", "913646", "913656", "913668",
"913679", "913690", "913703"]
irb> `fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["913613", "913623", "913635"]
irb> `fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> []
 
Roger Pack

Stuart said:
Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

A fast way is to do it in C :)

Here are a few other helpers, though:

1.9 has faster regexes.
Boost regexes: http://github.com/michaeledgar/ruby-boost-regex (you
could probably optimize it more than it currently is, as well...)

Rubinius also might help.

Also make sure to open your file in binary mode if you're on 1.9; that
reads much faster. If that's an option, anyway.
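To expand on the binary-mode point: if line-by-line scanning is the bottleneck, reading the file in large binary chunks can cut that 20-minute scan down considerably. A sketch (scan_chunks and the 1 MB chunk size are illustrative, not tuned); consecutive chunks overlap by needle.bytesize - 1 bytes so a match straddling a chunk boundary isn't missed:

```ruby
# Scan +path+ for the byte string +needle+ in fixed-size binary chunks,
# yielding the byte offset of each occurrence.
def scan_chunks(path, needle, chunk_size = 1 << 20)
  overlap = needle.bytesize - 1
  File.open(path, "rb") do |io|
    base  = 0   # bytes consumed from the file so far
    carry = ""  # tail of the previous buffer, re-scanned for boundary hits
    while (chunk = io.read(chunk_size))
      buf = carry + chunk
      idx = 0
      while (idx = buf.index(needle, idx))
        yield base - carry.bytesize + idx
        idx += 1
      end
      base += chunk.bytesize
      carry = overlap > 0 ? buf.byteslice([buf.bytesize - overlap, 0].max, overlap) : ""
    end
  end
end
```

Since the carry is shorter than the needle, a match can never sit entirely inside it, so nothing is reported twice.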
GL.
-rp
 
