Chris Torek said:
If you are repeatedly searching for strings within a large file
(i.e., rewriting the venerable "grep" utility), and want it to be
as fast as possible as well as accurate, you almost certainly should
*not* be using any variant of strstr(), but rather some variant
of Boyer-Moore search.
Yes, but not in this case. I tried Boyer-Moore; it works well in
certain cases, especially on large files with plain string searches.
In the program I was writing, I first narrow the search area down to
a minimum by looking for a "token". Because of that, the individual
searches are actually small. Examples of the lengths of the search
strings are:
1. 24 (needle) in 35 (haystack)
2. 28 (needle) in 14 (haystack)
A whole lot of needles (thousands) are to be searched for in, on
average, 10 small haystacks. Needles are usually smaller than haystacks.
In the first case, a maximum of 35-24+1=12 window positions need to be
checked (sliding window). No match is expected in more than 99.99% of
all searches.
In the second case, no compare is necessary, but I'll let the
strcasestr from libc decide that. The length of the needle is not
known as I've written it now, see below for the code. Maybe this
outweighs the extra code needed to calculate the string length. I
could do some sorting too, either by length or alphabetically, before
doing the compares. That would limit the number of compares a bit.
Except for Boyer-Moore, the best string searching algorithms are
generally O(N) or slower (and often much slower), where N is the
amount of text to be searched -- in this case, the size of the
file. Boyer-Moore, however, is generally O(N/M), where M is the
length of the desired match. When searching for "abcdefghij", for
instance, you may be able to examine only about 10% of the file.
After checking 100% of the file for tokens, I estimate the examined
parts at about 5% in an average file, much less in larger than average
files. I have tried a few dozen case insensitive strstr variants for
speed and accuracy. The one from libc performed far better (>20%
faster) for this application than any other.
Also -- and somewhat incidentally -- reading the entire file into
memory first, then searching, could be one of the slower methods
to use, because it rules out parallelism. If you read the initial
part of the file, then instruct the I/O subsystem to read the next
part of the file while you work on the first part read so far, then
instruct the I/O subsystem to read the third part while you work
on the second, and so on, you may be able to finish the search in
only slightly longer than it takes to read the entire file, rather
than in "total time to read file + total time to search".
Do you mean multi-threading? I've tried this without multi-threading.
It takes more time to read small parts and perform pointer assignments
when the file is read and processed in chunks.
It is known that the file is likely already in memory cache.
Ideally (but not always actually), the best way to get this
parallelism in C is simply to go through ordinary stdio, using
fread() or fgets() or getchar() or whatever is most suitable to
the algorithm at hand.
I used fread() to read the file at once. The source file is user
input; it cannot be trusted to be properly formatted.
These are parts of the code to process the file, terminate lines as
strings and set the pointers, which will be used later to perform the
searches. It skips empty lines.
char **filepointers;
long max_pointers = 300000;
int indicator = 0;

filepointers = malloc(max_pointers * sizeof *filepointers);
if (filepointers == NULL)
    return;  /* or other error handling */
sfile is a large buffer of needles already read into memory from a
file; its length in bytes is known as lenfile.
long c, d;

d = 0;
/* point to first line, if it exists at position zero */
if (lenfile > 0 && sfile[0] != '\x0D' && sfile[0] != '\x0A') {
    filepointers[d] = sfile;          /* set pointer */
    d++;                              /* next pointer */
}
/* set pointers */
for (c = 0; c < lenfile; c++) {
    if (sfile[c] == '\x0D' || sfile[c] == '\x0A') {
        sfile[c] = 0;                 /* terminate string */
        indicator = 1;                /* new string coming up (possibly) */
    }
    else if (indicator == 1) {        /* start new string */
        if (d == max_pointers - 1)
            break;                    /* don't overflow the pointer array */
        filepointers[d] = sfile + c;  /* set pointer */
        indicator = 0;                /* reset indicator */
        d++;                          /* next pointer */
    }
}
filepointersnumber = d;               /* number of pointers counted */
This code still takes nearly as much time as reading the file from the
file cache. For an extremely large needle file, it's a bit less than
the time the searches take. Maybe this helps to gauge the impact:
read file from memory cache : assign pointers : perform searches
1.1 : 1 : 1.2
There may be 0x00 chars inside the lines. In that case, the search is
limited to the part before the first 0x00 encountered. To counter such
problems, another array can be created which holds the line lengths;
terminating the lines would then not be necessary. Case-insensitive
memory compares can be used instead of strcasestr, and a compare (a
function call) is only performed if the needle length is less than or
equal to the haystack length.
Gerald