[ ... ]
It seems allocating memory is slow. Do you have any
suggestions?
The usual suggestion for Linux (or anything very similar)
would be to memory map the file.
In this case, anything very similar would include Windows, no?
Windows also supports memory mapping (although obviously with a
different interface---can't make things too simple for the
programmer).
Alternatively, using a system-level read to put the data
directly into memory probably isn't significantly slower
either.
But he didn't really describe what he was doing in detail. If
his only allocation is one large piece of memory up front, it's
not the allocation of this memory which is costing the time.
Similarly, if he's using something like:
    char* ptr = &buffer[ 0 ] ;
    std::string line ;
    while ( std::getline( source, line ) ) {
        ptr = std::copy( line.begin(), line.end(), ptr ) ;
        *ptr ++ = '\n' ;
    }
(with a bit more error handling, of course), then in a good
implementation, line should soon have the capacity for the
longest line, and there should be no more allocations, either.
(Regrettably, the standard doesn't guarantee this, but most
implementations use the same memory management strategy for
std::string as for std::vector---of the four library
implementations I have at hand, only Sun CC's will ever reduce
the capacity in something like the above.)
On the other hand, any time he's using a 1 GB buffer, he's
risking page faults, and the copying in the above will be
expensive. On my system, with a file slightly over 400 MB, and
doing an xor over the data after the read (so that I was sure
that everything would be paged in with mmap), using mmap took
half a second, using a system read directly into the buffer took
one second, and using something like the above (with the xor on
the buffer after) took a little over 2 seconds. (That's elapsed
time, so it's not very precise---there are other things going on
on the system. Also, I'm pretty sure that by the time I got to
measuring, the entire file was already in system cache---the
system has 3 GB of main memory.) FWIW, the simplest utility (wc)
takes almost five seconds on the file, so we can probably
conclude that if there is any serious processing of the data
after it has been read, the differences in time will likely be
negligible.
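For reference, the xor pass used in those measurements can be as
simple as the sketch below: touch every byte and fold it into an
accumulator, so every page gets faulted in and the compiler can't
optimize the reads away (the function name xorAll is my own):

```cpp
#include <cstddef>

// Fold every byte into a single accumulator; the returned value forces
// the compiler to actually perform all the reads.
unsigned char xorAll( char const* data, std::size_t size )
{
    unsigned char acc = 0;
    for ( std::size_t i = 0; i != size; ++ i ) {
        acc ^= static_cast<unsigned char>( data[ i ] );
    }
    return acc;
}
```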