File Read Performance Issue


Mike Copeland

I am developing an application that reads and stores data from 2
large text files. One file (5750kb) has ~160,000 records and the other
(23,000kb) has ~330,000 records. The program reads both files and
converts/stores their data into vectors. Also, some indices to unique
records are developed (much like a telephone book), so that searches can
be efficiently done.
The reads at program start take a long time (~2 minutes), and
although I use buffers of 4096 size, I can't see other ways to improve
this aspect of the program's performance. The data this program uses
will continue to grow, so I am concerned about its viability. Any
thoughts? TIA
 

Ian Collins

Mike said:
I am developing an application that reads and stores data from 2
large text files. One file (5750kb) has ~160,000 records and the other
(23,000kb) has ~330,000 records. The program reads both files and
converts/stores their data into vectors. Also, some indices to unique
records are developed (much like a telephone book), so that searches can
be efficiently done.
The reads at program start take a long time (~2 minutes), and
although I use buffers of 4096 size, I can't see other ways to improve
this aspect of the program's performance. The data this program uses
will continue to grow, so I am concerned about its viability. Any
thoughts? TIA

Where does your profiler tell you it is spending its time? There are
too many different possibilities to give a decent answer without knowing
where the bottleneck is.
 

Mike Copeland

Where does your profiler tell you it is spending its time? There are
too many different possibilities to give a decent answer without knowing
where the bottleneck is.

I don't have a profiler. However, I have tried 3 different "read"
techniques (getline, fgets, a hybrid fgets) and have found that there is
a base I/O factor that doesn't change. This test of mine eliminated all
of the "storing" process, and the raw I/O numbers seem to indicate the
problem is in the I/O runtime area.
FWIW, using "getline" doubles the raw test times of the "fgets"
techniques - there's a real extra overhead with this technique.
I'm seeking here some sort of dramatic improvement to the I/O
process, and I don't know if there's a better/faster way to read lots of
text data records.
 

Ian Collins

Mike said:
I don't have a profiler.

Doesn't every tool-set have one?
However, I have tried 3 different "read"
techniques (getline, fgets, a hybrid fgets) and have found that there is
a base I/O factor that doesn't change. This test of mine eliminated all
of the "storing" process, and the raw I/O numbers seem to indicate the
problem is in the I/O runtime area.
FWIW, using "getline" doubles the raw test times of the "fgets"
techniques - there's a real extra overhead with this technique.
I'm seeking here some sort of dramatic improvement to the I/O
process, and I don't know if there's a better/faster way to read lots of
text data records.

You're doing exactly what I hinted you shouldn't do: random guessing. As I
said, there are many different possible causes of your slowness. For
example, are you resizing vectors too often and running into allocation
and copy overheads? Is your I/O subsystem slow?

Without measuring, you are taking stabs in the dark and most likely
wasting time fixing something that isn't broken.

If you can't profile (which I'm sure you can), try something like just
reading and not saving the data. Then try reading all the data into a
temporary buffer and then loading up your real structures. Time both.
 

Nobody

I don't have a profiler. However, I have tried 3 different "read"
techniques (getline, fgets, a hybrid fgets) and have found that there is a
base I/O factor that doesn't change. This test of mine eliminated all of
the "storing" process, and the raw I/O numbers seem to indicate the
problem is in the I/O runtime area.
FWIW, using "getline" doubles the raw test times of the "fgets"
techniques - there's a real extra overhead with this technique.
I'm seeking here some sort of dramatic improvement to the I/O
process, and I don't know if there's a better/faster way to read lots of
text data records.

Other things to try include:

1. Using significantly larger buffers.
2. Using the OS' "native" I/O functions e.g. read() or ReadFile().
3. Both 1 and 2.
4. Using C's fread().
5. 4 and 1.
6. 4, with setvbuf(fp, NULL, _IONBF, 0);
7. Memory mapping, e.g. mmap() or CreateFileMapping().
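
As a rough sketch of what 1, 4 and 5 combined might look like (this just
counts lines; real parsing would happen where the newline scan is, and
would also have to deal with records that straddle two reads):

#include <cstdio>
#include <cstring>
#include <vector>

long count_lines(const char *path)
{
    std::FILE *fp = std::fopen(path, "rb");
    if (!fp) return -1;

    std::vector<char> buf(4 * 1024 * 1024);   // option 1: a 4 MB buffer (arbitrary size)
    long lines = 0;
    std::size_t n;
    while ((n = std::fread(&buf[0], 1, buf.size(), fp)) > 0) {   // option 4: fread()
        // Option 5 is simply 1 + 4: big reads, then scan the block in memory.
        const char *p = &buf[0];
        const char *end = p + n;
        while ((p = static_cast<const char *>(std::memchr(p, '\n', end - p))) != 0) {
            ++lines;   // a record ends here; a real program would parse it
            ++p;
        }
    }
    std::fclose(fp);
    return lines;
}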
 

Ike Naar

I don't have a profiler. However, I have tried 3 different "read"
techniques (getline, fgets, a hybrid fgets) and have found that there is
a base I/O factor that doesn't change. This test of mine eliminated all
of the "storing" process, and the raw I/O numbers seem to indicate the
problem is in the I/O runtime area.
FWIW, using "getline" doubles the raw test times of the "fgets"
techniques - there's a real extra overhead with this technique.
I'm seeking here some sort of dramatic improvement to the I/O
process, and I don't know if there's a better/faster way to read lots of
text data records.

Is your input file stored on a very slow medium?
On the machine I'm using right now, reading a 23MB file from disk,
using getline in a loop, takes less than a second.
You could try to time some other utility program, such as

time wc -l <your_input_file

or

time cat <your_input_file >/dev/null

to get an idea how long reading your input file should reasonably take.
 

Marcel Müller

I am developing an application that reads and stores data from 2
large text files. One file (5750kb) has ~160,000 records and the other
(23,000kb) has ~330,000 records. The program reads both files and
converts/stores their data into vectors. Also, some indices to unique
records are developed (much like a telephone book), so that searches can
be efficiently done.
The reads at program start take a long time (~2 minutes), and
although I use buffers of 4096 size,

4096 bytes is not much of a buffer nowadays. The break-even point of
modern HDDs is above 1 MB. So if you think that your task is I/O bound
(low CPU usage), significantly increase the buffer size.

If your task is CPU bound (high CPU load) then this won't help at all.
Also, depending on your access pattern and the file system driver, many
platforms detect that you intend to read the entire file and implicitly
use a large read-ahead cache, in which case you don't need a larger
buffer at all.
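
For example (just a sketch, with a made-up file name and an arbitrary
1 MB figure), enlarging the buffer looks like this for C stdio and for
iostreams:

#include <cstdio>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t bufsize = 1 << 20;              // 1 MB, arbitrary example

    // C stdio: setvbuf() must be called after fopen() but before the first read.
    std::FILE *fp = std::fopen("records.txt", "r");   // hypothetical file name
    std::vector<char> cbuf(bufsize);
    if (fp)
        std::setvbuf(fp, &cbuf[0], _IOFBF, cbuf.size());

    // iostreams: pubsetbuf() should be called before open() to take effect
    // on common implementations.
    std::vector<char> sbuf(bufsize);
    std::ifstream in;
    in.rdbuf()->pubsetbuf(&sbuf[0], sbuf.size());
    in.open("records.txt");

    // ... read the files as before ...

    if (fp)
        std::fclose(fp);
}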
I can't see other ways to improve
this aspect of the program's performance. The data this program uses
will continue to grow, so I am concerned about its viability. Any
thoughts? TIA

Do you have any linear access to your data structures? That will likely
cause O(n^2) performance, which is really bad. It happens e.g. if you
insert into sorted vectors one record at a time.

What about allocations? I guess each record is stored in a separate
vector instance. Is each instance allocated separately? Depending on the
performance of your runtime's allocators, different results are
possible. I have seen very smart implementations as well as very bad ones.
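
To illustrate the sorted-vector point (a sketch with a made-up Record
type): inserting each record at its sorted position while loading is
quadratic; appending everything and sorting once at the end is not.

#include <algorithm>
#include <string>
#include <vector>

struct Record { std::string key, data; };   // made-up record layout

bool by_key(const Record &a, const Record &b) { return a.key < b.key; }

// O(n^2) overall: every insert may shift all elements behind the insertion point.
void add_keeping_sorted(std::vector<Record> &v, const Record &r)
{
    v.insert(std::upper_bound(v.begin(), v.end(), r, by_key), r);
}

// O(n log n) overall: reserve once, append everything, sort a single time,
// then use binary search (lower_bound) for lookups.
void bulk_load(std::vector<Record> &v, const std::vector<Record> &input)
{
    v.reserve(v.size() + input.size());
    v.insert(v.end(), input.begin(), input.end());
    std::sort(v.begin(), v.end(), by_key);
}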

You could further improve performance if you read the two files in parallel.


Marcel
 

Carlo Milanesi

Mike said:
I am developing an application that reads and stores data from 2
large text files. One file (5750kb) has ~160,000 records
and the other
(23,000kb) has ~330,000 records. The program reads both files and
converts/stores their data into vectors. Also, some indices to unique
records are developed (much like a telephone book),
so that searches can
be efficiently done.
The reads at program start take a long time (~2 minutes), and
although I use buffers of 4096 size, I can't see other ways to improve
this aspect of the program's performance. The data this program uses
will continue to grow, so I am concerned about its viability.

Probably your program has a bug or uses a very slow medium, because
modern hard disks and flash memories can be read at tens of MB per second.
However, memory-mapped files may improve the performance of your program:
http://en.wikibooks.org/wiki/Optimi...on_techniques/Input/Output#Memory-mapped_file
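
On POSIX systems, for example, a memory-mapped read could look roughly
like this (a sketch; on Windows the equivalent would be
CreateFileMapping()/MapViewOfFile()):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

// Map the whole file and scan it as one contiguous block of memory.
long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0)    { close(fd); return 0; }   // mmap rejects zero length

    void *m = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (m == MAP_FAILED) { close(fd); return -1; }

    const char *p = static_cast<const char *>(m);
    const char *end = p + st.st_size;
    long lines = 0;
    while ((p = static_cast<const char *>(std::memchr(p, '\n', end - p))) != 0) {
        ++lines;
        ++p;
    }

    munmap(m, st.st_size);
    close(fd);
    return lines;
}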
 

Jorgen Grahn

4096 bytes is not much of a buffer nowadays. The break-even point of
modern HDDs is above 1 MB. So if you think that your task is I/O bound
(low CPU usage), significantly increase the buffer size.

Depends on what he means by "buffers of 4096 size". If I naively use
iostreams to read a text file line by line, it maps to reading 8191
bytes via the kernel interface, read(2). I hope the kernel at that
point has read /more/ than 8191 bytes into its cache.
If your task is CPU bound (high CPU load) then this won't help at all.

Yeah. As we all know, his problem lies somewhere else. His 2 minutes
is four hundred times slower than the 280 ms I measure on my ancient
hardware -- no misuse of the I/O facilities can make such a radical
difference.

/Jorgen
 

Andreas Dehmel

On Fri, 16 Aug 2013 18:41:48 -0700, Mike Copeland wrote:
I am developing an application that reads and stores data from 2
large text files. One file (5750kb) has ~160,000 records and the
other (23,000kb) has ~330,000 records. The program reads both files
and converts/stores their data into vectors. Also, some indices to
unique records are developed (much like a telephone book), so that
searches can be efficiently done.
The reads at program start take a long time (~2 minutes), and
although I use buffers of 4096 size, I can't see other ways to
improve this aspect of the program's performance. The data this
program uses will continue to grow, so I am concerned about its
viability. Any thoughts? TIA

This data is tiny by today's standards and should easily be read
within seconds (lower single digits, if that). If you have a slowdown
of this magnitude, your problems are pretty much guaranteed to be
on the algorithmic side, not IO-related. You most likely have a
quadratic algorithm hidden in there somewhere and that's the one
you have to replace with something sane.

Without access to your code, it's impossible to be more specific.
Your "indices to unique records" sound suspicious, for instance,
because dumb "unique" algorithms are quadratic; try disabling those
temporarily to see whether that fixes the problem, and continue from
there. Apart from that, if you read into vectors (as in std::vector)
via push_back, that is not quadratic and should be fine, but beware of
doing something like resize() or reserve() per record, because that
_will_ degrade filling the vector into quadratic complexity. A final
candidate would be your record structures: if these are very expensive
to copy and you're allowed to use C++11 features, you might consider
adding move semantics (if your records are just a bunch of POD, don't
bother).
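
For instance, building the "unique records" index with a hash map stays
roughly linear, whereas scanning the already-loaded records for every new
one is the quadratic trap. A sketch, with a made-up Record type and
std::string keys (std::unordered_map is C++11; std::map would also do and
is only a log factor slower):

#include <string>
#include <unordered_map>
#include <vector>

struct Record { std::string key, data; };   // made-up record layout

// One pass, amortized O(1) per insert, so O(n) overall.  A per-record
// linear scan ("have I seen this key before?") would be O(n^2) instead.
std::unordered_map<std::string, std::size_t>
build_index(const std::vector<Record> &records)
{
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(records.size());
    for (std::size_t i = 0; i < records.size(); ++i)
        index.insert(std::make_pair(records[i].key, i));   // keeps the first occurrence
    return index;
}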



Andreas
 

Jorgen Grahn

Unfortunately C++ streams tend to be significantly slower than the
C equivalents in the vast majority of implementations (every single
one I have ever used.) With small amounts of data it doesn't matter,
but when the data amount increases, it starts to show.

If you are using C++ streams, switching to (properly used) C I/O
functions could well cut that time to 1 minute or less. (Further
optimization will have to be done at a higher level.)

I was going to challenge that "but iostreams are slower!" idea, but it
looks like that's the case here, too. On my Linux box, a line-counting
loop is about twice as fast with fgets() compared to std::getline(stream,
string&). I was expecting a 20% difference or something.

On the other hand, both are pretty fast: 100--200 nanoseconds per line.

Interestingly, the fgets() version makes twice as many read() calls.
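
The two loops compared were roughly of this shape (a sketch, not the
exact benchmark code):

#include <cstdio>
#include <iostream>
#include <string>

// iostreams version: one std::string per line.
long count_getline(std::istream &in)
{
    long n = 0;
    std::string line;
    while (std::getline(in, line))
        ++n;
    return n;
}

// stdio version: fixed buffer; note that a line longer than the buffer
// is counted more than once.
long count_fgets(std::FILE *fp)
{
    long n = 0;
    char buf[4096];
    while (std::fgets(buf, sizeof buf, fp))
        ++n;
    return n;
}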

/Jorgen
 

Marcel Müller

Depends on what he means by "buffers of 4096 size". If I naively use
iostreams to read a text file line by line, it maps to reading 8191
bytes via the kernel interface, read(2). I hope the kernel at that
point has read /more/ than 8191 bytes into its cache.

Yes, but for read-ahead the kernel needs to estimate the next blocks that
you are going to read. This works best on platforms where the disk
cache is located in the filesystem driver. It works less well when the
cache is part of the block device driver, because the next LBA is not
necessarily the next block of your file. And it usually works very badly
if you do seek operations.
Yeah. As we all know, his problem lies somewhere else. His 2 minutes
is four hundred times slower than the 280 ms I measure on my ancient
hardware -- no misuse of the I/O facilities can make such a radical
difference.

Anything is possible, but where did you get the numbers?


Marcel
 

Jorgen Grahn

Yes, but for read-ahead the kernel needs to estimate the next blocks that
you are going to read. This works best on platforms where the disk
cache is located in the filesystem driver. It works less well when the
cache is part of the block device driver, because the next LBA is not
necessarily the next block of your file. And it usually works very badly
if you do seek operations.


Anything is possible, but where did you get the numbers?

280ms? Why -- do you doubt it? I simply found a file of the right size
and (in Unix) timed 'cat file >/dev/null' or 'wc -l file' or
something. The file was in cache at that point.

/Jorgen
 

Jorgen Grahn

Yes, the iostreams slowdown factor is about 2 in my experience too, also
for other usage than std::getline(). For example, concatenating large
strings from pieces with std::string::operator+= seems to be about twice
as fast as using std::ostringstream. My guess is this is so because of
the massive number of virtual function calls.

Hm, they always say iostreams are designed so that you're not doing a
virtual function call per character -- that buffer thing. So that
theory seems a bit unlikely to me. But I don't have a better one.

I suspected a performance hit from locale usage, but if that's involved,
switching to the plain "C" locale didn't help.

Lastly: ten years ago I could believe my C++ (libstdc++, Gnu/Linux)
wasn't optimized. But by now surely someone has tried to fix at least
this common case: reading std::cin line by line? Strange.

/Jorgen
 

Melzzzzz

Hm, they always say iostreams are designed so that you're not doing a
virtual function call per character -- that buffer thing. So that
theory seems a bit unlikely to me. But I don't have a better one.

I suspected a performance hit from locale usage, but if that's
involved, switching to the plain "C" locale didn't help.

Lastly: ten years ago I could believe my C++ (libstdc++, Gnu/Linux)
wasn't optimized. But by now surely someone has tried to fix at least
this common case: reading std::cin line by line? Strange.

With the g++ stdlib implementation you have to call
std::ios_base::sync_with_stdio(false) in order to get good
performance.
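
I.e. something like this at the top of main(), before any other I/O
(a sketch):

#include <iostream>
#include <string>

int main()
{
    // Decouple the C++ standard streams from C stdio.  Must be done
    // before any input or output has taken place.
    std::ios_base::sync_with_stdio(false);
    // Optional: don't flush std::cout before every read from std::cin.
    std::cin.tie(0);

    std::string line;
    long n = 0;
    while (std::getline(std::cin, line))
        ++n;
    std::cout << n << " lines\n";
    return 0;
}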
 

Jorgen Grahn

With the g++ stdlib implementation you have to call
std::ios_base::sync_with_stdio(false) in order to get good
performance.

I was just about to say "of course I had that one; it's in main() in
all my programs!" ... but I might have forgotten this time. I'll
look into it tonight.

/Jorgen
 

Melzzzzz

This would not explain the difference between std::ostringstream vs
std::string concatenation, would it?

No. This is for fast cin/cout operations, e.g. reading a file from
std::cin with getline.
 

Jorgen Grahn

I was just about to say "of course I had that one; it's in main() in
all my programs!" ... but I might have forgotten this time. I'll
look into it tonight.

Turns out I hadn't forgotten. So, a factor 2 difference not caused by
the stdin/std::cin connection.

But when I now try omitting std::ios_base::sync_with_stdio(false),
I get surprising results: a slowdown by another factor of 30! To
summarize the three results, for a file of 500 MB and 6.7 million
lines:

0.7s (fgets) - 1.2s (getline, with sync_with_stdio(false)) - 34s (getline, default)

The 30x slowdown is even harder to find an excuse for. Ok, so it's
avoidable -- but the workaround is obscure. I think I learned about
it here rather recently.

/Jorgen
 
