Improving efficiency of algorithm

C

Comp1597

I have a function with signature: unsigned f(unsigned);

My task is to read positive integers from a file (call each number x)
and to write f(x) on standard output

My code goes like this:

BEGINNING OF CODE
unsigned ComputeThis;
// ReadFromThisFile is a stream bound to a text file
while (ReadFromThisFile >> ComputeThis)
std::cout << f(ComputeThis)<<"\n";
END OF CODE

Assume that the text file is very large. Can anyone suggest
strategies to improve the efficiency of this code in terms of either
caching values of ComputeThis that occur often, or speeding up the
stream reading and writing process?

Or other efficiency gains?

Feel free to suggest general ideas or areas or computer science/ c++
to study that are relevant to this problem

Thanks a lot.
 
A

Alf P. Steinbach

* (e-mail address removed):
I have a function with signature: unsigned f(unsigned);

Unless f does some bitlevel stuff I recommend using plain 'int', not 'unsigned'.

My task is to read positive integers from a file (call each number x)
and to write f(x) on standard output

My code goes like this:

BEGINNING OF CODE
unsigned ComputeThis;
// ReadFromThisFile is a stream bound to a text file
while (ReadFromThisFile >> ComputeThis)
std::cout << f(ComputeThis)<<"\n";
END OF CODE

'ReadFromThisFile', being an imperative, is IMHO ungood as a name for a stream.

Instead use something like 'NumberStream', a *descriptive* name.

Assume that the text file is very large. Can anyone suggest
strategies to improve the efficiency of this code in terms of either
caching values of ComputeThis that occur often, or speeding up the
stream reading and writing process?

Generally that's done by buffering.

I'm not very familiar with iostreams because I really hate them (for reasons
discussed earlier and too much to discuss here), but the general idea for such
speedup is to increase the stream's buffer size. Which as I recall is not very
easy with iostreams. However, it's easy with a FILE*.

Or other efficiency gains?

You should turn off sync_with_stdio.

Feel free to suggest general ideas or areas or computer science/ c++
to study that are relevant to this problem

The most important is the question of when to (not) optimize.

First of all make sure that your code is *correct*.

Don't care about micro level performance at this point, just make really sure
that it is correct. That means, among other things, /writing/ a clear-cut
technical specification, /creating/ automated tests and manual reproducible test
procedures, and doing them, and not the least, and I mean place that at the
front of issues to deal with, the /reality check/ of whether your program really
solves the higher level problem it's meant to solve (in your case there's
probably not a customer but in general, before spending zillions on development
make sure that what's developed will actually serve the customer's needs and not
introduce new severe problems, present and projected to future).

Then, before doing anything else except "known measures" that should be done as
a matter of course such as the one mentioned above, *measure*.

If performance is adequate, do nothing.


Cheers & hth.,

- Alf
 
C

Comp1597

* (e-mail address removed):


Unless f does some bitlevel stuff I recommend using plain 'int', not 'unsigned'.




'ReadFromThisFile', being an imperative, is IMHO ungood as a name for a stream.

Instead use something like 'NumberStream', a *descriptive* name.


Generally that's done by buffering.

I'm not very familiar with iostreams because I really hate them (for reasons
discussed earlier and too much to discuss here), but the general idea for such
speedup is to increase the stream's buffer size. Which as I recall is not very
easy with iostreams. However, it's easy with a FILE*.


You should turn off sync_with_stdio.


The most important is the question of when to (not) optimize.

First of all make sure that your code is *correct*.

Don't care about micro level performance at this point, just make really sure
that it is correct. That means, among other things, /writing/ a clear-cut
technical specification, /creating/ automated tests and manual reproducible test
procedures, and doing them, and not the least, and I mean place that at the
front of issues to deal with, the /reality check/ of whether your program really
solves the higher level problem it's meant to solve (in your case there's
probably not a customer but in general, before spending zillions on development
make sure that what's developed will actually serve the customer's needs and not
introduce new severe problems, present and projected to future).

Then, before doing anything else except "known measures" that should be done as
a matter of course such as the one mentioned above, *measure*.

If performance is adequate, do nothing.

Cheers & hth.,

- Alf

Thanks, Alf. The names such as ReadFromThisFile are bogus. I am
changing names for confidentiality reasons.
 
C

Comp1597

* (e-mail address removed):


Unless f does some bitlevel stuff I recommend using plain 'int', not 'unsigned'.




'ReadFromThisFile', being an imperative, is IMHO ungood as a name for a stream.

Instead use something like 'NumberStream', a *descriptive* name.


Generally that's done by buffering.

I'm not very familiar with iostreams because I really hate them (for reasons
discussed earlier and too much to discuss here), but the general idea for such
speedup is to increase the stream's buffer size. Which as I recall is not very
easy with iostreams. However, it's easy with a FILE*.


You should turn off sync_with_stdio.


The most important is the question of when to (not) optimize.

First of all make sure that your code is *correct*.

Don't care about micro level performance at this point, just make really sure
that it is correct. That means, among other things, /writing/ a clear-cut
technical specification, /creating/ automated tests and manual reproducible test
procedures, and doing them, and not the least, and I mean place that at the
front of issues to deal with, the /reality check/ of whether your program really
solves the higher level problem it's meant to solve (in your case there's
probably not a customer but in general, before spending zillions on development
make sure that what's developed will actually serve the customer's needs and not
introduce new severe problems, present and projected to future).

Then, before doing anything else except "known measures" that should be done as
a matter of course such as the one mentioned above, *measure*.

If performance is adequate, do nothing.

Cheers & hth.,

- Alf
 
C

Comp1597

* (e-mail address removed):


Unless f does some bitlevel stuff I recommend using plain 'int', not 'unsigned'.




'ReadFromThisFile', being an imperative, is IMHO ungood as a name for a stream.

Instead use something like 'NumberStream', a *descriptive* name.


Generally that's done by buffering.

I'm not very familiar with iostreams because I really hate them (for reasons
discussed earlier and too much to discuss here), but the general idea for such
speedup is to increase the stream's buffer size. Which as I recall is not very
easy with iostreams. However, it's easy with a FILE*.


You should turn off sync_with_stdio.


The most important is the question of when to (not) optimize.

First of all make sure that your code is *correct*.

Don't care about micro level performance at this point, just make really sure
that it is correct. That means, among other things, /writing/ a clear-cut
technical specification, /creating/ automated tests and manual reproducible test
procedures, and doing them, and not the least, and I mean place that at the
front of issues to deal with, the /reality check/ of whether your program really
solves the higher level problem it's meant to solve (in your case there's
probably not a customer but in general, before spending zillions on development
make sure that what's developed will actually serve the customer's needs and not
introduce new severe problems, present and projected to future).

Then, before doing anything else except "known measures" that should be done as
a matter of course such as the one mentioned above, *measure*.

If performance is adequate, do nothing.

Cheers & hth.,

- Alf

Alf,

I tried the sync_with_stdio(false) trick together with cin.tie(0) but
observed no increase in speed on large data sets. Perhaps even a
decrease though it's not a statistically large enough sample.

I'm using the express version of visual c++ on Windows Vista.

As to whether "performance is adequate" it's hard for me to tell. I
only have access to data sets of 100,000 numbers but the algorithm
needs to work at reasonable speeds (preferably less than a minute) for
data sets of tens of millions.

Thanks
 
J

James Kanze

* (e-mail address removed):
Unless f does some bitlevel stuff I recommend using plain
'int', not 'unsigned'.
'ReadFromThisFile', being an imperative, is IMHO ungood as a
name for a stream.
Instead use something like 'NumberStream', a *descriptive* name.

I'd recommend something indicating the direction in the name as
well. Typically, I'll just use something like input or source,
although a qualified noun (numberInputStream) would be better.
(But I'm lazy, and if only one input stream is visible...)
Generally that's done by buffering.
I'm not very familiar with iostreams because I really hate
them (for reasons discussed earlier and too much to discuss
here), but the general idea for such speedup is to increase
the stream's buffer size. Which as I recall is not very easy
with iostreams. However, it's easy with a FILE*.

It works exactly the same in both cases. However, a good
implementation of filebuf or FILE will use the optimal buffer
size by default; changing it isn't likely to gain anything, and
may cost.
You should turn off sync_with_stdio.

That should only affect the standard stream objects (std::cin,
etc.). It also shouldn't make much difference on them, if the C
and the C++ libraries are integrated together. It only becomes
important when the C library doesn't provide a means for the C++
library to know whether it has buffered data or not.

Still, it's probably not a bad habit to get into, if you know
that your programs don't make use of stdin, stdout or stderr.

Of course, what he was asking for was some way to recognize that
the next number to be input will be the same as a number already
read, and to skip reading it. For that, you have to activate
the non-standard cristal ball option on the istream, and it's
not widely implemented.
 
C

Comp1597

Of course, what he was asking for was some way to recognize that
the next number to be input will be the same as a number already
read, and to skip reading it.
....

You misread. I never said or suggested anything remotely like this.
My software could notice that, for example, the number 2 has been
encountered frequently _in previous values in the current stream_ and
remember the value of f rather than calculating it freshly each time.
I would have thought that this is a standard computer science problem:
whether the efficiency gains from such caching outweigh the costs of
storing the values.  
 
A

Alf P. Steinbach

* James Kanze:
It works exactly the same in both cases.

Could you give an example?


However, a good
implementation of filebuf or FILE will use the optimal buffer
size by default; changing it isn't likely to gain anything, and
may cost.

I very much doubt this statement, unless it's meant as a definition of what
"good" would be if it existed, instead of a description of the typical "good".

That should only affect the standard stream objects (std::cin,
etc.).

The OP's output is to the standard output stream.

It also shouldn't make much difference on them, if the C
and the C++ libraries are integrated together. It only becomes
important when the C library doesn't provide a means for the C++
library to know whether it has buffered data or not.

The OP reported no observed difference. However, it can't hurt unless the
implementation is really perverse. And I think, based on experience with that,
that those measurements are suspect.

Still, it's probably not a bad habit to get into, if you know
that your programs don't make use of stdin, stdout or stderr.


Cheers & hth.,

- Alf
 
K

Kai-Uwe Bux

...

You misread. I never said or suggested anything remotely like this.
My software could notice that, for example, the number 2 has been
encountered frequently _in previous values in the current stream_ and
remember the value of f rather than calculating it freshly each time.
I would have thought that this is a standard computer science problem:
whether the efficiency gains from such caching outweigh the costs of
storing the values.

Yes, that _is_ a standard problem; and the standard answer is: you find out
by measuring both alternatives. (Note that this is an oversimplification.
For the caching, even if you settle on a fixed caching logic and
implementation of the cache, you usually can vary the cache size. Larger
cache size means less values need to be re-computed, but searching takes
longer.)

Also, a lot depends on the particular machine; and the solution that wins on
one may not win on another.


Best

Kai-Uwe Bux
 
J

James Kanze

* James Kanze:
Could you give an example?

stream.rdbuf()->pubsetbuf( buffer, size ) ;

This is one thing that the standard changed, with regards to the
classical streams (where the function was just setbuf). For no
good reason that I can see.
I very much doubt this statement, unless it's meant as a
definition of what "good" would be if it existed, instead of a
description of the typical "good".

It's been my experience (and I've had to deal with extremely
large files more than once).

There are exceptions: on a 16 bit processor without virtual
memory, it did turn out advantageous in one program to use a 32
KB buffer on one stream---obviously, the implementation of FILE
didn't use 32 KB buffers by default. But on a system with
virtual memory, the implementation is very deficient if the size
of the buffer isn't close to the system optimum (In some cases,
it may be an advantage to supply a buffer with particular
alignement, e.g. exactly on a page boundary. Although in such
cases, it wouldn't surprise me if an implementation used mmap.)

Another thing which *could* make a difference, if the
implementation so decided, is opening in binary mode. Since
most of my experience is under Unix, where binary and text modes
are identical, I have no experience with this, but
theoretically, at least, under Windows, it should be possible to
do IO in binary mode faster than in text mode.
The OP's output is to the standard output stream.

I'd missed that. My impression was that he was trying to
optimize input.
The OP reported no observed difference. However, it can't hurt
unless the implementation is really perverse. And I think,
based on experience with that, that those measurements are
suspect.

It can't hurt, and that's why I said that it's not a bad habit
to get into. (In other words: you should do it by habit,
anytime you know that the standard C streams aren't used in your
code.) That the OP reported no observed difference only means
that output wasn't the bottleneck. (From a later posting of
his, I gather that it was calculating the value to be output,
and not IO at all, that is the bottle neck.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,160
Messages
2,570,890
Members
47,423
Latest member
henerygril

Latest Threads

Top