Efficency

C

Chris Hatton

I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me in
the right direction that would be great. I'm not looking for any code
just a couple of links to good websites as my time is very limited at
the moment.


thanks, and apologies if this post is not appropriate.


Chris
 
B

Ben Pfaff

Chris Hatton said:
I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me
in the right direction that would be great. I'm not looking for any
code just a couple of links to good websites as my time is very
limited at the moment.

Such code, if written in any kind of sensible way, will probably
be limited by the speed that you can read data from the input
file. I doubt there's much potential for optimization.
 
D

Dave Vandervies

I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

Well, it's for discussion about C. We're not here to teach you the
language or do your homework, but asking questions (even simple ones,
if they're not the ones that are so simple we've seen them INT_MAX times
already) is usually a good way to get an interesting discussion started.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me in
the right direction that would be great. I'm not looking for any code
just a couple of links to good websites as my time is very limited at
the moment.

As noted already, if your code is written sensibly you shouldn't have
much optimizing left to do.

The standard library's I/O system will most likely already be reading
from the file as efficiently as possibly and buffering it for your
character-at-a-time read code, so the main loop will be doing something
like this:
--------
get character from buffer
check whether buffer needs to be refilled (it usually won't)
do stuff with character
--------
....and the "do stuff with character" part would have to be pretty
inefficient for the time taken to do this to be large enough to be
noticeable compared to the time spent refilling the buffer when that's
necessary.


The general rule is to leave the micro-optimization to the compiler,
and leave any other optimization until after it's shown to be needed.


That said, if you really want to make sure that the parts that you
can optimize yourself are as efficient as possible, it's worth noting
that characters are just small integers and that array indices are
also integers. I'll let you figure out where to go from there - if you
haven't already seen something like what I'm thinking of, it'll probably
end up being more valuable as an exercise in finding new ways to solve a
problem (and, hopefully, give you another tool to use for later problems)
than as an optimization exercise. (Be sure to show us[1] your code once
you've finished so we can point out all the problems in it. If you're
just learning the language, there WILL be a bunch.)


dave

[1] Or a local guru, if you have one. Or, better yet, both.
 
C

CBFalconer

Chris said:
I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point
me in the right direction that would be great. I'm not looking for
any code just a couple of links to good websites as my time is very
limited at the moment.

Without any suitable includes etc. Needs limits and stdio at
least.

/* automatically zeroed */
static unsigned long counts[UCHAR_MAX+1];
....
int ch;
FILE *f;
....
/* open files etc */
....
while (EOF != (ch = getc(f))) counts[ch]++;
....
/* display the content of counts */
 
S

Stan Milam

CBFalconer said:
Chris said:
I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point
me in the right direction that would be great. I'm not looking for
any code just a couple of links to good websites as my time is very
limited at the moment.


Without any suitable includes etc. Needs limits and stdio at
least.

/* automatically zeroed */
static unsigned long counts[UCHAR_MAX+1];
....
int ch;
FILE *f;
....
/* open files etc */
....
while (EOF != (ch = getc(f))) counts[ch]++;
....
/* display the content of counts */

Damn! I knew someone would beat me to it. A question I could actually
answer, too.

--
Regards,
Stan Milam
=============================================================
Charter Member of The Society for Mediocre Guitar Playing on
Expensive Instruments, Ltd.
=============================================================
 
I

infobahn

Stan said:
Chris said:
I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. [...]

Without any suitable includes etc. Needs limits and stdio at
least.

/* automatically zeroed */
static unsigned long counts[UCHAR_MAX+1];
....
int ch;
FILE *f;
....
/* open files etc */
....
while (EOF != (ch = getc(f))) counts[ch]++;
....
/* display the content of counts */

Damn! I knew someone would beat me to it. A question I could actually
answer, too.

You missed the chance to crit him, too. The static array makes the
program unsuitable for functionalising; he could easily have got
the desired zeroing behaviour using = {0}. He also forgot to point
out a few rules of optimisation, such as "don't do it", for instance.
=============================================================
Charter Member of The Society for Mediocre Guitar Playing on
Expensive Instruments, Ltd.
=============================================================

Where do I sign up?
 
W

websnarf

Ben said:
Such code, if written in any kind of sensible way, will probably
be limited by the speed that you can read data from the input
file.

This is correct.
[...] I doubt there's much potential for optimization.

But this is not. :)

First of all there is a big difference in performance between using
character by character reading via fgets() and block reading via
fread(). Obviously use the later if possible (you can't use it on
stdin unless you are using POSIX and isatty() returns 0). Ok, second
of all, if the data itself is large enough, then having the whole thing
in memory will cause additional file activity because of memory
swapping. Even data going outside of your CPU's on-chip L1 cache will
slow things down a bit.

So this leads us to the inevitable conclusion that you should read
moderately small sized blocks (something that fits in your L1 cache,
but large enough to marginalize the overhead for calling fread() --
something like a 1K to 4K block is appropriate for just about any
existing system.) Then just do a frequency count of each character in
the block from there.

Hint: You'll *demolish* the performance of what CBFalconer posted if
you do this.
 
C

CBFalconer

.... snip ...

So this leads us to the inevitable conclusion that you should read
moderately small sized blocks (something that fits in your L1 cache,
but large enough to marginalize the overhead for calling fread() --
something like a 1K to 4K block is appropriate for just about any
existing system.) Then just do a frequency count of each character
in the block from there.

Hint: You'll *demolish* the performance of what CBFalconer posted if
you do this.

If you are referring to the reply where I said:

"while (EOF != (ch = getc(f))) counts[ch]++;"

you might be surprised. You will note that I used getc, which may
be a macro, and may operate directly on the system buffers. It
doesn't have to be a macro, that depends on many things including
QOI. If it is, it is likely to expand (inline) to something like:

if (!(f->ix)) _fillbuff(f);
if (f->eof) return EOF;
else return f->buff[ix--];
 
T

Thomas Matthews

Chris said:
I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me in
the right direction that would be great. I'm not looking for any code
just a couple of links to good websites as my time is very limited at
the moment.


thanks, and apologies if this post is not appropriate.


Chris

The ultimate efficiency is to read the entire text file
into memory then process the data, place results in
a buffer, then output the buffer.

The objectives are:
1. Keep the I/O continuous.
Don't break it up.

2. Process data in memory.
In most circumstances, processing data in memory
is faster than processing directly from I/O.

3. Use processor registers.
Design your code so the compiler can maximize the
use of the processor.

4. Use other processors.
If your platform has a DMA device that can place
data from a text file into memory without burdening
the main processor, use it.

Remember these points:
1. Files have start and stop overhead, regardless of
whether they are memory mapped or external devices.

2. Not all platforms have enough extra memory to
load in the largest of text files.

3. Reduce the number of branches / jumps.
All processors love sequential data processing
more than branching to different locations.
See also: Loop Unrolling, Pipelining, Instruction
Cache

4. Beware of task / program swapping.
Swapping your task with another by the operating system
may eat up any time you save by optimizing your program.

5. *** Is this optimization necessary? ***
Don't waste your efforts on unwarranted optimization.
Missing events / interrupts because of slow code is
more important than being a little slow to process
data.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library
 
B

Ben Pfaff

First of all there is a big difference in performance between using
character by character reading via fgets() and block reading via
fread(). Obviously use the later if possible (you can't use it on
stdin unless you are using POSIX and isatty() returns 0).

You'll have to explain your rationale here. I'm not aware of
anything in the standard that says fread() may not be used on an
interactive device.

[...]
So this leads us to the inevitable conclusion that you should read
moderately small sized blocks (something that fits in your L1 cache,
but large enough to marginalize the overhead for calling fread() --
something like a 1K to 4K block is appropriate for just about any
existing system.) Then just do a frequency count of each character in
the block from there.

This is the "sensible" implementation I was referring to.
Hint: You'll *demolish* the performance of what CBFalconer posted if
you do this.

For what it's worth I hadn't looked at CBFalconer's
implementation when I posted, so that wasn't what I was referring
to.
 
D

Dave Vandervies

Chris Hatton wrote:

The ultimate efficiency is to read the entire text file
into memory then process the data, place results in
a buffer, then output the buffer.

The objectives are:
1. Keep the I/O continuous.
Don't break it up.

2. Process data in memory.
In most circumstances, processing data in memory
is faster than processing directly from I/O.

3. Use processor registers.
Design your code so the compiler can maximize the
use of the processor.

4. Use other processors.
If your platform has a DMA device that can place
data from a text file into memory without burdening
the main processor, use it.

All of these will be done to the degree that they're at all useful
by any self-respecting implementation without the programmer having
to do anything about it. (#4 should be done by the OS, #3 by any
self-respecting optimizing compiler, #1 by the stdio implementation,
and #2 comes for free given that you're pulling data out of the stdio
buffer most of the time.)

3. Reduce the number of branches / jumps.
All processors love sequential data processing
more than branching to different locations.
See also: Loop Unrolling, Pipelining, Instruction
Cache

See also: Branch prediction.
A good optimizing compiler targeting hardware that supports it should
give you a "predict taken" at the branch back up to the top of the loop
and a "predict not taken" at the branch to the fill-buffer code, which
in the common case gives you the same benefits as nice sequential code
without actually increasing the code size.



Leave the micro-optimization to the compiler - it's a lot better at it
than you are. (If it's not, it's probably time to start looking for a
better compiler.)


dave
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,159
Messages
2,570,879
Members
47,417
Latest member
DarrenGaun

Latest Threads

Top