Efficency

Chris Hatton · Feb 11, 2005

I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me in
the right direction that would be great. I'm not looking for any code
just a couple of links to good websites as my time is very limited at
the moment.

thanks, and apologies if this post is not appropriate.

Chris

Ben Pfaff · Feb 11, 2005

Chris Hatton said:
I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me
in the right direction that would be great. I'm not looking for any
code just a couple of links to good websites as my time is very
limited at the moment.

Such code, if written in any kind of sensible way, will probably
be limited by the speed that you can read data from the input
file. I doubt there's much potential for optimization.

Dave Vandervies · Feb 11, 2005

I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

Well, it's for discussion about C. We're not here to teach you the
language or do your homework, but asking questions (even simple ones,
if they're not the ones that are so simple we've seen them INT_MAX times
already) is usually a good way to get an interesting discussion started.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me in
the right direction that would be great. I'm not looking for any code
just a couple of links to good websites as my time is very limited at
the moment.

As noted already, if your code is written sensibly you shouldn't have
much optimizing left to do.

The standard library's I/O system will most likely already be reading
from the file as efficiently as possibly and buffering it for your
character-at-a-time read code, so the main loop will be doing something
like this:
--------
get character from buffer
check whether buffer needs to be refilled (it usually won't)
do stuff with character
--------
....and the "do stuff with character" part would have to be pretty
inefficient for the time taken to do this to be large enough to be
noticeable compared to the time spent refilling the buffer when that's
necessary.

The general rule is to leave the micro-optimization to the compiler,
and leave any other optimization until after it's shown to be needed.

That said, if you really want to make sure that the parts that you
can optimize yourself are as efficient as possible, it's worth noting
that characters are just small integers and that array indices are
also integers. I'll let you figure out where to go from there - if you
haven't already seen something like what I'm thinking of, it'll probably
end up being more valuable as an exercise in finding new ways to solve a
problem (and, hopefully, give you another tool to use for later problems)
than as an optimization exercise. (Be sure to show us[1] your code once
you've finished so we can point out all the problems in it. If you're
just learning the language, there WILL be a bunch.)

dave

[1] Or a local guru, if you have one. Or, better yet, both.

CBFalconer · Feb 11, 2005

Chris said:
I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point
me in the right direction that would be great. I'm not looking for
any code just a couple of links to good websites as my time is very
limited at the moment.

Without any suitable includes etc. Needs limits and stdio at
least.

/* automatically zeroed */
static unsigned long counts[UCHAR_MAX+1];
....
int ch;
FILE *f;
....
/* open files etc */
....
while (EOF != (ch = getc(f))) counts[ch]++;
....
/* display the content of counts */

Stan Milam · Feb 12, 2005

CBFalconer said:
Chris said:

I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point
me in the right direction that would be great. I'm not looking for
any code just a couple of links to good websites as my time is very
limited at the moment.

Click to expand...

Without any suitable includes etc. Needs limits and stdio at
least.

/* automatically zeroed */
static unsigned long counts[UCHAR_MAX+1];
....
int ch;
FILE *f;
....
/* open files etc */
....
while (EOF != (ch = getc(f))) counts[ch]++;
....
/* display the content of counts */

Damn! I knew someone would beat me to it. A question I could actually
answer, too.

--
Regards,
Stan Milam
=============================================================
Charter Member of The Society for Mediocre Guitar Playing on
Expensive Instruments, Ltd.
=============================================================

infobahn · Feb 12, 2005

Stan said:
Chris said:

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. [...]

Click to expand...

Without any suitable includes etc. Needs limits and stdio at
least.

/* automatically zeroed */
static unsigned long counts[UCHAR_MAX+1];
....
int ch;
FILE *f;
....
/* open files etc */
....
while (EOF != (ch = getc(f))) counts[ch]++;
....
/* display the content of counts */

Click to expand...

Damn! I knew someone would beat me to it. A question I could actually
answer, too.

You missed the chance to crit him, too. The static array makes the
program unsuitable for functionalising; he could easily have got
the desired zeroing behaviour using = {0}. He also forgot to point
out a few rules of optimisation, such as "don't do it", for instance.

=============================================================
Charter Member of The Society for Mediocre Guitar Playing on
Expensive Instruments, Ltd.
=============================================================

Where do I sign up?

websnarf · Feb 13, 2005

Ben said:
Such code, if written in any kind of sensible way, will probably
be limited by the speed that you can read data from the input
file.

This is correct.

[...] I doubt there's much potential for optimization.

But this is not.

First of all there is a big difference in performance between using
character by character reading via fgets() and block reading via
fread(). Obviously use the later if possible (you can't use it on
stdin unless you are using POSIX and isatty() returns 0). Ok, second
of all, if the data itself is large enough, then having the whole thing
in memory will cause additional file activity because of memory
swapping. Even data going outside of your CPU's on-chip L1 cache will
slow things down a bit.

So this leads us to the inevitable conclusion that you should read
moderately small sized blocks (something that fits in your L1 cache,
but large enough to marginalize the overhead for calling fread() --
something like a 1K to 4K block is appropriate for just about any
existing system.) Then just do a frequency count of each character in
the block from there.

Hint: You'll *demolish* the performance of what CBFalconer posted if
you do this.

CBFalconer · Feb 13, 2005

.... snip ...

So this leads us to the inevitable conclusion that you should read
moderately small sized blocks (something that fits in your L1 cache,
but large enough to marginalize the overhead for calling fread() --
something like a 1K to 4K block is appropriate for just about any
existing system.) Then just do a frequency count of each character
in the block from there.

Hint: You'll *demolish* the performance of what CBFalconer posted if
you do this.

If you are referring to the reply where I said:

"while (EOF != (ch = getc(f))) counts[ch]++;"

you might be surprised. You will note that I used getc, which may
be a macro, and may operate directly on the system buffers. It
doesn't have to be a macro, that depends on many things including
QOI. If it is, it is likely to expand (inline) to something like:

if (!(f->ix)) _fillbuff(f);
if (f->eof) return EOF;
else return f->buff[ix--];

Thomas Matthews · Feb 13, 2005

Chris said:
I realize this is not a newsgroup for the total novice or homework
questions so this will be the last post until I become more familiar
with the language.

I'm looking to optimize a piece of code that reads characters from a
text file and counts the frequency of them. If someone could point me in
the right direction that would be great. I'm not looking for any code
just a couple of links to good websites as my time is very limited at
the moment.

thanks, and apologies if this post is not appropriate.

Chris

The ultimate efficiency is to read the entire text file
into memory then process the data, place results in
a buffer, then output the buffer.

The objectives are:
1. Keep the I/O continuous.
Don't break it up.

2. Process data in memory.
In most circumstances, processing data in memory
is faster than processing directly from I/O.

3. Use processor registers.
Design your code so the compiler can maximize the
use of the processor.

4. Use other processors.
If your platform has a DMA device that can place
data from a text file into memory without burdening
the main processor, use it.

Remember these points:
1. Files have start and stop overhead, regardless of
whether they are memory mapped or external devices.

2. Not all platforms have enough extra memory to
load in the largest of text files.

3. Reduce the number of branches / jumps.
All processors love sequential data processing
more than branching to different locations.
See also: Loop Unrolling, Pipelining, Instruction
Cache

4. Beware of task / program swapping.
Swapping your task with another by the operating system
may eat up any time you save by optimizing your program.

5. *** Is this optimization necessary? ***
Don't waste your efforts on unwarranted optimization.
Missing events / interrupts because of slow code is
more important than being a little slow to process
data.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library

Ben Pfaff · Feb 13, 2005

First of all there is a big difference in performance between using
character by character reading via fgets() and block reading via
fread(). Obviously use the later if possible (you can't use it on
stdin unless you are using POSIX and isatty() returns 0).

You'll have to explain your rationale here. I'm not aware of
anything in the standard that says fread() may not be used on an
interactive device.

[...]

So this leads us to the inevitable conclusion that you should read
moderately small sized blocks (something that fits in your L1 cache,
but large enough to marginalize the overhead for calling fread() --
something like a 1K to 4K block is appropriate for just about any
existing system.) Then just do a frequency count of each character in
the block from there.

This is the "sensible" implementation I was referring to.

Hint: You'll *demolish* the performance of what CBFalconer posted if
you do this.

For what it's worth I hadn't looked at CBFalconer's
implementation when I posted, so that wasn't what I was referring
to.

Dave Vandervies · Feb 14, 2005

Chris Hatton wrote:

The ultimate efficiency is to read the entire text file
into memory then process the data, place results in
a buffer, then output the buffer.

The objectives are:
1. Keep the I/O continuous.
Don't break it up.

2. Process data in memory.
In most circumstances, processing data in memory
is faster than processing directly from I/O.

3. Use processor registers.
Design your code so the compiler can maximize the
use of the processor.

4. Use other processors.
If your platform has a DMA device that can place
data from a text file into memory without burdening
the main processor, use it.

All of these will be done to the degree that they're at all useful
by any self-respecting implementation without the programmer having
to do anything about it. (#4 should be done by the OS, #3 by any
self-respecting optimizing compiler, #1 by the stdio implementation,
and #2 comes for free given that you're pulling data out of the stdio
buffer most of the time.)

3. Reduce the number of branches / jumps.
All processors love sequential data processing
more than branching to different locations.
See also: Loop Unrolling, Pipelining, Instruction
Cache

See also: Branch prediction.
A good optimizing compiler targeting hardware that supports it should
give you a "predict taken" at the branch back up to the top of the loop
and a "predict not taken" at the branch to the fill-buffer code, which
in the common case gives you the same benefits as nice sequential code
without actually increasing the code size.

Leave the micro-optimization to the compiler - it's a lot better at it
than you are. (If it's not, it's probably time to start looking for a
better compiler.)

dave

Seeking co-founders for my company.	3	Sep 8, 2024
Bash scripts for web apps	1	Jan 16, 2023
Upgrading Company's Internal Record Keeping Systems	0	Sep 24, 2021
Intro and asking for direction.(newbie am I)	2	Jun 29, 2020
First time I looked at Python was(...)	32	Jun 9, 2014
To go with Go or C/C++?	147	May 1, 2013
Random walk across array	6	Jul 16, 2010
Odd Macro Problem	3	Jan 29, 2011

Efficency

Chris Hatton

Ben Pfaff

Dave Vandervies

CBFalconer

Stan Milam

infobahn

websnarf

CBFalconer

Thomas Matthews

Ben Pfaff

Dave Vandervies

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads