C++ programming challenge

Keith H Duggar · Jun 14, 2009

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

The JT solution has a bug causing it to count only 70294 of the
287924224 total characters in the file. The following statement

r=fread(buffer,BUFSIZE,1,fp))

should be

r=fread(buffer,1,BUFSIZE,fp))

otherwise r is always 1 (except for the last read) causing the
inner for loop to examine only the 1st character of each block.
That is why the JT solution is so much faster than the others.
It is much slower when this error is corrected.

Here is actual counts (rather than %) for the bugged and fixed
JT code (where * are totals):

bugged fixed
* 70294 * 287924224
a 3834 a 15704064
b 644 b 2637824
c 2330 c 9543680
d 1838 d 7528448
e 6452 e 26427392
f 1418 f 5808128
g 1050 g 4300800
h 2118 h 8675328
i 4332 i 17743872
j 56 j 229376
k 354 k 1449984
l 1882 l 7708672
m 1312 m 5373952
n 3804 n 15581184
o 5198 o 21291008
p 1552 p 6356992
q 70 q 286720
r 4358 r 17850368
s 3360 s 13762560
t 4888 t 20021248
u 1648 u 6750208
v 654 v 2678784
w 830 w 3399680
x 112 x 458752
y 1292 y 5292032
z 22 z 90112

Finally the format string in

printf("%c %2.3f%%\n", i, (float)(100*counts) / (float)total );

is bugged. It should be

printf("%c %7.3f%%\n", i, (float)(100*counts) / (float)total );

ie %7.3f not %2.3f since floating point field width is the total
width not the width of the portion left of the decimal point. So
the JT timing results should be removed until fixed; and outputs
should be compared against a known correct output showing giving
the actual counts rather than normalized percentages.

KHD

Ioannis Vranos · Jun 15, 2009

Correction 2:

ios_base::sync_with_stdio(false);

was moved towards the beginning of main(), because the C++ standard mentions:

"Effects: If any input or output operation has occurred using the standard streams prior to the call, the
effect is implementation-defined".

#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// We disable synchronisation with stdio, to speed up C++ standard I/O.
ios_base::sync_with_stdio(false);

cout<< fixed;

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// The C++ basic character set is using the value range [0, 127].
// If we used vector<unsigned long>, it would not have any run-time difference in any modern compiler.
valarray<unsigned long> characterFrequencies(128);

// The array where the read characters will be stored.
char buffer[4* BUFSIZ]= {};

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile)
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer ];

}while(inputFile);

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<< "\n\n\nThe letter frequencies are:\n";

unsigned long totalcharacterFrequencies= 0LU;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ];

for(string::size_type i= 0; i< characters.size(); ++i)
{
double percentage= static_cast<double>(characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ])/ totalcharacterFrequencies* 100;

cout<< characters<< ": "<< percentage<< "%\n";
}

// We "stop" timing.
time2= clock();

// We convert the timing to seconds.
double totalTimeInSeconds= static_cast<double>(time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

James Kanze · Jun 15, 2009

I think you are relaxing the rules. I know Qt programming, and
most (if not all) Linux distributions provide the Qt
libraries.

May I use Qt?

If so... I have a class which manages memory mapping, portably
to Windows and Unix/Linux. Can I use that? (I rather suspect
that a mmap'ed solution would be faster than any of the
alternatives, at least on Linux.)

More generally, what are the rules? In particular:

-- What assumptions can I make concerning the input encoding?
Single byte? Derived from ASCII? None of the proposed
solutions I've seen handle case folding correctly for UTF-8.

-- What assumptions can I make with regards to portability?
(What about a system which doesn't support files, for
example?)

-- What's the largest file size the program must handle? (The
test data is only 1.1GB, which will fit into system cache on
some systems.) If the entire file fits into memory, things
can be done a lot faster.

-- What's the real criteria for "fastest"? If it's only to
read one input file (the one given), then obviously,
pre-calculating the results is the fastest solution;
presumably, that would be considered cheating. But things
like buffer size and the effects of parallelization will
depend on the system; for large enough files, the optimal
solution obviously involves forking/spawning, with separate
processes handling each block, but where the cutoffs will
occur will depend largely on the configuration of the
system: how much real memory is available, etc. (One of the
systems I have available has 32 cores and 48GB main memory.
The trade offs for a 500 GB file on that system are
obviously different than those for a 1 GB file on a twin
core with 2GB main memory.)

James Kanze · Jun 15, 2009

(e-mail address removed)>, (e-mail address removed)
says...

Yet another bit of code:

#include <iostream>
#include <fstream>
#include <vector>
#include <numeric>
#include <iomanip>

class counter {
std::vector<double> counts;
long total;
public:
static const int num = 'z'-'a'+1;

counter() : counts(num), total(0) {}

counter operator+(char c) {
char ch = tolower(c);
if (isalpha(ch)) {
++total;
++counts[ch-'a'];
}
return *this;
}

friend std:stream &operator<<(std:stream &os, counter const &c)
{
for (int i=0; i<num; ++i)
if (isalpha('a'+i))
// The single most complex part of the program is getting the
// output formatted as your web site shows it!
os << char('a'+ i) << " "
<< std::fixed
<< std::setw(6)
<< std::setprecision(3)
<< c.counts/c.total*100.0 << "%\n";
return os;
}
};

int main(int argc, char **argv) {
std::ifstream input(argv[1]);

Click to expand...

counter &counts = std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());

Click to expand...

std::cout << counts;
return 0;
}

Click to expand...

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously)
don't really think so, but I'll admit that especially in the
latter case there's room for question.

Click to expand...

I think it's a good example of the sort of thing accumulate
should be usable for. Note, however, that accumulate will
normally copy your counter twice for each element, which isn't
going to help performance much. In the past, I've used a
somewhat hacky solution to avoid this:

class Counter
{
// as above, with in addition...
class Dummy {} ;
class AddOp
{
Dummy operator+( Counter& dest, char ch ) const
{
dest += ch ;
return Dummy() ;
}
} ;
Counter& operator=( Dummy const& )
{
return *this ;
}
} ;

Then invoke accumulate with Counter::AddOp() as the fourth
argument. It's horrible, but when with my SHA-1 accumulator, it
made more than an order of magnitude of difference in the
performance---and my SHA-1 accumulator was probably a lot
cheaper to copy than your Counter class.

Also, I'd avoid floating point for the accumulation---long long
seems preferable (but since we have no idea what the largest
file size is that needs to be handled, who knows).

I don't have the right sort of machine handy to test it with,
but I believe this should work with something like EBCDIC.
Making it work correctly with something like almost any
Unicode encoding would take substantial modifications.

Click to expand...

It will work for EBCDIC, but it won't work for ISO 8859-1 (which
is probably the most widespread single byte encoding).

Which may or may not be a problem. We still don't know what the
program we're supposed to write is supposed to do.

James Kanze · Jun 15, 2009

[ ... ]

counter operator+(char c) {

Click to expand...

Click to expand...

Oops, that should be:

Click to expand...

counter &operator+(char c) {

Click to expand...

Given the degree to which it's I/O bound, it probably doesn't
make any real difference, but at least in theory this should
improve speed a bit.

Click to expand...

I'm not sure that this will work; at any rate, it's even worse
than my hack, since you've given the + operator very unexpected
semantics. And to get any real benefits, you'd also have to
special case self assignment in your class; the problem is that
your class is expensive to copy, and std::accumulate is designed
to copy---twice for each element, logically.

James Kanze · Jun 15, 2009

[ ... ]
Perhaps a slightly larger change would be worthwhile:

Click to expand...

int main(int argc, char **argv) {
static char buffer[20*1024*1024];

Click to expand...

Why static?

std::ifstream input(argv[1]);
input.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
input.sync_with_stdio(false);

Click to expand...

std::cout << std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());
return 0;
}

Click to expand...

It's hard to guess exactly how much difference those changes
will make, but they might be significant.

Click to expand...

The real problem is that they might be very significant on your
system, and actually reduce performance on his, or vice versa.
The real problem is that the tunings which will most affect the
performance will depend on the configuration: the system, but
also the amount of real memory available, etc. So writing code
on our machines which will be timed on his is really stabbing in
the dark. (That's also a bit of what's behind my "why static".
I know why, of course---some machines won't support that much
memory on the stack. But mine will, and so will the machine
he's testing it on. And of course, some machines won't accept a
static variable of that size, either.)

Of course, it's all a bit pointless -- if you're really
processing large files often enough to care about speed, some
non-portable code will usually make a much larger difference
than anything like this.

Click to expand...

The point is that any I/O tuning you do is really not portable.
You might be able to improve performance by changing the size of
the buffer, but the optimal size will depend on the
configuration where the program runs. And you might loose.

James Kanze · Jun 15, 2009

Even programs which allow one to customize the mode via
pre-defined macros?

That leads to an interesting question. Can I do something like:

#if defined(_POSIX_VERSION) || defined(__linux__)
// mmap based implementation
#else
// iostream based implementation
#endif
?

James Kanze · Jun 15, 2009

* Ioannis Vranos:

In principle you may be right, because binary mode won't
necessarily work on a system where a file is not simply a
sequence of bytes.

But in practice, for systems like Windows and *nix, binary
mode just works, and for systems like Windows it's required to
avoid time-consuming conversions of e.g. newlines on some
systems.

If you're restricting yourself to systems like Windows and
*nix... I've a class MemoryMappedData with implementations for
both Windows and *nix, and I'm willing to bet that using it,
instead of iostream or stdio.h, could make a significant
difference.

But since the original poster hasn't deigned to specify exactly
what he meant by portability, all we can do is guess whether
such code is acceptable or not.

Since the programs are not counting lines binary mode has no
other impact.

There's likely to be a difference in the frequency of '\r'

.
(Counting lines is one thing where it won't make a difference,
at least under Windows. There will still be the same number of
'\n'.)

James Kanze · Jun 15, 2009

I will keep the binary mode programs in the list since it
interesting to see if it makes any speed difference compared
to the non-binary mode programs.

You're testing on Linux. Under Linux, binary mode and text mode
are identical; the flag is ignored. Under Windows, it could
very well make a difference in performance. And under systems
other than Unix and Windows, it could mean that you can't open a
text file.

James Kanze · Jun 15, 2009

* Ioannis Vranos:

Text files are more portable than binaries.

That's debatable. Files are portable when their format is
strictly specified, text or binary has very little to do with
it. (And of course, in C++, files created in binary mode often
use a text format.) In many senses, text files are less
portable, because considerations like encoding enter into
consideration.

However, text files are not magically portable.

The two main issues affecting text file portability are
character encoding and end-of-line convention.

And just about every other formatting consideration.

I'm sorry, that's incorrect: first, binary mode is never
implemented as text mode, so that part is simply wrong,

He's probably referring to Unix, where files created in binary
mode and in text mode are identical. The fact is that an
implementation under Unix will normally ignore the mode flag.

and although reading trash (garbage data) is possible on some
systems, those archaic systems are not very common, to put it
mildly, so the "either/or" is simply wrong.

First, of course, the systems aren't "archaïc", since they're
still being sold today. But the issue is more complex. If you
read and write a file under a given system, what you get when
changing modes is implementation defined. I could easily create
an input file for the current program which would give radically
different results depending on the mode used to read it under
Windows. (Just put a 0x1A near the beginning.) But that
doesn't mean that what you're reading is "trash"; on all the
systems I know, text files (and binary files) obey specific
rules; if you know the rules, you know what to expect when you
read using a different mode.

[...]

There are systems where opening a text file as binary may not
work. In the programs submitted the use of binary mode has
been intentional and not a mistake.

The question is more: is a program which opens the file in
binary mode conform to the specifications. And we can't really
give an answer, since the specifications are too vague.

Alf P. Steinbach · Jun 15, 2009

* James Kanze:

If you're restricting yourself to systems like Windows and
*nix... I've a class MemoryMappedData with implementations for
both Windows and *nix, and I'm willing to bet that using it,
instead of iostream or stdio.h, could make a significant
difference.

But since the original poster hasn't deigned to specify exactly
what he meant by portability, all we can do is guess whether
such code is acceptable or not.

There's likely to be a difference in the frequency of '\r' .
(Counting lines is one thing where it won't make a difference,
at least under Windows. There will still be the same number of
'\n'.)

Counting lines is one instance where it makes a difference, because for old Mac
files you have to count the '\r' chars, and so excepting conditional compilation
that involves an extra choice, hence extra processing.

Cheers & hth.,

- Alf

Alf P. Steinbach · Jun 15, 2009

* James Kanze:

That's debatable.

Nope.

And I note that you've even said so yourself.

Files are portable when their format is
strictly specified, text or binary has very little to do with
it. (And of course, in C++, files created in binary mode often
use a text format.) In many senses, text files are less
portable, because considerations like encoding enter into
consideration.

Encoding is an issue, as I noted below, but how the data is encoded as bytes is
more of an issue with a binary file, as you well know.

And just about every other formatting consideration.

I'm sorry, that's an invalid argument. It amounts to saying that it's possible
to define a text file format that's difficult to parse. That applies equally to
any representation whatsoever, and hence doesn't favor any representation.

However, in the other direction, choosing a format that can be easily and
portably parsed is easy with text files, and practically impossible with binary
files.

He's probably referring to Unix, where files created in binary
mode and in text mode are identical. The fact is that an
implementation under Unix will normally ignore the mode flag.

Yes, but even in *nix binary mode is not implemented as text mode (in *nix that
distinction is invalid concerning the behavior, and is just plain wrong at the
abstract C++ level).

First, of course, the systems aren't "archaïc", since they're
still being sold today.

Oh yes they are.

But the issue is more complex. If you
read and write a file under a given system, what you get when
changing modes is implementation defined. I could easily create
an input file for the current program which would give radically
different results depending on the mode used to read it under
Windows. (Just put a 0x1A near the beginning.)

Such a file would, in Windows, qualify as a (partially) binary file. Anyways, it
would not be a pure text file since it would contain data that according to one
old convention were not meant to be parsed as part of the primary text stream.
Another way to embed data (textual or not) not part of the primary text stream
is to use additional NTFS streams in the file, or on a Mac, a resource fork, and
since this data is part of the "file" there is a problem wrt. the spec... ;-)

But that
doesn't mean that what you're reading is "trash"; on all the
systems I know, text files (and binary files) obey specific
rules; if you know the rules, you know what to expect when you
read using a different mode.

[...]

There are systems where opening a text file as binary may not
work. In the programs submitted the use of binary mode has
been intentional and not a mistake.

Click to expand...

The question is more: is a program which opens the file in
binary mode conform to the specifications. And we can't really
give an answer, since the specifications are too vague.

It seems you've overlooked the website's specification of ASCII data.

That makes it pretty clearcut regarding the data, since counting lines is not
part of the problem (no impact on portability for byte-sequence-file systems).

It does not however solve the pedantic problem of what constitutes a "file".
Regarding that, however, one may note that none of the submissions have a
problem with that. It's simply assumed that the ordinary C and C++ notion of a
file as a sequence of bytes applies, and possibilities such as fixed size record
files, ^Z-terminated text, additional streams, or resource forks, are ignored.

Cheers & hth.,

- Alf

Jonathan Lee · Jun 15, 2009

For what it's worth I also coded up a version

that read longs instead of chars, and
accumulated 16 bits at a time in the "count"
array (fixing the calculation at the end).

Since the strategy has changed for the testing,
here's the code I previously referred to. Runs
faster than any of the others I've tried (on
my computer, of course).

Uses some ugly casting which I'm sure some will
disagree with. Also, it assumes longs are no
more than 64 bits. And it assumes chars are 8
bits. Whatever. It's more to demonstrate the
idea than anything else.

// -----------------------------------------------

#include <iostream>
#include <fstream>
using namespace std;

#define BUFFSIZE 4096

char const* digit = "0123456789";
char const* alpha = "abcdefghijklmnopqrstuvwxyz";
char const* ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWZYZ";

int main(int argc, char* argv[]) {
if( argc <= 1 )
return (cout << "Please provide a file name" << endl, 1);

char buff[BUFFSIZE];
long count[65536] = {0};
long total_alpha[26] = {0};
long count2[256] = {0};

CStopWatch s;
s.startTimer();

ifstream f(argv[1]);

if (f.good()) {
do {
f.read(buff, BUFFSIZE);
streamsize bytesread = f.gcount();
streamsize i = 0;
for (; i < bytesread; i += sizeof(unsigned
long)) {
unsigned long c = *(reinterpret_cast<unsigned long*>(buff + i));
unsigned long d = c >> 32;
++count[c & 0xFFFF];
++count[d & 0xFFFF];
c >>= 16;
d >>= 16;
++count[c & 0xFFFF];
++count[d & 0xFFFF];
}
for (; i < bytesread; ++i)
++count[buff];
} while (!f.eof());

for (int i = 0; i < 65536; ++i) {
count2[i & 0xFF] += count;
count2[(i >> 8) & 0xFF] += count;
}
unsigned long total = 0;
for (int i = 0; i < 26; ++i) {
unsigned long x = count2[alpha] + count2
[ALPHA];
total_alpha = x;
total += x;
}
float p2 = 100.0f / total;
for (int i = 0; i < 26; ++i) {
float p = p2 * total_alpha;
cout << alpha << ' ' << p << '%' << '\n';
}
cout << endl;
}
s.stopTimer();
cout << "Time elapsed: " << s.getElapsedTime() << endl <<
endl;

return 0;
}

Chris M. Thomasson · Jun 15, 2009

Jonathan Lee said:
For what it's worth I also coded up a version
that read longs instead of chars, and
accumulated 16 bits at a time in the "count"
array (fixing the calculation at the end).

Click to expand...

Since the strategy has changed for the testing,
here's the code I previously referred to. Runs
faster than any of the others I've tried (on
my computer, of course).

Uses some ugly casting which I'm sure some will
disagree with. Also, it assumes longs are no
more than 64 bits. And it assumes chars are 8
bits. Whatever. It's more to demonstrate the
idea than anything else.
[...]

For some reason I cannot get this to work on my system. I placed the
following line:

cout << "read: " << bytesread << endl;

right after the call to:

streamsize bytesread = f.gcount();

And I get never ending output of:

read: 0
read: 0
read: 0
read: 0
read: 0
read: 0
read: 0
read: 0
read: 0

Not sure whats going on.

Jonathan Lee · Jun 15, 2009

For some reason I cannot get this to work on my system.

...
Not sure whats going on.

Nor I. It works on two computers I've tried. A Intel Core 2 Duo, 64-
bit. And a Xeon Quad, 32-bit. The latter was tested from copying and
pasting the above (i.e., the source, as posted, works at least on one
computer :/ )

The output you described would mean that the file hasn't reached EOF
and yet it is reading 0 bytes... Erm... don't know why that would be
right now.. ??

Maybe I've got a buffer overrun and the ifstream is getting corrupted.
I'll take a closer look later.

Nils · Jun 15, 2009

Hi,

here's my code. StopWatch source needs to be added on top of the file. I
cannot post it as comment on the page because it says my "html tag"
<fstream> is not accepted... Oo...

I would prefer to start the timing *after* the file has been read,
because this way it's too I/O limited.

Regards,

Nils

//////////

#include <fstream>
#include <iostream>

#include <time.h>

using namespace std;

static const size_t g_blockSize = 1024 * 1024; // 1MB

static char g_buf[g_blockSize];

int main(int argc, char* argv[])
{
if( argc < 2 )
{
std::cout << "usage: <app> <text file>" << std::endl;
return -1;
}

unsigned int letterFreq[256];
memset( letterFreq, 0, sizeof(letterFreq) );

// start counting time
CStopWatch s;
s.startTimer();

FILE* hFile = fopen( argv[1], "rb" ); // binary mode please

if( 0 == hFile )
{
std::cout << "file << " << argv[1] << "couldn't be opened." << std::endl;
return -1;
}

// count chars
size_t totalReadAmount = 0;

size_t countRead = fread( g_buf, 1, g_blockSize, hFile );

while( countRead > 0 )
{
totalReadAmount += countRead;

// count chars
int remaining = (int)countRead;

{
char* ptr = g_buf;

size_t n = int(countRead+7)/8;
switch(countRead&7)
{
case 0: do{ letterFreq[ *ptr++ ]++;
case 7: letterFreq[ *ptr++ ]++;
case 6: letterFreq[ *ptr++ ]++;
case 5: letterFreq[ *ptr++ ]++;
case 4: letterFreq[ *ptr++ ]++;
case 3: letterFreq[ *ptr++ ]++;
case 2: letterFreq[ *ptr++ ]++;
case 1: letterFreq[ *ptr++ ]++;
}while( --n > 0 );
}
}

countRead = fread( g_buf, 1, g_blockSize, hFile );
}

// compute result in percent
const double hunderedDivFileSize = 100.0 / double(totalReadAmount);

for( char i='a'; i<='z'; i++ )
{
unsigned int totalLetterFreq = letterFreq + letterFreq[ i + 'A' -
'a' ];

const double freq = hunderedDivFileSize * double(totalLetterFreq);

cout << i << ' ' << freq << "%" << std::endl;
}

s.stopTimer();

cout << "Total Time: " << s.getElapsedTime() << std::endl;

fclose( hFile );

return 0;
}

Chris M. Thomasson · Jun 15, 2009

Peter Jansson said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Here is one more submission:
________________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>
#include <ctype.h>

#define CONCATX(mp_token1, mp_token2)mp_token1 ## mp_token2
#define CONCAT(mp_token1, mp_token2)CONCATX(mp_token1, mp_token2)

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

#define BUFFER 65536U

#define ABC_LOWER ("abcdefghijklmnopqrstuvwxyz")
#define ABC_UPPER ("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

#define ABC_COUNT(mp_index, mp_case) ( \
counts[(unsigned char)(CONCAT(ABC_, mp_case)[(mp_index)])] \
)

#define CALC_FREQ(mp_index, mp_scale) ( \
((double)ABC_COUNT(mp_index, LOWER) * (double)(mp_scale)) \
)

int main(int argc, char** argv) {
if (argc > 1) {
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
unsigned char buf[BUFFER];
unsigned long counts[UCHAR_MAX + 1] = { 0 };
size_t size;
int status = EXIT_SUCCESS;
unsigned long total = 0;

setvbuf(file, 0, _IONBF, 0);

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize;

for (tmpsize = 0; tmpsize < size; ++tmpsize) {
++counts[buf[tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
size_t i;
double scale;

for (i = 0; i < 26; ++i) {
ABC_COUNT(i, LOWER) += ABC_COUNT(i, UPPER);
total += ABC_COUNT(i, LOWER);
}

scale = 100.0 / (double)total;

for (i = 0; i < 26; ++i) {
printf("%c %%%.3f\n", ABC_LOWER, CALC_FREQ(i, scale));
}

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}

________________________________________________________________________

It should improve the performance time of my previous submissions by a
little bit. It also should work with a variety of character sets.

Jerry Coffin · Jun 15, 2009

[ ... ]

I think it's a good example of the sort of thing accumulate
should be usable for. Note, however, that accumulate will
normally copy your counter twice for each element, which isn't
going to help performance much. In the past, I've used a
somewhat hacky solution to avoid this:

Right -- as you've seen, I used a truly crufty hack to avoid it, which
worked in this case (at least with the compiler I tested on -- I'd have
to do a bit to work to be sure whether it's really guaranteed to work).

My comment about this being an abuse of accumulate wasn't about the
basic idea of using accumulate to um...accumulate the statistics -- to
me, that seemed perfectly straightforward. It was about the contortion
of having operator+ return a reference to its input to reduce copying.

[ ... ]

It will work for EBCDIC, but it won't work for ISO 8859-1 (which
is probably the most widespread single byte encoding).

Right -- as it stands, it assumes that all lower case letters fall in
the range 'a'..'z', which is true for ASCII and EBCDIC, but not much of
anything else. Of course, if you wanted to make it work for ISO 8859,
you'd just use UCHAR_MAX instead, and use (or cast to) unsigned char
instead of char in a couple of places. You'd only be using a fairly
small fraction of the allocated memory, but for typical values of
UCHAR_MAX (e.g., 255) it's not a big deal, at least a long as you avoid
copying much.

After more thought, however, I suspect this might work better for
Unicode/ISO 10646 than I'd previously assumed, at least on most typical
systems. The problem is that the array usage is quite sparse. As long as
we're careful, however, we can get the virtual memory manager to handle
the sparseness for us.

As long as we never "touch" -- not even intialize -- the parts of the
array that don't represent alphabetic characters, the virtual memory
manager won't (at least in a typical case) allocate any physical memory
for that address. Even though the alphabetic characters aren't all
arranged in one place, I believe they are arranged into clusters --
which should work pretty well for a virtual memory manager.

I think I might have to write a bit of test code, to figure out how much
memory this really uses with something like UCS-4 (which I think would
be about the worst case).

Thomas J. Gritzan · Jun 15, 2009

Jonathan said:
Nor I. It works on two computers I've tried. A Intel Core 2 Duo, 64-
bit. And a Xeon Quad, 32-bit. The latter was tested from copying and
pasting the above (i.e., the source, as posted, works at least on one
computer :/ )

The output you described would mean that the file hasn't reached EOF
and yet it is reading 0 bytes... Erm... don't know why that would be
right now.. ??

Maybe I've got a buffer overrun and the ifstream is getting corrupted.
I'll take a closer look later.

Because you used fstream::eof in a wrong way, as we concluded in the
thread "Correction of FAQ - [15.5]".

The loop should be:

do {
// ...
} while (f);

instead of
/*...*/ while (!f.eof());

I wonder where this wrong usage of eof comes from. Is it some bad book
or something?

Jonathan Lee · Jun 15, 2009

Because you used fstream::eof in a wrong way, as we concluded in the

thread "Correction of FAQ - [15.5]".

Right you are. Unfortunately, while this explains the loop of 0s it
doesn't explain why the read failed.

I wonder where this wrong usage of eof comes from. Is it some bad book
or something?

No, not a bad book. It's because we (I) want to "do something while we
haven't reached the end of the file". The code reflects that. As I see
it, the problem actually is with assuming the read occurred without
error. Of course, your suggestion admirably handles the error case and
the EOF at the same time. But I'm sure you can see why it was coded
that way.

Thanks for the tip

--Jonathan

C programming in 2011	148	May 26, 2011
Ann: Registration for the Python Game Programming Challenge is nowopen!	0	Jul 29, 2005
Requirement @ Appulse	0	Jul 24, 2008
Some errors in MIT's intro C++ course	109	Sep 8, 2010
Looking for C++ Professionals (Bangalore)	4	Aug 17, 2007
C language now truly universal	0	Jan 1, 2011
Seek Contract Programming Work - 17 Years Experience	0	Feb 22, 2005
Fundamentals of Financial Management Concise 7e Brigham Houston	0	May 1, 2011

C++ programming challenge

Keith H Duggar

Ioannis Vranos

James Kanze

James Kanze

James Kanze

James Kanze

James Kanze

James Kanze

James Kanze

James Kanze

Alf P. Steinbach

Alf P. Steinbach

Jonathan Lee

Chris M. Thomasson

Jonathan Lee

Nils

Chris M. Thomasson

Jerry Coffin

Thomas J. Gritzan

Jonathan Lee

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads