C++ programming challenge


Chris M. Thomasson

Ioannis Vranos said:
Chris said:
Ioannis Vranos said:
(e-mail address removed) wrote:

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)


OK, I have no problem if you can't provide a better solution; after all, it
is just a challenge. When I say it will not work, I mean you will get
inaccurate results, and there is a chance that the program already does
not work on your platform; that is, there is a chance you are already
getting inaccurate results.

I only care about the platforms in which it works.



Second, are you the same person with the poster I replied
([email protected])?

No. I post with email address of ([email protected]).

What makes you think that? I post a lot over in `comp.programming.threads'.
 

Ioannis Vranos

Chris said:
Ioannis Vranos said:
Chris said:
(e-mail address removed) wrote:

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)


OK, I have no problem if you can't provide a better solution; after
all, it is just a challenge. When I say it will not work, I mean you
will get inaccurate results, and there is a chance that the program
already does not work on your platform; that is, there is a chance you
are already getting inaccurate results.

I only care about the platforms in which it works.



Second, are you the same person with the poster I replied
([email protected])?

No. I post with email address of ([email protected]).

What makes you think that? I post a lot over in `comp.programming.threads'.


Because you replied to a message of mine addressed to (e-mail address removed), as you can also see from
the quotations.


--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Alf P. Steinbach

* Peter Jansson:
My mistake, have fixed it.

I made 2 small changes to Ioannis' program.

It sort of sped up... ;-)


<code>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
#  define FILE_MODE "rb"
#endif

typedef char sassert[
    'Z' - 'A' == 25 &&
    'z' - 'a' == 25 ? 1 : -1
];

//#define BUFFER 65536U
#define BUFFER 1025*1024U // !Alf

#define GET_COUNT(mp_self, mp_index) ( \
    (mp_self)[(mp_index) + 'a'] + \
    (mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {
    if (argc > 1) {
        size_t i;
        unsigned long total = 0;
        unsigned long counts[UCHAR_MAX + 1] = { 0 };
        clock_t const start = clock();
        FILE* file = fopen(argv[1], FILE_MODE);

        if (file) {
            size_t size;
            int status = EXIT_SUCCESS;
            static unsigned char buf[BUFFER];

            setvbuf(file, 0, _IONBF, 0); // !Alf (moved after the NULL check)

            while ((size = fread(buf, 1, BUFFER, file))) {
                size_t tmpsize = size;

                while (tmpsize) {
                    ++counts[buf[--tmpsize]];
                }

                if (size < BUFFER) break;
            }

            if (ferror(file)) status = EXIT_FAILURE;
            if (fclose(file)) status = EXIT_FAILURE;

            if (status == EXIT_SUCCESS) {
                clock_t stop;

                for (i = 0; i < 26; ++i) {
                    total += GET_COUNT(counts, i);
                }

                for (i = 0; i < 26; ++i) {
                    printf("%c %%%.3f\n", (char)(i + 'a'),
                        (total) ? ((double)GET_COUNT(counts, i) / total)
                            * 100.0 : 0.0);
                }

                stop = clock();

                printf("\nelapsed time: %lums\n",
                    (unsigned long)((((double)stop - (double)start)
                        / CLOCKS_PER_SEC) * 1000.0));

                return EXIT_SUCCESS;
            }
        }
    }

    fprintf(stderr, "file error...");

    assert(0);

    return EXIT_FAILURE;
}
</code>



Cheers & hth.,

- Alf

PS: It's possible to optimize even more but what's the point?
 

Alf P. Steinbach

* Ioannis Vranos:
Either you are joking, which is OK, or you made a mistake, because this
has no relevance to my code.

Oh, sorry, perhaps it was someone else's code.

Anyways, it was the code at the top of the website, or perhaps second: I aimed my
mouse at "Ioannis" there but may have missed.

The modified code is about four hundred times faster, near as I can tell. ;-)


Cheers & hth.,

- Alf
 

James Kanze

news:a6dd6d1e-2ebd-4d93-8495-7945ecc3ad59@s16g2000vbp.googlegroups.com...
[...]
(Also, the proposal says nothing about the encoding of the
input file. If he's running under Ubuntu, the default
encoding is UTF-8, and properly handling case folding for
UTF-8 could be significantly more time consuming than a
naïve implementation which ignores the reality of accented
characters.)
Well, then perhaps one can attempt to create a local table and
map in unsigned shorts/UNICODE values directly, or whatever as
concrete indexes into said table?

How to fold UTF-8 input rapidly would be the challenge. There
are over a million code points (and UTF-8 actually supports over
4 billion), so a Unicode character doesn't fit into a short on
most machines, and the tables would be very, very big. And very
sparse, since very few characters (proportionally) actually have
a case difference.
I assume I can create a sufficiently sized table to map some
Unicode character values directly into indexes of a local
counter array. Well, perhaps, of course...

The approach I've taken, in such cases, is to use a trie,
starting with the first byte, and progressing byte by byte.
Whether this is optimal or not, I don't know; it's probably
optimal for the case where encodings greater than 127 are rare,
which is my case, and less optimal otherwise. The more
classical implementation, I think, is to convert the UTF-8 to
UTF-32, and use a two-stage table (with default values, since in
many cases, there can be a good default value covering most
entries). In both cases, implementing something like islower or
toupper can require quite a bit of memory, however, even if only
a small subset of the character codes are involved. (None of
the larger scripts, like Chinese, have case.)
 

James Kanze

On Jun 11, 3:20 am, Ioannis Vranos <[email protected]> wrote:

Not to pick on your code particularly, but in general, in the
context of the challenge, what is done to ensure that the code
is legal and portable C++? That it works correctly with
multibyte characters, or EBCDIC, or with all of the more exotic
implementations?

It may be perfectly valid to not require all of this, but if so,
the acceptable restrictions should be stated up front.
If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it
works.

He's testing on a Linux machine. The default encoding under
Linux is UTF-8, a multibyte encoding. Unless otherwise stated,
it seems safe to assume that code which doesn't handle multibyte
encodings correctly is incorrect.
Now a newer approach, taking word alignment into account.
const int MAX_TEXT_LENGTH = 65536;

Either the name is a lie, or the constraint is unreasonable.
Files of several Gigabytes are frequent today, and I've seen
some which are in the Terabyte range.
int freq[256];

This is actually a count. And on most implementations, an int
will overflow long before you reach the end of a large file.
double pctFreq[256];
int count = 0;
inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {

Always false with EBCDIC ('A' is 193, and 'z' 169).
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}
int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);

And formally, this isn't guaranteed to set the values in pctFreq
to 0.0. This is actually what got me wondering to begin with:
reading a double initialized in this way is formally undefined
behavior, so your program doesn't work. Except that I've
never heard of an implementation where it won't actually work.
If the requirement is "portable C++", then he must provide some
sort of filter which will detect this error, and reject your
program because of it. If the requirement isn't portable C++,
of course, then on Ubuntu, I'd use mmap to read the file. And
if he's willing to restrict the input to the single-byte
encodings of UTF-8, something like blargg's suggestion, combined
with mmap, should be close to unbeatable. Except that if he has
a multicore machine, it might be worth creating some additional
threads (or even forking new processes), each of which handles a
contiguous block of data in the file, and then merging the
results. (Using different processes would even allow using mmap
for files which don't fit into memory.)

Until we know exactly what the requirements are, it's impossible
to write a program and know whether it fulfills these
requirements.

(FWIW: with any reasonably good implementation, I would expect
using std::fill to be faster than memset. As well as working
for all types, and not just for integral types.)
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
clock_t time1, time2;
time1 = clock();
std::ifstream file(argv[1], std::ios::in);
//std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
checkRange(text[i]);


Why not just:
++ freq[ text[ i ] ] ;
here, and do the case folding once, before the output.
(Blargg's suggestion, but the obvious "rapid" solution.)
}
} while (!file.eof());

Almost. (I'm actually impressed that someone got this close to
correct.) If there's a hardware read error, you've got an
endless loop. Just use "while ( file )" as the condition.
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq[i] = round((double(freq[i] + freq[i + diff]) / count)
*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
}
time2 = clock();
double totalTimeInSeconds = (time2 - time1) / CLOCKS_PER_SEC;
std::cout << "\n\nThe whole process took " << std::fixed <<
totalTimeInSeconds << " seconds.\n";
return (0);
}
 

Thomas J. Gritzan

Since this is comp.lang.c++, I will post a high-level C++ solution.

What's high-level about this solution?
do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


Using istream::eof or ::feof() as loop-condition is always wrong.
It might be that the program counts the last block twice, or that it
will be stuck in an infinite loop when the read fails:

http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5
 

Ioannis Vranos

Thomas said:
What's high-level about this solution?


That it uses high-level standard library facilities (std::string, std::vector or valarray, std::ifstream,
etc.), instead of practically reimplementing existing high-level standard library functionality in the code,
using low-level facilities like built-in arrays, getchar(), UCHAR_MAX, etc.


do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


Using istream::eof or ::feof() as loop-condition is always wrong.
It might be that the program counts the last block twice, or that it
will be stuck in an infinite loop when the read fails:

http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5



About the above code:

The worst case is when a read consumes the last character and the next read reads no characters. In this
case, gcount() will be zero, so there will be no side-effects.


The link you provided talks about a while construct, and not a do-while construct. In the do-while case, the
EOF will have already been reached.


That is, the FAQ mentions:


---------- From the FAQ ----------


For example, the following code might have an off-by-one error with the count i:

int i = 0;
while (! std::cin.eof()) { // WRONG! (not reliable)
std::cin >> x;
++i;
// Work with x ...
}

What you really need is:

int i = 0;
while (std::cin >> x) { // RIGHT! (reliable)
++i;
// Work with x ...
}

----------------------------------



If you convert it to do-while the first becomes:


int i = 0;

do
{
std::cin >> x;
++i;
// Work with x ...
}while (! std::cin.eof());


which catches the EOF, since the reading takes place before the condition.



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Ioannis Vranos

blargg said:
[...]
#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)
[...]

Why do you keep posting non-portable programs that assume ASCII?
toupper/tolower are your friends; use them! If you think they pose a
performance problem, you're doing the algorithm wrong, as you should only
combine lower and uppercase counts AFTER you've counted frequencies, not
during.


As a hint to Chris M. Thomasson:


We know that the basic character set is guaranteed to be implemented in the range [0, 127].


Counting all the chars read in the style


// The read characters are stored in the char array buffer.
char buffer[BUFSIZ]= {};

vector<unsigned long> charFrequencies(128);


// We read and store the characters in the char array buffer.
//
// sizeof(buffer)== BUFSIZ.
//
for(size_t i= 0; i< sizeof(buffer); ++i)
charFrequencies[ buffer[i] ]++;


will work for all characters independently of the implementation of the *basic character set* (ASCII, EBCDIC,
whatever).





--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Thomas J. Gritzan

Ioannis said:
Thomas said:
Ioannis said:
do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


By the way, this "not" instead of "!" is horrible. I would write:

while (!inputFile.eof());

Which is still wrong because of eof. Correct is:

while (inputFile);
About the above code:

The worst case is when a read consumes the last character and the next
read reads no characters. In this case, gcount() will be zero, so
there will be no side-effects.

No. The worst case is when the stream sets failbit or badbit, because of
a hardware or read error, as James mentions elsethread.
In this case, the loop will never stop.
The link you provided talks about a while construct, and not a do-while
construct. In the do-while case, the EOF will have already been reached. [...]
If you convert it to do-while the first becomes:


int i = 0;

do
{
std::cin >> x;
++i;
// Work with x ...
}while (! std::cin.eof());

Assuming x equals i... (should we tell the author?)
which catches the EOF, since the reading takes place before the condition.

If the read fails because of eof, x (or i) will contain garbage (like
the previous value) and this value will be evaluated (again).

If the input stream doesn't contain a number, the read fails and failbit
is set but not eofbit, so the loop will run forever.

So the rule is: Never use eof as loop condition.
 

Andrew Tomazos

How to fold UTF-8 input rapidly would be the challenge. There
are over a million code points (and UTF-8 actually supports over
4 billion), so a Unicode character doesn't fit into a short on
most machines, and the tables would be very, very big. And very
sparse, since very few characters (proportionally) actually have
a case difference.

UTF-8 only supports about 1 million code points; 5- and 6-byte
sequences are explicitly illegal.
-Andrew.
 

Andrew Tomazos

He's testing on a Linux machine.  The default encoding under
Linux is UTF-8, a multibyte encoding.  Unless otherwise stated,
it seems safe to assume that code which doesn't handle multibyte
encodings correctly is incorrect.

Who defined that the "default encoding under Linux is UTF-8" ? In
what standard is this specified? Or where did you hear it?
-Andrew.
 

Alf P. Steinbach

* Peter Jansson:
Alf, I have included your modified program as well in the comparison. It
is now on the weblog page.

Uhm, I'm sorry to report that I was led astray by two Windows "features":

1. In Windows, 'clock', used in that program to measure the time, reports
wall time, not process time.

2. For some obscure reason Windows XP, or perhaps the anti-virus, or
whatever, causes a process to run extra slowly the first few times. In
this case I ran the original program twice with a certain medium-size
text file; then with the modification it ran 400 times faster!, which I could
not believe was an arbitrary effect. But now testing it, with a much
larger file, it is as you reported: about the same time.

Sincerely,
Peter Jansson

PS: How would you optimize it even more?

On my machine, Windows XP,

#define BUFFER 16*1024U

for the same program text that you have, yields the best times (statistically).

You might want to put that also in your list. :)

I don't have a BitTorrent client so couldn't obtain your file, but with a half-
gigabyte text file I constructed from a dictionary text, the processing is about
1/27 of the I/O time. So a marginal improvement, on that order, could be to read
e.g. four bytes (a long) at a time from the buffer to reduce memory accesses.

Just to be thorough I tested using the lower level Posix standard open etc., but
it turned out that it was not significantly faster for this than FILE*.

A less insignificant improvement could be to use a memory mapped file, because
that could conceivably leverage some OS virtual memory optimization. However,
while the Boost library does have some functionality for mapping a file to
memory, it beats me how to use that. It's not exactly well-documented.


Cheers & hth.,

- Alf (victim of Windows (the Eagles' newest greatest hit!))
 

Ioannis Vranos

Thomas said:
Ioannis said:
Thomas said:
Ioannis Vranos schrieb:
do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


By the way, this "not" instead of "!" is horrible. I would write:



Horrible? "not" is a *built-in* keyword, like "and" and "or". They are more readable.





No. The worst case is when the stream sets failbit or badbit, because of
a hardware or read error, as James mentions elsethread.
In this case, the loop will never stop.


My code doesn't take into account hardware failures. Also, if I am not wrong, while(inputFile) does not check
for EOF, and neither does while(cin >> something), which the FAQ uses.




--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Chris M. Thomasson

Alf P. Steinbach said:
* Ioannis Vranos:

Oh, sorry, perhaps it was someone else's code.

It's my code that I posted here:

http://groups.google.com/group/comp.lang.c++/msg/2f6fd68d71c477aa



Anyways, it was the code at top at the website, or perhaps second: I aimed
my mouse at "Ioannis" there but may have missed.

The modified code is about four hundred times faster, near as I can tell.
;-)

Four hundred times faster? Well, unfortunately I am not seeing the benefits
on my platform (old P4 HyperThreaded, 3.06 GHz, 512 MB RAM). It actually goes
slower.
 

Chris M. Thomasson

blargg said:
Ioannis Vranos wrote:
[...]
We know that the basic character set is guaranteed to be implemented in
the range [0, 127]. [...]
char buffer[BUFSIZ]= {};
vector<unsigned long> charFrequencies(128); [...]
charFrequencies[ buffer[i] ]++;

will work for all characters independently of the implementation of the
*basic character set* (ASCII, EBCDIC, whatever).


And we can use UCHAR_MAX to ensure we cover ALL char values (cast char to
unsigned char first, of course). No need for any magic constants.

(BTW, your lines are longer than 80 characters, in case you didn't
realize)


I felt no need to cover anything other than plain ol' ASCII. Sorry.
 
