C++ programming challenge

Ioannis Vranos · Jun 11, 2009

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least )

Here's the new code (just added the binary open mode)

Binary mode isn't the same as text mode (it will not work) in all platforms.

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Ioannis Vranos · Jun 11, 2009

Chris said:
Chris M. Thomasson said:

Chris M. Thomasson said:

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Here is a stupid simple solution; needs ASCII:
[...]

Click to expand...

Click to expand...

Here, let me repost this crappy little thing using a bigger buffer,
without using any asserts and calls to printf during the actual
processing phase:
_________________________________________________________________________
#include <stdio.h>

#define BUFFER 65536U

static
unsigned long
count_chars(
void const* buf,
size_t size,
unsigned long* counts
) {
unsigned char const* cur = buf;
unsigned char const* const end = cur + size;
unsigned long total = 0;
while (cur < end) {
if (*cur >= 'a' && *cur < 'a' + 26) {
++total;
++counts[*cur - 'a'];
} else if (*cur >= 'A' && *cur < 'A' + 26) {
++total;
++counts[*cur - 'A'];
}
++cur;
}
return total;
}

int main(void) {
size_t i;
FILE* file;
unsigned long total = 0;
unsigned long counts[26] = { 0 };

file = fopen("./data.txt", "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
total += count_chars(buf, size, counts);
if (feof(file) || ferror(file)) break;
}

fclose(file);
}

for (i = 0; i < 26; ++i) {
printf("%c %%%f\n", i + 'a',
(total) ? ((double)counts / total) * 100 : 0.0);
}

return 0;
}
_________________________________________________________________________

It doesn't compile:

john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
main.cc: In function ‘long unsigned int count_chars(const void*, size_t, long unsigned int*)’:
main.cc:14: error: invalid conversion from ‘const void*’ to ‘const unsigned char*’
main.cc: In function ‘int main()’:
main.cc:53: warning: format ‘%c’ expects type ‘int’, but argument 2 has type ‘long unsigned int’
john@john-laptop:~/Projects/anjuta/cpp/src$

Also binary reading isn't the same with text reading in all platforms (that is, the program will not work).

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Ioannis Vranos · Jun 11, 2009

Ioannis said:
... and is platform specific (ASCII).

I mean your code is ASCII-specific. Not that text reading is ASCII-specific.

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Chris M. Thomasson · Jun 11, 2009

Ioannis Vranos said:
Chris said:

Chris M. Thomasson said:

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Here is a stupid simple solution; needs ASCII:
[...]

Click to expand...

Here, let me repost this crappy little thing using a bigger buffer,
without using any asserts and calls to printf during the actual
processing phase:
_________________________________________________________________________
[...]
_________________________________________________________________________

Click to expand...

It doesn't compile:

john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
main.cc: In function ‘long unsigned int count_chars(const void*, size_t,
long unsigned int*)’:
main.cc:14: error: invalid conversion from ‘const void*’ to ‘const
unsigned char*’
main.cc: In function ‘int main()’:
main.cc:53: warning: format ‘%c’ expects type ‘int’, but argument 2 has
type ‘long unsigned int’
john@john-laptop:~/Projects/anjuta/cpp/src$

Its a C program; use gcc.

Also binary reading isn't the same with text reading in all platforms
(that is, the program > will not work).

I don't see a real need to open for text reading. Oh well, shi% happens.

Chris M. Thomasson · Jun 11, 2009

Ioannis Vranos said:
Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least )

Here's the new code (just added the binary open mode)

Click to expand...

Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)

Andrew Tomazos · Jun 11, 2009

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

With kind regards,
Peter Jansson

If you like these sort of mini-program competitions go and take a look
at the algorithm arena at TopCoder <http://www.topcoder.com>.

I warn you - you have to be prepared to have your ass handed to you by
some prepubescent Russian wunderkind. If you want your programming
ego intact *stay away* from that place.

-Andrew.

giannis t · Jun 11, 2009

giannis said:
giannis said:

check this out:

Click to expand...

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>

Click to expand...

const int MAX_TEXT_LENGTH = 8192;

Click to expand...

int freq[256];
double pctFreq[256];

Click to expand...

int main(int argc, char** argv) {
for (int i = 0; i < 256; i++) {
freq = 0;
pctFreq = 0;
}
unsigned char text[MAX_TEXT_LENGTH];
int count = 0;
char c;
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
std::ifstream file(argv[1], std::ios::binary);
for (; file.read((char*) text, MAX_TEXT_LENGTH) {
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
c = text;
if ((c >= 'A') && (c <= 'Z')) {
++freq[c];
count++;
} else if ((c >= 'a') && (c <= 'z')) {
++freq[c - diff];
count++;
}

Click to expand...

}
}
file.close();
for (int i = ('A'); i <= 'Z'; i++) {
pctFreq = round((double(freq) / count)*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq << "%\n";
}
return (EXIT_SUCCESS);
}

Click to expand...

Reading a text file as binary, is not guaranteed to work. It just happens that binary reading is implemented
as text reading in your platform.

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it works. Now
a newer approach taking word alignment in mind.

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>
#include <ctime>

const int MAX_TEXT_LENGTH = 65536;

int freq[256];
double pctFreq[256];
int count = 0;

inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}

int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
clock_t time1, time2;
time1 = clock();
std::ifstream file(argv[1], std::ios::in);
//std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
checkRange(text);
}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq = round((double(freq + freq[i + diff]) / count)
*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq << "%\n";
}
time2 = clock();
double totalTimeInSeconds = (time2 - time1) / CLOCKS_PER_SEC;
std::cout << "\n\nThe whole process took " << std::fixed <<
totalTimeInSeconds << " seconds.\n";
return (0);
}

Chris M. Thomasson · Jun 11, 2009

Chris M. Thomasson said:
Peter Jansson said:

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Click to expand...

Well, try this one out too:
_______________________________________________________________________________
#include <stdio.h>
#include <limits.h>

#define BUFFER 65536U

int main(int argc, char** argv) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
FILE* file = fopen(argv[1], "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {

while (size) {
--size;
++counts[buf[size]];
}

if (size < BUFFER) {
if (feof(file) || ferror(file)) break;
}

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

lol. How is the that last if block ever going to executed!

[...]
_______________________________________________________________________________

It just might beat my previous submission by a little bit...

This one actually tries to check for errors:
_______________________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];

#define BUFFER 65536U

#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {
if (argc > 1) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
for (i = 0; i < 26; ++i) {
total += GET_COUNT(counts, i);
}

for (i = 0; i < 26; ++i) {
printf("%c %%%.3f\n", (char)(i + 'a'),
(total) ? ((double)GET_COUNT(counts, i) / total) * 100.0 : 0.0);
}

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
_______________________________________________________________________________

GRRRR!

giannis t · Jun 11, 2009

This is the best I have:

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>
#include "CStopWatch.hpp"

const int MAX_TEXT_LENGTH = 65536;

int freq[256];
double pctFreq[256];
int count = 0;

int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
CStopWatch s;
s.startTimer();
//std::ifstream file(argv[1], std::ios::in);
std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
++freq[text];
}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
freq += freq[i + diff];
count += freq;
}
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq = round((double(freq) / count)*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq << "%\n";
}
s.stopTimer();
std::cout << "Time [microseconds]: " << s.getElapsedTime()*1.0E6
<< '\n';
return (EXIT_SUCCESS);
}

Chris M. Thomasson · Jun 11, 2009

Here is my version with some timing code (e.g., `clock()'):
________________________________________________________________
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];

#define BUFFER 65536U

#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {
if (argc > 1) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
clock_t const start = clock();
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
clock_t stop;

for (i = 0; i < 26; ++i) {
total += GET_COUNT(counts, i);
}

for (i = 0; i < 26; ++i) {
printf("%c %%%.3f\n", (char)(i + 'a'),
(total) ? ((double)GET_COUNT(counts, i) / total)
* 100.0 : 0.0);
}

stop = clock();

printf("\nelapsed time: %lums\n",
(unsigned long)((((double)stop - (double)start)
/ CLOCKS_PER_SEC) * 1000.0));

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
________________________________________________________________

James Kanze · Jun 11, 2009

Peter Jansson wrote:

[...]
This seems silly. Either you have an array integers and
increment based on the incoming character (read in large
chunks for speed), then later combine the upper- and
lower-case letters, or simply print out a pre-computed list of
frequencies, as the rules state that a known file will be fed
to the program. In the former (non-cheating) case, the
program's run-time will be based almost entirely on I/O speed,
and thus simply a measure of the method used to read the file.
In C++, there are only two ways: iostreams and C-style FILEs.

And any differences in speed between iostreams and FILEs will
depend on the library implementation, not my code.

On the other hand, he did say that the actual test would be run
on Ubantu, a flavor of Linux. So the fastest solution is
probably to use mmap.

But as you say, the test is silly. The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)

James Kanze · Jun 11, 2009

Peter Jansson wrote:
[...]

Click to expand...

[...]

Click to expand...

Here's a simple solution that reads from stdin and doesn't do
any timing (for that, just run with the Unix "time" command).
I see no reason to complicate it with iostreams.

How would iostream represent any additional complication? I
think the code would actually be slightly simpler.

If speed were really a problem, I'd experiment with mmap.

#include <limits.h>
#include <stdio.h>
#include <ctype.h>

int main()
{
size_t f [UCHAR_MAX+1] = { 0 };
size_t size = 0;
int c;

while ( (c = getchar()) != EOF )

With iostream, these two lines would be:

char c ;
while ( std::cin.get( c ) )

That looks even simpler than what you've written.

Juha Nieminen · Jun 11, 2009

James said:
The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

I have yet to see a compiler/system where C++ streams are actually
faster then C streams. Thus it's pretty safe to say that if you want
good I/O speed, using C streams will give you an edge in most, if not
all systems. Even if there was a system where C++ streams are as fast as
C streams, you wouldn't be losing anything in speed even there.

(Of course a different issue is that with C streams your code will
become more complicated and error-prone.)

Chris M. Thomasson · Jun 11, 2009

James Kanze said:
[...]

(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)

Well, then perhaps one can attempt to create a local table and map in
unsigned shorts/UNICODE values directly, or whatever as concrete indexes
into said table? I assume I can create sufficient sized table to map some
UNICODE character values directly into indexes of a local counter array.
Well, perhaps of course...

:^)

Chris M. Thomasson · Jun 11, 2009

Juha Nieminen said:
James said:

The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

Click to expand...

[...]

(Of course a different issue is that with C streams your code will
become more complicated and error-prone.)

Perhaps... ;^o

Yikes!!!!!!!!!!!!

Chris M. Thomasson · Jun 11, 2009

Chris M. Thomasson said:
James Kanze said:

[...]

Click to expand...

(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)

Click to expand...

Well, then perhaps one can attempt to create a local table and map in
unsigned shorts/UNICODE values directly, or whatever as concrete indexes
into said table? I assume I can create sufficient sized table to map some
UNICODE character values directly into indexes of a local counter array.
Well, perhaps of course...

Assume NO hash collisions...

Bo Persson · Jun 11, 2009

giannis said:
inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}

You haven't seen this page, have you?

http://en.wikipedia.org/wiki/EBCDIC_500

Bo Persson

Ioannis Vranos · Jun 11, 2009

Chris said:
Ioannis Vranos said:

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least )

Here's the new code (just added the binary open mode)

Click to expand...

Binary mode isn't the same as text mode (it will not work) in all
platforms.

Click to expand...

I only care about the platforms in which it does work. ;^)

OK, I have no problem if you can't provide a better solution, after all it is just a challenge. When I say it
will not work, I mean you will get inaccurate results, and there is a chance that the program already does not
work in your platform, that is there is a chance you are already getting inacaccurate results.

Second, are you the same person with the poster I replied ([email protected])?

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Ioannis Vranos · Jun 11, 2009

Peter said:
Dear news group,

I have now included all programs submitted so far in the challenge. 10
programs up to now, with 9 of them written in this group thread. You
can view the most recent results here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Sincerely,
Peter Jansson

There is an error in your blog. Code named "Ioannis A. Vranos, Code 3 (C++): 2 s" isn't mine.

Also the "Ioannis A. Vranos, Code 4 (C++): 1 s"

is my original code improved (you may say corrected),

so you can erase the original code "Ioannis A. Vranos, Code 2 (C++): 13 s" and "Ioannis A. Vranos, Code 1
(C++): 13 s" (both of which were essentially the same and thus were in the same message).

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Ioannis Vranos · Jun 11, 2009

Corrected:

==> There is an error in your blog. Code named "Ioannis A. Vranos, Code 4
==> (C++): 2 s" isn't mine.==> Also the "Ioannis A. Vranos, Code 3 (C++): 2 s"==> is my original code improved (you may say corrected),

so you can erase the original code "Ioannis A. Vranos, Code 2 (C++): 13
s" and "Ioannis A. Vranos, Code 1 (C++): 13 s" (both of which were
essentially the same and thus were in the same message).

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

C programming in 2011	148	May 26, 2011
Ann: Registration for the Python Game Programming Challenge is nowopen!	0	Jul 29, 2005
Requirement @ Appulse	0	Jul 24, 2008
Some errors in MIT's intro C++ course	109	Sep 8, 2010
Looking for C++ Professionals (Bangalore)	4	Aug 17, 2007
C language now truly universal	0	Jan 1, 2011
Seek Contract Programming Work - 17 Years Experience	0	Feb 22, 2005
Fundamentals of Financial Management Concise 7e Brigham Houston	0	May 1, 2011

C++ programming challenge

Ioannis Vranos

Ioannis Vranos

Ioannis Vranos

Chris M. Thomasson

Chris M. Thomasson

Andrew Tomazos

giannis t

Chris M. Thomasson

giannis t

Chris M. Thomasson

James Kanze

James Kanze

Juha Nieminen

Chris M. Thomasson

Chris M. Thomasson

Chris M. Thomasson

Bo Persson

Ioannis Vranos

Ioannis Vranos

Ioannis Vranos

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads