C++ programming challenge

C

Chris M. Thomasson

Hendrik Schober said:
Chris said:
[...]
I only care about the platforms in which it works.

And what about customers?

You know of companies that are interested in the programs that this contest
has spawned?



Do you also only care about
customers for which it works? (I'd starve my kids doing
it this way.)

I do not foresee anybody purchasing the little program I created for this
contest.
 
J

Jerry Coffin

[ ... ]
Here's a simple solution that reads from stdin and doesn't do any timing

Simple, but wrong. The percentages it computes will (at least usually)
be incorrect.

[ ... ]
while ( (c = getchar()) != EOF )
{
f [c]++;
size++;

At least as I read the proposal, only letters should be counted toward
the total.
 
J

Jerry Coffin

(e-mail address removed)>, (e-mail address removed)
says...
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

Yet another bit of code:

#include <iostream>
#include <fstream>
#include <vector>
#include <numeric>
#include <iomanip>

class counter {
std::vector<double> counts;
long total;
public:
static const int num = 'z'-'a'+1;

counter() : counts(num), total(0) {}

counter operator+(char c) {
char ch = tolower(c);
if (isalpha(ch)) {
++total;
++counts[ch-'a'];
}
return *this;
}

friend std::eek:stream &operator<<(std::eek:stream &os, counter const &c)
{
for (int i=0; i<num; ++i)
if (isalpha('a'+i))
// The single most complex part of the program is getting the
// output formatted as your web site shows it!
os << char('a'+ i) << " "
<< std::fixed
<< std::setw(6)
<< std::setprecision(3)
<< c.counts/c.total*100.0 << "%\n";
return os;
}
};

int main(int argc, char **argv) {
std::ifstream input(argv[1]);

counter &counts = std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());

std::cout << counts;
return 0;
}

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously) don't really
think so, but I'll admit that especially in the latter case there's room
for question.

I don't have the right sort of machine handy to test it with, but I
believe this should work with something like EBCDIC. Making it work
correctly with something like almost any Unicode encoding would take
substantial modifications.
 
C

Chris M. Thomasson

Jonathan Lee said:
If this has already been brought up then please ignore me, but how are
you ensuring the programs are reading from disk? For example, the
current program listed as fastest on your blog clocks in at .13s. So I
downloaded the source, downloaded the text file and unzipped it, and
ran it. Result: 0.17s. Fine, but I notice that my hard disk light
doesn't flash. So I restart the computer and run it again. Result:
6.1s.

I suspect that some of the times you've posted may be benefiting from
the fact that the file is in memory, and others are genuinely reading
it from disk.

It might be interesting to see a posted result that executed on a system in
which portions of the target file were not in memory. Then, run the program
10 or 20 times; take the average time (e.g., time command) and post that
result as well.
 
J

Jerry Coffin

[ ... ]
counter operator+(char c) {

Oops, that should be:

counter &operator+(char c) {

Given the degree to which it's I/O bound, it probably doesn't make any
real difference, but at least in theory this should improve speed a bit.
 
C

Chris M. Thomasson

Jerry Coffin said:
(e-mail address removed)>, (e-mail address removed)
says...
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here: [...]

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously) don't really
think so, but I'll admit that especially in the latter case there's room
for question.

I don't have the right sort of machine handy to test it with, but I
believe this should work with something like EBCDIC.

Well, imagine a contrived character set in which the following:

static const int num = 'z'-'a'+1;

produced a negative number?



Making it work
correctly with something like almost any Unicode encoding would take
substantial modifications.

Indeed.
 
J

Jerry Coffin

"Jerry Coffin" <[email protected]> wrote in message

[ ... ]
Well, imagine a contrived character set in which the following:

static const int num = 'z'-'a'+1;

produced a negative number?

Yes, it's possible to theorize an 8-bit character set for which it
wouldn't work -- but in reality, I'm pretty sure there's never been one
that would cause a problem. I hardly think it's worth worrying a great
deal about when there are so many real character sets that would cause
problems.

OTOH, a fair number of the other solutions that have been posted have
been far more dependent on the details of encoding for considerably less
benefit.
 
I

Ioannis Vranos

Jerry said:
[ ... ]
counter operator+(char c) {

Oops, that should be:

counter &operator+(char c) {

Given the degree to which it's I/O bound, it probably doesn't make any
real difference, but at least in theory this should improve speed a bit.


Your code does not compile:


john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
main.cc: In function ‘int main(int, char**)’:
main.cc:43: error: ‘istream_iterator’ is not a member of ‘std’
main.cc:43: error: expected primary-expression before ‘char’
main.cc:44: error: ‘istream_iterator’ is not a member of ‘std’
main.cc:44: error: expected primary-expression before ‘char’
john@john-laptop:~/Projects/anjuta/cpp/src$





--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
C

Chris M. Thomasson

Peter Jansson said:
I agree. Do you know how to remove the target file from memory before
executing the program?

Reboot should do it, or read a huge file, before each subsequent test.
 
C

Chris M. Thomasson

Ioannis Vranos said:
Opening the text file in binary mode is ==> definitely a mistake, so I
think you should remove all such answers from the valid answers.

Even programs which allow one to customize the mode via pre-defined macros?
 
A

Alf P. Steinbach

* Ioannis Vranos:
Opening the text file in binary mode is ==> definitely a mistake, so I
think you should remove all such answers from the valid answers.

In principle you may be right, because binary mode won't necessarily work on a
system where a file is not simply a sequence of bytes.

But in practice, for systems like Windows and *nix, binary mode just works, and
for systems like Windows it's required to avoid time-consuming conversions of
e.g. newlines on some systems.

Since the programs are not counting lines binary mode has no other impact.


Cheers,

- Alf
 
C

Chris M. Thomasson

Chris M. Thomasson said:
Reboot should do it, or read a huge file, before each subsequent test.

Or perhaps create an individual copy of the input datum on a per-test basis,
reboot, and recomputed.



Anyway, I think the posted timing results are more realistic/consistent now.
However, it would be fairly interesting, IMVHO or course, to see times that
represent the average time a program consumes across 20+ runs... Forget
awesome cached results, execute program numerous times and give average.
IMVVHO, that would seem to be "fair" enough.

Forget individual timing codes. Perhaps you should strip all timing code out
of each entry and blast them all with the UNIX time command?

Perhaps execute a phantom program to get file in cache before executing the
"first" contest submission...
 
C

Chris M. Thomasson

Chris M. Thomasson said:
Or perhaps create an individual copy of the input datum on a per-test
basis, reboot, and recomputed.

Perhaps you should increase the target file size by a couple of gigs. Say,
50 gigs+? Then programs should be hitting hard drive sooner or later...
 
C

Chris M. Thomasson

Jerry Coffin said:
"Jerry Coffin" <[email protected]> wrote in message

[ ... ]
Well, imagine a contrived character set in which the following:

static const int num = 'z'-'a'+1;

produced a negative number?

Yes, it's possible to theorize an 8-bit character set for which it
wouldn't work -- but in reality, I'm pretty sure there's never been one
that would cause a problem. I hardly think it's worth worrying a great
deal about when there are so many real character sets that would cause
problems.

100% Agreed.




OTOH, a fair number of the other solutions that have been posted have
been far more dependent on the details of encoding for considerably less
benefit.

YES! ;^o


Slapping self!



Ouch...
 
I

Ioannis Vranos

Peter said:
I will keep the binary mode programs in the list since it interesting to
see if it makes any speed difference compared to the non-binary mode
programs.

Sincerely,
Peter Jansson


Data stored in binary files, are implementation-dependent. That is, if you store something in a binary file,
and then you read it as it is in another platform (e.g. with the same program recompiled either with a
different compiler and/or at a different OS), there is high probability that the binary file will be incompatible.

Text files on the other hand are portable.


Opening a text file with binary mode has two possibilities. Either binary mode is implemented as text mode and
the program works, or you are reading trash.


That's why we have "text mode" and "binary mode". So, opening the text file in binary mode is clearly a mistake.



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
J

Jerry Coffin

[ ... ]
I had to #include <iterator>

Yup -- should have been there, though it happened to work as it was with
the compiler I tested with.
and also changed in your main a little bit:

Perhaps a slightly larger change would be worthwhile:

int main(int argc, char **argv) {
static char buffer[20*1024*1024];

std::ifstream input(argv[1]);
input.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
input.sync_with_stdio(false);

std::cout << std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());
return 0;
}

It's hard to guess exactly how much difference those changes will make,
but they might be significant. Of course, it's all a bit pointless -- if
you're really processing large files often enough to care about speed,
some non-portable code will usually make a much larger difference than
anything like this.
 
J

Jerry Coffin

Given that the problem is largely I/O bound, there's another possibility
for reading the data into memory quickly that should probably be tested.
Try replacing the main from my previous program with the following:

int main(int argc, char **argv) {
std::ifstream input(argv[1]);
std::stringstream data;

data << input.rdbuf();
std::string const &d=data.str();

std::cout << std::accumulate(d.begin(), d.end(), counter());
return 0;
}

To use this, you'll have to add:

#include <sstream>
#include <string>

up with the rest of the #include's, but the rest of the program remains
the same. At least on my computer, using the compilers I have handy,
this has a substantial performance advantage. It's hard to guess what
it'll do on your target though -- if your computer doesn't have enough
free memory to hold the entire input file, it could cause thrashing,
leading to truly awful performance.
 
I

Ioannis Vranos

Peter said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

With kind regards,
Peter Jansson


This is my final, corrected ISO C++ version. Please erase all my previous submissions from your site, and
include only this:


#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>


int main(int argc, char **argv)
{
using namespace std;


// Warning: long double has problems with MINGW compiler for Windows.
//
// The C++ basic character set is using the value range [0, 127].
// If we used vector<long double>, it would not have any run-time difference in any modern compiler.
valarray<long double> characterFrequencies(128);


// The array where the read characters will be stored.
char buffer[4* BUFSIZ]= {};


// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}




// We disable synchronisation with stdio, to speed up C++ standard I/O.
ios_base::sync_with_stdio(false);

cout<< fixed;


string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;



// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);


// An error happened
if(not inputFile)
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}


do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer ];

}while(inputFile);



// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<< "\n\n\nThe letter frequencies are:\n";


long double totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ];


for(string::size_type i= 0; i< characters.size(); ++i)
cout<< characters<< ": "<< (characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ])/ totalcharacterFrequencies* 100<< "%\n";


// We "stop" timing.
time2= clock();


// We convert the timing to seconds.
double totalTimeInSeconds= static_cast<double>(time2- time1)/ CLOCKS_PER_SEC;


cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
I

Ioannis Vranos

Corrected: Please include this one, and NOT the previous one.



#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>


int main(int argc, char **argv)
{
using namespace std;


// The C++ basic character set is using the value range [0, 127].
// If we used vector<unsigned long>, it would not have any run-time difference in any modern compiler.
valarray<unsigned long> characterFrequencies(128);


// The array where the read characters will be stored.
char buffer[4* BUFSIZ]= {};


// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}




// We disable synchronisation with stdio, to speed up C++ standard I/O.
ios_base::sync_with_stdio(false);

cout<< fixed;


string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;



// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);


// An error happened
if(not inputFile)
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}


do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer ];

}while(inputFile);



// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<< "\n\n\nThe letter frequencies are:\n";


unsigned long totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ];


for(string::size_type i= 0; i< characters.size(); ++i)
cout<< characters<< ": "<< static_cast<double>(characterFrequencies[ characters ]+
characterFrequencies[
tolower(characters) ])/ totalcharacterFrequencies* 100<< "%\n";


// We "stop" timing.
time2= clock();


// We convert the timing to seconds.
double totalTimeInSeconds= static_cast<double>(time2- time1)/ CLOCKS_PER_SEC;


cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}




--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,159
Messages
2,570,881
Members
47,418
Latest member
NoellaXku

Latest Threads

Top