C++ programming challenge

I

Ioannis Vranos

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all platforms.



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
I

Ioannis Vranos

Chris said:
Chris M. Thomasson said:
Chris M. Thomasson said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html


Here is a stupid simple solution; needs ASCII:
[...]

Here, let me repost this crappy little thing using a bigger buffer,
without using any asserts and calls to printf during the actual
processing phase:
_________________________________________________________________________
#include <stdio.h>


#define BUFFER 65536U


static
unsigned long
count_chars(
void const* buf,
size_t size,
unsigned long* counts
) {
unsigned char const* cur = buf;
unsigned char const* const end = cur + size;
unsigned long total = 0;
while (cur < end) {
if (*cur >= 'a' && *cur < 'a' + 26) {
++total;
++counts[*cur - 'a'];
} else if (*cur >= 'A' && *cur < 'A' + 26) {
++total;
++counts[*cur - 'A'];
}
++cur;
}
return total;
}


int main(void) {
size_t i;
FILE* file;
unsigned long total = 0;
unsigned long counts[26] = { 0 };

file = fopen("./data.txt", "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
total += count_chars(buf, size, counts);
if (feof(file) || ferror(file)) break;
}

fclose(file);
}

for (i = 0; i < 26; ++i) {
printf("%c %%%f\n", i + 'a',
(total) ? ((double)counts / total) * 100 : 0.0);
}

return 0;
}
_________________________________________________________________________



It doesn't compile:


john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
main.cc: In function ‘long unsigned int count_chars(const void*, size_t, long unsigned int*)’:
main.cc:14: error: invalid conversion from ‘const void*’ to ‘const unsigned char*’
main.cc: In function ‘int main()’:
main.cc:53: warning: format ‘%c’ expects type ‘int’, but argument 2 has type ‘long unsigned int’
john@john-laptop:~/Projects/anjuta/cpp/src$



Also binary reading isn't the same with text reading in all platforms (that is, the program will not work).


--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
C

Chris M. Thomasson

Ioannis Vranos said:
Chris said:
Chris M. Thomasson said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Here is a stupid simple solution; needs ASCII:
[...]

Here, let me repost this crappy little thing using a bigger buffer,
without using any asserts and calls to printf during the actual
processing phase:
_________________________________________________________________________
[...]
_________________________________________________________________________


It doesn't compile:


john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
main.cc: In function ‘long unsigned int count_chars(const void*, size_t,
long unsigned int*)’:
main.cc:14: error: invalid conversion from ‘const void*’ to ‘const
unsigned char*’
main.cc: In function ‘int main()’:
main.cc:53: warning: format ‘%c’ expects type ‘int’, but argument 2 has
type ‘long unsigned int’
john@john-laptop:~/Projects/anjuta/cpp/src$


Its a C program; use gcc.



Also binary reading isn't the same with text reading in all platforms
(that is, the program > will not work).

I don't see a real need to open for text reading. Oh well, shi% happens.
 
C

Chris M. Thomasson

Ioannis Vranos said:
Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)
 
A

Andrew Tomazos

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

With kind regards,
Peter Jansson

If you like these sort of mini-program competitions go and take a look
at the algorithm arena at TopCoder <http://www.topcoder.com>.

I warn you - you have to be prepared to have your ass handed to you by
some prepubescent Russian wunderkind. If you want your programming
ego intact *stay away* from that place. :)
-Andrew.
 
G

giannis t

giannis said:
check this out:
#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>
const int MAX_TEXT_LENGTH = 8192;
int freq[256];
double pctFreq[256];
int main(int argc, char** argv) {
    for (int i = 0; i < 256; i++) {
        freq = 0;
        pctFreq = 0;
    }
    unsigned char text[MAX_TEXT_LENGTH];
    int count = 0;
    char c;
    int diff = 'a' - 'A';
    std::ios_base::sync_with_stdio(false);
    std::ifstream file(argv[1], std::ios::binary);
    for (; file.read((char*) text, MAX_TEXT_LENGTH);) {
        for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
            c = text;
            if ((c >= 'A') && (c <= 'Z')) {
                ++freq[c];
                count++;
            } else if ((c >= 'a') && (c <= 'z')) {
                ++freq[c - diff];
                count++;
            }

        }
    }
    file.close();
    for (int i = ('A'); i <= 'Z'; i++) {
        pctFreq = round((double(freq) / count)*10000) / 100;
        std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq << "%\n";
    }
    return (EXIT_SUCCESS);
}


Reading a text file as binary, is not guaranteed to work. It just happens that binary reading is implemented
as text reading in your platform.

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net


If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it works. Now
a newer approach taking word alignment in mind.

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>
#include <ctime>

const int MAX_TEXT_LENGTH = 65536;

int freq[256];
double pctFreq[256];
int count = 0;

inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}

int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
clock_t time1, time2;
time1 = clock();
std::ifstream file(argv[1], std::ios::in);
//std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
checkRange(text);
}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq = round((double(freq + freq[i + diff]) / count)
*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq << "%\n";
}
time2 = clock();
double totalTimeInSeconds = (time2 - time1) / CLOCKS_PER_SEC;
std::cout << "\n\nThe whole process took " << std::fixed <<
totalTimeInSeconds << " seconds.\n";
return (0);
}
 
C

Chris M. Thomasson

Chris M. Thomasson said:
Peter Jansson said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Well, try this one out too:
_______________________________________________________________________________
#include <stdio.h>
#include <limits.h>


#define BUFFER 65536U


int main(int argc, char** argv) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
FILE* file = fopen(argv[1], "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {

while (size) {
--size;
++counts[buf[size]];
}

if (size < BUFFER) {
if (feof(file) || ferror(file)) break;
}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

lol. How is the that last if block ever going to executed!
[...]
_______________________________________________________________________________



It just might beat my previous submission by a little bit...




This one actually tries to check for errors:
_______________________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>


#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif


typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];


#define BUFFER 65536U


#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)


int main(int argc, char** argv) {
if (argc > 1) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
for (i = 0; i < 26; ++i) {
total += GET_COUNT(counts, i);
}

for (i = 0; i < 26; ++i) {
printf("%c %%%.3f\n", (char)(i + 'a'),
(total) ? ((double)GET_COUNT(counts, i) / total) * 100.0 : 0.0);
}

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
_______________________________________________________________________________




GRRRR!
 
G

giannis t

This is the best I have:

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>
#include "CStopWatch.hpp"

const int MAX_TEXT_LENGTH = 65536;

int freq[256];
double pctFreq[256];
int count = 0;

int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
CStopWatch s;
s.startTimer();
//std::ifstream file(argv[1], std::ios::in);
std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
++freq[text];
}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
freq += freq[i + diff];
count += freq;
}
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq = round((double(freq) / count)*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq << "%\n";
}
s.stopTimer();
std::cout << "Time [microseconds]: " << s.getElapsedTime()*1.0E6
<< '\n';
return (EXIT_SUCCESS);
}
 
C

Chris M. Thomasson

Here is my version with some timing code (e.g., `clock()'):
________________________________________________________________
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>


#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif


typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];


#define BUFFER 65536U


#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)


int main(int argc, char** argv) {
if (argc > 1) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
clock_t const start = clock();
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
clock_t stop;

for (i = 0; i < 26; ++i) {
total += GET_COUNT(counts, i);
}

for (i = 0; i < 26; ++i) {
printf("%c %%%.3f\n", (char)(i + 'a'),
(total) ? ((double)GET_COUNT(counts, i) / total)
* 100.0 : 0.0);
}

stop = clock();

printf("\nelapsed time: %lums\n",
(unsigned long)((((double)stop - (double)start)
/ CLOCKS_PER_SEC) * 1000.0));

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
________________________________________________________________
 
J

James Kanze

Peter Jansson wrote:
[...]
This seems silly. Either you have an array integers and
increment based on the incoming character (read in large
chunks for speed), then later combine the upper- and
lower-case letters, or simply print out a pre-computed list of
frequencies, as the rules state that a known file will be fed
to the program. In the former (non-cheating) case, the
program's run-time will be based almost entirely on I/O speed,
and thus simply a measure of the method used to read the file.
In C++, there are only two ways: iostreams and C-style FILEs.

And any differences in speed between iostreams and FILEs will
depend on the library implementation, not my code.

On the other hand, he did say that the actual test would be run
on Ubantu, a flavor of Linux. So the fastest solution is
probably to use mmap.

But as you say, the test is silly. The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)
 
J

James Kanze

Peter Jansson wrote:
[...]

Here's a simple solution that reads from stdin and doesn't do
any timing (for that, just run with the Unix "time" command).
I see no reason to complicate it with iostreams.

How would iostream represent any additional complication? I
think the code would actually be slightly simpler.

If speed were really a problem, I'd experiment with mmap.
#include <limits.h>
#include <stdio.h>
#include <ctype.h>
int main()
{
size_t f [UCHAR_MAX+1] = { 0 };
size_t size = 0;
int c;

while ( (c = getchar()) != EOF )

With iostream, these two lines would be:

char c ;
while ( std::cin.get( c ) )

That looks even simpler than what you've written.
 
J

Juha Nieminen

James said:
The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

I have yet to see a compiler/system where C++ streams are actually
faster then C streams. Thus it's pretty safe to say that if you want
good I/O speed, using C streams will give you an edge in most, if not
all systems. Even if there was a system where C++ streams are as fast as
C streams, you wouldn't be losing anything in speed even there.

(Of course a different issue is that with C streams your code will
become more complicated and error-prone.)
 
C

Chris M. Thomasson

James Kanze said:
(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)

Well, then perhaps one can attempt to create a local table and map in
unsigned shorts/UNICODE values directly, or whatever as concrete indexes
into said table? I assume I can create sufficient sized table to map some
UNICODE character values directly into indexes of a local counter array.
Well, perhaps of course...

:^)
 
C

Chris M. Thomasson

Juha Nieminen said:
James said:
The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

[...]

(Of course a different issue is that with C streams your code will
become more complicated and error-prone.)

Perhaps... ;^o


Yikes!!!!!!!!!!!!
 
C

Chris M. Thomasson

Chris M. Thomasson said:
James Kanze said:
(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)

Well, then perhaps one can attempt to create a local table and map in
unsigned shorts/UNICODE values directly, or whatever as concrete indexes
into said table? I assume I can create sufficient sized table to map some
UNICODE character values directly into indexes of a local counter array.
Well, perhaps of course...

Assume NO hash collisions...
 
I

Ioannis Vranos

Chris said:
Ioannis Vranos said:
Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)


OK, I have no problem if you can't provide a better solution, after all it is just a challenge. When I say it
will not work, I mean you will get inaccurate results, and there is a chance that the program already does not
work in your platform, that is there is a chance you are already getting inacaccurate results.


Second, are you the same person with the poster I replied ([email protected])?


--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
I

Ioannis Vranos

Peter said:
Dear news group,

I have now included all programs submitted so far in the challenge. 10
programs up to now, with 9 of them written in this group thread. You
can view the most recent results here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html


Sincerely,
Peter Jansson


There is an error in your blog. Code named "Ioannis A. Vranos, Code 3 (C++): 2 s" isn't mine.


Also the "Ioannis A. Vranos, Code 4 (C++): 1 s"

is my original code improved (you may say corrected),


so you can erase the original code "Ioannis A. Vranos, Code 2 (C++): 13 s" and "Ioannis A. Vranos, Code 1
(C++): 13 s" (both of which were essentially the same and thus were in the same message).



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
I

Ioannis Vranos

Corrected:

==> There is an error in your blog. Code named "Ioannis A. Vranos, Code 4
==> (C++): 2 s" isn't mine.==> Also the "Ioannis A. Vranos, Code 3 (C++): 2 s"==> is my original code improved (you may say corrected),
so you can erase the original code "Ioannis A. Vranos, Code 2 (C++): 13
s" and "Ioannis A. Vranos, Code 1 (C++): 13 s" (both of which were
essentially the same and thus were in the same message).


--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,159
Messages
2,570,886
Members
47,419
Latest member
ArturoBres

Latest Threads

Top