C++ programming challenge


Chris M. Thomasson

Ioannis Vranos said:
Chris said:
Ioannis Vranos said:
(e-mail address removed) wrote:

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)


OK, I have no problem if you can't provide a better solution; after all, it
is just a challenge. When I say it will not work, I mean you will get
inaccurate results, and there is a chance that the program already does
not work on your platform; that is, there is a chance you are already
getting inaccurate results.

I only care about the platforms in which it works.



Second, are you the same person with the poster I replied
([email protected])?

No. I post with email address of ([email protected]).

What makes you think that? I post a lot over in `comp.programming.threads'.
 

Ioannis Vranos

Chris said:
Ioannis Vranos said:
Chris said:
(e-mail address removed) wrote:

Actually, just changing the line :
ifstream inputFile(argv[argc -1]);
with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)


Binary mode isn't the same as text mode (it will not work) in all
platforms.

I only care about the platforms in which it does work. ;^)


OK, I have no problem if you can't provide a better solution; after
all, it is just a challenge. When I say it will not work, I mean you
will get inaccurate results, and there is a chance that the program
already does not work on your platform; that is, there is a chance you
are already getting inaccurate results.

I only care about the platforms in which it works.



Second, are you the same person with the poster I replied
([email protected])?

No. I post with email address of ([email protected]).

What makes you think that? I post a lot over in `comp.programming.threads'.


Because you replied to a message of mine addressed to (e-mail address removed), as you can also see from
the quotations.


--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Alf P. Steinbach

* Peter Jansson:
My mistake, have fixed it.

I made 2 small changes to Ioannis' program.

It sort of sped up... ;-)


<code>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
#  define FILE_MODE "rb"
#endif

typedef char sassert[
    'Z' - 'A' == 25 &&
    'z' - 'a' == 25 ? 1 : -1
];

//#define BUFFER 65536U
#define BUFFER 1025*1024U // !Alf

#define GET_COUNT(mp_self, mp_index) ( \
    (mp_self)[(mp_index) + 'a'] + \
    (mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {
    if (argc > 1) {
        size_t i;
        unsigned long total = 0;
        unsigned long counts[UCHAR_MAX + 1] = { 0 };
        clock_t const start = clock();
        FILE* file = fopen(argv[1], FILE_MODE);

        if (file) {
            size_t size;
            int status = EXIT_SUCCESS;
            static unsigned char buf[BUFFER];

            setvbuf(file, 0, _IONBF, 0); // !Alf (moved after the NULL check)

            while ((size = fread(buf, 1, BUFFER, file))) {
                size_t tmpsize = size;

                while (tmpsize) {
                    ++counts[buf[--tmpsize]];
                }

                if (size < BUFFER) break;
            }

            if (ferror(file)) status = EXIT_FAILURE;
            if (fclose(file)) status = EXIT_FAILURE;

            if (status == EXIT_SUCCESS) {
                clock_t stop;

                for (i = 0; i < 26; ++i) {
                    total += GET_COUNT(counts, i);
                }

                for (i = 0; i < 26; ++i) {
                    printf("%c %%%.3f\n", (char)(i + 'a'),
                        (total) ? ((double)GET_COUNT(counts, i) / total)
                            * 100.0 : 0.0);
                }

                stop = clock();

                printf("\nelapsed time: %lums\n",
                    (unsigned long)((((double)stop - (double)start)
                        / CLOCKS_PER_SEC) * 1000.0));

                return EXIT_SUCCESS;
            }
        }
    }

    fprintf(stderr, "file error...");

    assert(0);

    return EXIT_FAILURE;
}
</code>



Cheers & hth.,

- Alf

PS: It's possible to optimize even more but what's the point?
 

Alf P. Steinbach

* Ioannis Vranos:
Either you are joking, which is OK, or you made a mistake, because this
has no relevance to my code.

Oh, sorry, perhaps it was someone else's code.

Anyways, it was the code at the top of the website, or perhaps second: I aimed my
mouse at "Ioannis" there but may have missed.

The modified code is about four hundred times faster, near as I can tell. ;-)


Cheers & hth.,

- Alf
 

James Kanze

news:a6dd6d1e-2ebd-4d93-8495-7945ecc3ad59@s16g2000vbp.googlegroups.com...
[...]
(Also, the proposal says nothing about the encoding of the
input file. If he's running under Ubuntu, the default
encoding is UTF-8, and properly handling case folding for
UTF-8 could be significantly more time consuming than a
naïve implementation which ignores the reality of accented
characters.)
Well, then perhaps one can attempt to create a local table and
map in unsigned shorts/UNICODE values directly, or whatever as
concrete indexes into said table?

How to fold UTF-8 input rapidly would be the challenge. There
are over a million code points (and UTF-8 actually supports over
4 billion), so a Unicode character doesn't fit into a short on
most machines, and the tables would be very, very big. And very
sparse, since very few characters (proportionally) actually have
a case difference.
I assume I can create a sufficiently sized table to map some
Unicode character values directly into indexes of a local
counter array. Well, perhaps, of course...

The approach I've taken, in such cases, is to use a trie,
starting with the first byte, and progressing byte by byte.
Whether this is optimal or not, I don't know; it's probably
optimal for the case where encodings greater than 127 are rare,
which is my case, and less optimal otherwise. The more
classical implementation, I think, is to convert the UTF-8 to
UTF-32, and use a two-stage table (with default values, since in
many cases, there can be a good default value covering most
entries). In both cases, implementing something like islower or
toupper can require quite a bit of memory, however, even if only
a small subset of the character codes are involved. (None of
the larger scripts, like Chinese, have case.)
 

James Kanze

On Jun 11, 3:20 am, Ioannis Vranos <[email protected]> wrote:

Not to pick on your code particularly, but in general, in the
context of the challenge, what is done to ensure that the code
is legal and portable C++? That it works correctly with
multibyte characters, or EBCDIC, or with all of the more exotic
implementations?

It may be perfectly valid to not require all of this, but if so,
the acceptable restrictions should be stated up front.
If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it
works.

He's testing on a Linux machine. The default encoding under
Linux is UTF-8, a multibyte encoding. Unless otherwise stated,
it seems safe to assume that code which doesn't handle multibyte
encodings correctly is incorrect.
Now a newer approach, taking word alignment into account.
const int MAX_TEXT_LENGTH = 65536;

Either the name is a lie, or the constraint is unreasonable.
Files of several Gigabytes are frequent today, and I've seen
some which are in the Terabyte range.
int freq[256];

This is actually a count. And on most implementations, an int
will overflow long before you reach the end of a large file.
double pctFreq[256];
int count = 0;
inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {

Always false with EBCDIC ('A' is 193, and 'z' 169).
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}
int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);

And formally, this isn't guaranteed to set the values in pctFreq
to 0.0. This is actually what got me wondering to begin with:
reading a double initialized in this way is formally undefined
behavior, so your program doesn't work. Except that I've
never heard of an implementation where it won't actually work.
If the requirement is "portable C++", then he must provide some
sort of filter which will detect this error, and reject your
program because of it. If the requirement isn't portable C++,
of course, then on Ubuntu, I'd use mmap to read the file. And
if he's willing to restrict the input to the single-byte
encodings of UTF-8, something like blargg's suggestion, combined
with mmap, should be close to unbeatable. Except that if he has
a multicore machine, it might be worth creating some additional
threads (or even forking new processes), each of which handles a
contiguous block of data in the file, and then merging the
results. (Using different processes would even allow using mmap
for files which don't fit into memory.)

Until we know exactly what the requirements are, it's impossible
to write a program and know whether it fulfills these
requirements.

(FWIW: with any reasonably good implementation, I would expect
using std::fill to be faster than memset. As well as working
for all types, and not just for integral types.)
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
clock_t time1, time2;
time1 = clock();
std::ifstream file(argv[1], std::ios::in);
//std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
checkRange(text[i]);


Why not just:
++ freq[ text[ i ] ] ;
here, and do the case folding once, before the output.
(Blargg's suggestion, but the obvious "rapid" solution.)
}
} while (!file.eof());

Almost. (I'm actually impressed that someone got this close to
correct.) If there's a hardware read error, you've got an
endless loop. Just use "while ( file )" as the condition.
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq[i] = round((double(freq[i] + freq[i + diff]) / count)
*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
}
time2 = clock();
double totalTimeInSeconds = (time2 - time1) / CLOCKS_PER_SEC;
std::cout << "\n\nThe whole process took " << std::fixed <<
totalTimeInSeconds << " seconds.\n";
return (0);
}
 

Thomas J. Gritzan

Since this is comp.lang.c++, I will post a high-level C++ solution.

What's high-level about this solution?
do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


Using istream::eof or ::feof() as loop-condition is always wrong.
It might be that the program counts the last block twice, or that it
will be stuck in an infinite loop when the read fails:

http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5
 

Ioannis Vranos

Thomas said:
What's high-level about this solution?


That it uses high-level standard library facilities (std::string, std::vector or valarray, std::ifstream,
etc.), instead of practically reimplementing existing high-level standard library functionality in the code,
using low-level facilities like built-in arrays, getchar(), UCHAR_MAX, etc.


do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


Using istream::eof or ::feof() as loop-condition is always wrong.
It might be that the program counts the last block twice, or that it
will be stuck in an infinite loop when the read fails:

http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5



About the above code:

The worst case is when a read consumes the last character and the next read reads no characters. In this
case, gcount() will be zero, so there will be no side-effects.


The link you provided talks about a while construct, and not a do-while construct. In the do-while case, the
EOF will have already been reached.


That is, the FAQ mentions:


---------- From the FAQ ----------


For example, the following code might have an off-by-one error with the count i:

int i = 0;
while (! std::cin.eof()) { // WRONG! (not reliable)
std::cin >> x;
++i;
// Work with x ...
}

What you really need is:

int i = 0;
while (std::cin >> x) { // RIGHT! (reliable)
++i;
// Work with x ...
}

----------------------------------



If you convert it to do-while the first becomes:


int i = 0;

do
{
std::cin >> x;
++i;
// Work with x ...
}while (! std::cin.eof());


which catches the EOF, since the reading takes place before the condition.



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Ioannis Vranos

blargg said:
[...]
#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)
[...]

Why do you keep posting non-portable programs that assume ASCII?
toupper/tolower are your friends; use them! If you think they pose a
performance problem, you're doing the algorithm wrong, as you should only
combine lower and uppercase counts AFTER you've counted frequencies, not
during.


As a hint to Chris M. Thomasson:


We know that the basic character set is guaranteed to be implemented in the range [0, 127].


Counting all the chars read in the style


// The read characters are stored in the char array buffer.
char buffer[BUFSIZ]= {};

vector<unsigned long> charFrequencies(128);


// We read and store the characters in the char array buffer.
//
// sizeof(buffer)== BUFSIZ.
//
for(size_t i= 0; i< sizeof(buffer); ++i)
charFrequencies[ buffer[i] ]++;


will work for all characters independently of the implementation of the *basic character set* (ASCII, EBCDIC,
whatever).





--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Thomas J. Gritzan

Ioannis said:
Thomas said:
Ioannis said:
do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


By the way, this "not" instead of "!" is horrible. I would write:

while (!inputFile.eof());

Which is still wrong because of eof. Correct is:

while (inputFile);
About the above code:

The worst case is when a read consumes the last character and the next
read reads no characters. In this case, gcount() will be zero, so
there will be no side-effects.

No. The worst case is when the stream sets failbit or badbit, because of
a hardware or read error, as James mentions elsethread.
In this case, the loop will never stop.
The link you provided talks about a while construct, and not a do-while
construct. In the do-while case, the EOF will have already been reached. [...]
If you convert it to do-while the first becomes:


int i = 0;

do
{
std::cin >> x;
++i;
// Work with x ...
}while (! std::cin.eof());

Assuming x equals i... (should we tell the author?)
which catches the EOF, since the reading takes place before the condition.

If the read fails because of eof, x (or i) will contain garbage (like
the previous value) and this value will be evaluated (again).

If the input stream doesn't contain a number, the read fails and failbit
is set but not eofbit, so the loop will run forever.

So the rule is: Never use eof as loop condition.
 

Andrew Tomazos

How to fold UTF-8 input rapidly would be the challenge. There
are over a million code points (and UTF-8 actually supports over
4 billion), so a Unicode character doesn't fit into a short on
most machines, and the tables would be very, very big. And very
sparse, since very few characters (proportionally) actually have
a case difference.

UTF-8 only supports about 1 million code points; 5- and 6-byte
sequences are explicitly illegal.
-Andrew.
 

Andrew Tomazos

He's testing on a Linux machine.  The default encoding under
Linux is UTF-8, a multibyte encoding.  Unless otherwise stated,
it seems safe to assume that code which doesn't handle multibyte
encodings correctly is incorrect.

Who defined that the "default encoding under Linux is UTF-8" ? In
what standard is this specified? Or where did you hear it?
-Andrew.
 

Alf P. Steinbach

* Peter Jansson:
Alf, I have included your modified program as well in the comparison. It
is now on the weblog page.

Uhm, I'm sorry to report that I was led astray by two Windows "features":

1. In Windows, 'clock', used in that program to measure the time, reports
wall time, not process time.

2. For some obscure reason Windows XP, or perhaps the anti-virus, or
whatever, causes a process to run extra slowly the first few times. In
this case I ran the original program twice with a certain medium-size
text file; then with the modification it ran 400 times faster!, which I could
not believe was an arbitrary effect. But now testing it, with a much
larger file, it is as you reported: about the same time.

Sincerely,
Peter Jansson

PS: How would you optimize it even more?

On my machine, Windows XP,

#define BUFFER 16*1024U

for the same program text that you have, yields the best times (statistically).

You might want to put that also in your list. :)

I don't have a BitTorrent client so couldn't obtain your file, but with a half-
gigabyte text file I constructed from a dictionary text, the processing is about
1/27 of the I/O time. So a marginal improvement, on that order, could be to read
e.g. four bytes (a long) at a time from the buffer to reduce memory accesses.

Just to be thorough I tested using the lower level Posix standard open etc., but
it turned out that it was not significantly faster for this than FILE*.

A less insignificant improvement could be to use a memory mapped file, because
that could conceivably leverage some OS virtual memory optimization. However,
while the Boost library does have some functionality for mapping a file to
memory, it beats me how to use that. It's not exactly well-documented.


Cheers & hth.,

- Alf (victim of Windows (the Eagles' newest greatest hit!))
 

Ioannis Vranos

Thomas said:
Ioannis said:
Thomas said:
Ioannis Vranos schrieb:
do
{
inputFile.read(buffer, sizeof(buffer));


for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());


By the way, this "not" instead of "!" is horrible. I would write:



Horrible? "not" is a *built-in* keyword, like "and" and "or". They are more readable.





No. The worst case is when the stream sets failbit or badbit, because of
a hardware or read error, as James mentions elsethread.
In this case, the loop will never stop.


My code doesn't take into account hardware failures. Also, if I am not wrong, while(inputFile) does not check
for EOF, and neither does while(cin >> something), which the FAQ uses.




--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Chris M. Thomasson

Alf P. Steinbach said:
* Ioannis Vranos:

Oh, sorry, perhaps it was someone else's code.

It's my code that I posted here:

http://groups.google.com/group/comp.lang.c++/msg/2f6fd68d71c477aa



Anyways, it was the code at top at the website, or perhaps second: I aimed
my mouse at "Ioannis" there but may have missed.

The modified code is about four hundred times faster, near as I can tell.
;-)

Four hundred times faster? Well, unfortunately I am not seeing the benefits
on my platform (old P4 HyperThreaded, 3.06 GHz, 512 MB RAM). It actually goes
slower.
 

Chris M. Thomasson

blargg said:
Ioannis Vranos wrote:
[...]
We know that the basic character set is guaranteed to be implemented in
the range [0, 127]. [...]
char buffer[BUFSIZ]= {};
vector<unsigned long> charFrequencies(128); [...]
charFrequencies[ buffer[i] ]++;

will work for all characters independently of the implementation of the
*basic character set* (ASCII, EBCDIC, whatever).


And we can use UCHAR_MAX to ensure we cover ALL char values (cast char to
unsigned char first, of course). No need for any magic constants.

(BTW, your lines are longer than 80 characters, in case you didn't
realize)


I felt no need to cover anything other than plain ol' ASCII. Sorry.
 
