On Jun 11, 3:20 am, Ioannis Vranos <[email protected]> wrote:
Not to pick on your code particularly, but in general, in the
context of the challenge, what is done to ensure that the code
is legal and portable C++? That it works correctly with
multibyte characters, or EBCDIC, or with all of the more exotic
implementations?
It may be perfectly valid to not require all of this, but if so,
the acceptable restrictions should be stated up front.
If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it
works.
He's testing on a Linux machine. The default encoding under
Linux is UTF-8, a multibyte encoding. Unless otherwise stated,
it seems safe to assume that code which doesn't handle multibyte
encodings correctly is incorrect.
Now a newer approach, taking word alignment into account.
const int MAX_TEXT_LENGTH = 65536;
Either the name is a lie, or the constraint is unreasonable.
Files of several Gigabytes are frequent today, and I've seen
some which are in the Terabyte range.
This is actually a count. And on most implementations, an int
will overflow long before you reach the end of a large file.
int freq[256];
double pctFreq[256];
int count = 0;
inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {
Always false with EBCDIC ('A' is 193, and 'z' 169).
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}
int main(int argc, char** argv) {
memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
And formally, this isn't guaranteed to set the values in pctFreq
to 0.0. This is actually what got me wondering to begin with:
reading a double initialized in this way is formally undefined
behavior, so your program doesn't work. Except that I've
never heard of an implementation where it won't actually work.
If the requirement is "portable C++", then he must provide some
sort of filter which will detect this error, and reject your
program because of it. If the requirement isn't portable C++,
of course, then on Ubuntu, I'd use mmap to read the file. And
if he's willing to restrict the input to single-byte
encodings, rather than UTF-8, something like blargg's suggestion,
combined with mmap, should be close to unbeatable. Except that
if he has a multicore machine, it might be worth creating some
additional threads (or even forking new processes), each of
which handles a contiguous block of data in the file, and then
merging the results. (Using different processes would even
allow using mmap for files which don't fit into memory.)
Until we know exactly what the requirements are, it's impossible
to write a program and know whether it fulfills these
requirements.
(FWIW: with any reasonably good implementation, I would expect
using std::fill to be faster than memset. As well as working
for all types, and not just for integral types.)
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
clock_t time1, time2;
time1 = clock();
std::ifstream file(argv[1], std::ios::in);
//std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
checkRange(text[i]);
Why not just:
++ freq[ text[ i ] ] ;
here, and do the case folding once, before the output.
(Blargg's suggestion, but the obvious "rapid" solution.)
Almost. (I'm actually impressed that someone got this close to
correct.) If there's a hardware read error, you've got an
endless loop. Just use "while ( file )" as the condition.
}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq[i] = round((double(freq[i] + freq[i + diff]) / count)
* 10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
}
time2 = clock();
double totalTimeInSeconds = double(time2 - time1) / CLOCKS_PER_SEC;
std::cout << "\n\nThe whole process took " << std::fixed <<
totalTimeInSeconds << " seconds.\n";
return (0);
}