K
Keith H Duggar
Dear news group,
I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:
http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...
The JT solution has a bug causing it to count only 70294 of the
287924224 total characters in the file. The following statement
r=fread(buffer,BUFSIZE,1,fp))
should be
r=fread(buffer,1,BUFSIZE,fp))
otherwise r is always 1 (except for the last read) causing the
inner for loop to examine only the 1st character of each block.
That is why the JT solution is so much faster than the others.
It is much slower when this error is corrected.
Here is actual counts (rather than %) for the bugged and fixed
JT code (where * are totals):
bugged fixed
* 70294 * 287924224
a 3834 a 15704064
b 644 b 2637824
c 2330 c 9543680
d 1838 d 7528448
e 6452 e 26427392
f 1418 f 5808128
g 1050 g 4300800
h 2118 h 8675328
i 4332 i 17743872
j 56 j 229376
k 354 k 1449984
l 1882 l 7708672
m 1312 m 5373952
n 3804 n 15581184
o 5198 o 21291008
p 1552 p 6356992
q 70 q 286720
r 4358 r 17850368
s 3360 s 13762560
t 4888 t 20021248
u 1648 u 6750208
v 654 v 2678784
w 830 w 3399680
x 112 x 458752
y 1292 y 5292032
z 22 z 90112
Finally the format string in
printf("%c %2.3f%%\n", i, (float)(100*counts) / (float)total );
is bugged. It should be
printf("%c %7.3f%%\n", i, (float)(100*counts) / (float)total );
ie %7.3f not %2.3f since floating point field width is the total
width not the width of the portion left of the decimal point. So
the JT timing results should be removed until fixed; and outputs
should be compared against a known correct output showing giving
the actual counts rather than normalized percentages.
KHD