[ ... ]
I think it's a good example of the sort of thing accumulate
should be usable for. Note, however, that accumulate will
normally copy your counter twice for each element, which isn't
going to help performance much. In the past, I've used a
somewhat hacky solution to avoid this:
Right -- as you've seen, I used a truly crufty hack to avoid
it, which worked in this case (at least with the compiler I
tested on -- I'd have to do a bit of work to be sure whether
it's really guaranteed to work).
I think it is, although I don't know if it is intentional. The
standard says very clearly that the function must execute acc =
acc + *i. This may be accidental overspecification, rather than
an explicit intent to allow "funny" operator+ (and operator=, of
course---note that your version still depends on std::vector
recognizing self-assignment, and not doing anything in that
case).
I'll admit that in this case, I rather prefer my solution, with
an explicit binary operator (rather than unexpected semantics
for the + operator), returning a special type, for which a no-op
assignment operator is defined. Both types (the operator and
its return type) have names which are suggestive enough to
discourage the user from using them elsewhere. (In addition, in my
case, the class was carefully designed to contain only very
simple PODs, so that the default copy constructor and copy
assignment operator are appropriate. With the hope that the
compiler knows how to implement these better than I could.)
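A minimal sketch of that scheme might look like the following (all names here are invented for illustration; the original class isn't shown in the thread). The explicit binary operator updates the accumulator in place and returns only a tag type; "assigning" the tag back to the accumulator is a no-op, so the acc = op(acc, *i) inside a typical pre-C++20 accumulate copies nothing:

```cpp
#include <cstddef>

// Tag type: assigning it back to the accumulator does nothing,
// because the real work was already done in place.
struct AlreadyAccumulated {};

struct CharCounter {
    std::size_t counts[256];    // simple PODs only, so the compiler-
                                // generated copy operations are fine
    CharCounter()
    {
        for (int i = 0; i < 256; ++i)
            counts[i] = 0;
    }
    // No-op assignment from the tag type.
    CharCounter& operator=(AlreadyAccumulated) { return *this; }
};

// The explicit binary operator, with a name suggestive enough to
// discourage use anywhere else.
struct AccumulateCharInPlace {
    AlreadyAccumulated operator()(CharCounter& acc, unsigned char ch) const
    {
        ++acc.counts[ch];
        return AlreadyAccumulated();
    }
};
```

Note that this still relies on the classic specification, acc = binary_op(acc, *i), with acc an lvalue; C++20 respecified std::accumulate to move the accumulator into the operation, after which the non-const reference parameter no longer binds -- so, as said above, it remains implementation-sensitive.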
My comment about this being an abuse of accumulate wasn't
about the basic idea of using accumulate to um...accumulate
the statistics -- to me, that seemed perfectly
straightforward. It was about the contortion of having
operator+ return a reference to its input to reduce copying.
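For concreteness, the contortion in question looks roughly like this (a reconstruction, not the original code): operator+ mutates its left operand and returns a reference to it, so the assignment inside accumulate becomes a self-assignment.

```cpp
#include <vector>

struct Histogram {
    std::vector<long> counts;
    Histogram() : counts(256, 0) {}
};

// The crufty part: operator+ modifies its left operand and returns a
// reference to it.  Inside accumulate, "acc = acc + *i" then becomes a
// self-assignment of the vector, which std::vector must detect (or at
// least survive) for this to work.
Histogram& operator+(Histogram& h, char ch)
{
    ++h.counts[(unsigned char)ch];
    return h;
}
```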
Yes. I've often wondered why std::accumulate uses +, and not
+=. Or, more interestingly, why it doesn't use meta-programming
so that it uses += if it exists, and + otherwise -- and simply
"binary_op(acc, *i)" if binary_op takes a non-const reference
for its first argument, and the version with assignment
otherwise.
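A sketch of the last of those variants (a hypothetical function, not anything in the standard library): the operation takes the accumulator by non-const reference and modifies it directly, so nothing is copied per element.

```cpp
// Hypothetical accumulate_in_place: op modifies acc through a
// non-const reference; any return value is ignored, so no per-element
// copies of the accumulator occur.
template <typename InputIt, typename T, typename BinaryOp>
T accumulate_in_place(InputIt first, InputIt last, T init, BinaryOp op)
{
    for (; first != last; ++first)
        op(init, *first);
    return init;
}

// Example operation: plain +=.
struct AddTo {
    void operator()(long& acc, long x) const { acc += x; }
};
```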
Regretfully, I think it's too late to fix it now.
[ ... ]
It will work for EBCDIC, but it won't work for ISO 8859-1
(which is probably the most widespread single byte
encoding).
Right -- as it stands, it assumes that all lower case letters
fall in the range 'a'..'z', which is true for ASCII and
EBCDIC, but not much of anything else. Of course, if you
wanted to make it work for ISO 8859, you'd just use UCHAR_MAX
instead, and use (or cast to) unsigned char instead of char in
a couple of places. You'd only be using a fairly small
fraction of the allocated memory, but for typical values of
UCHAR_MAX (e.g., 255) it's not a big deal, at least as long as
you avoid copying much.
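Concretely, the ISO 8859-1-safe version might look like this sketch (invented code, not from the thread): the table spans the full unsigned char range, and each input character is converted to unsigned char before being used as an index, so high-bit codes such as 0xE9 ('é' in 8859-1) don't produce a negative index.

```cpp
#include <climits>
#include <cstddef>

// Count occurrences of each byte value in a NUL-terminated string.
// counts must have UCHAR_MAX + 1 entries, zero-initialized by the
// caller.
void count_bytes(const char* s, std::size_t counts[])
{
    for (; *s != '\0'; ++s)
        ++counts[(unsigned char)*s];    // cast avoids negative indices
}
```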
After more thought, however, I suspect this might work better
for Unicode/ISO 10646 than I'd previously assumed, at least on
most typical systems. The problem is that the array usage is
quite sparse. As long as we're careful, however, we can get
the virtual memory manager to handle the sparseness for us.
Supposing we have enough memory, that is.
As long as we never "touch" -- not even initialize -- the parts
of the array that don't represent alphabetic characters, the
virtual memory manager won't (at least in a typical case)
allocate any physical memory for that address. Even though the
alphabetic characters aren't all arranged in one place, I
believe they are arranged into clusters -- which should work
pretty well for a virtual memory manager.
I think I might have to write a bit of test code, to figure
out how much memory this really uses with something like UCS-4
(which I think would be about the worst case).
There are two approaches, I think. I have a template class,
StateTable (a poor name, but that's what I first used it for),
which maps UTF-8 to a StateType, using a trie internally; it
allocates lazily, so sequences never seen are never allocated.
The other alternative, of course, is to convert on the fly to
UTF-32, and index directly. That means a table of roughly 8 MB
on most machines (8 bytes per counter), but as you say, there
will be large blocks which will never be accessed. (Outside of
the first 256 characters, blocks with letters are fairly
distinct from those with other types of characters.) If you
declare it static, the system should take care of initializing
it to 0; I don't know how different systems do this, however
(whether they touch the pages or not). Another frequent
solution is a two level table, a table of pointers indexed by
the top bits, to tables of counters indexed by the bottom bits;
if you use lazy allocation for the second level, you should also
save significant memory.
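A sketch of that two-level scheme (invented names, assuming counters over the full Unicode range 0..0x10FFFF, with the bottom 8 bits selecting an entry within a lazily allocated 256-counter block):

```cpp
#include <cstddef>

class SparseCounters {
public:
    SparseCounters()
    {
        for (std::size_t i = 0; i < kBlockCount; ++i)
            blocks_[i] = 0;
    }
    ~SparseCounters()
    {
        for (std::size_t i = 0; i < kBlockCount; ++i)
            delete[] blocks_[i];
    }
    // Returns the counter for code point cp, allocating its
    // 256-entry block (zero-initialized) on first touch.
    std::size_t& operator[](unsigned long cp)
    {
        std::size_t hi = cp >> kLowBits;
        std::size_t lo = cp & (kBlockSize - 1);
        if (blocks_[hi] == 0)
            blocks_[hi] = new std::size_t[kBlockSize]();
        return blocks_[hi][lo];
    }
private:
    static const unsigned    kLowBits    = 8;
    static const std::size_t kBlockSize  = 1u << kLowBits;          // 256
    static const std::size_t kBlockCount = 0x110000u >> kLowBits;   // 4352
    std::size_t* blocks_[kBlockCount];   // top-level table of pointers

    SparseCounters(const SparseCounters&);              // not copyable
    SparseCounters& operator=(const SparseCounters&);
};
```

Here the top-level table is only 4352 pointers; a text that uses a handful of scripts touches only a handful of second-level blocks.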
With any of these solutions, you'll also need a mapping table
for case folding. This shouldn't be that big, however, since
most alphabets (and particularly the big ones) don't have case.
(My StateTable for simple lower case mapping only requires 37
data blocks, of 64 entries each.)