Good ole gnu::hash_map, I'm impressed

Mirco Wahab

Hi,

recently I tried to mimic a simple word frequency
counter in C++ that uses hashing.

Assume we have a fairly big text file (14 MB) that
contains more than 93,000 *different* words. To get the
word count and frequencies, one would use:

[Perl, don't boggle]


my $fn = 'fulltext.txt';

print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }

print "start hashing\n";
my %hash;
++$hash{$1} while $data=~/(\w\w*)/g;

print "done, $fn (" . int(length($data)/1024)
. " KB) has " . (scalar keys %hash) . " different words\n"



This is just the "first part". Build a hash with
all different words (simplified) and count their
frequencies at first.

This runs through on a test machine (P4/2666MHz, Linux, Perl 5.8.8)
in ~4s (real). I'd never expect a "complicated looking" C++
solution to beat that one.

After some fiddling with the gnu::hash_map (which works
also as a drop-in replacement for std::map), I'm stunned:

[File size: 14MB; total words: 2,335,569; different words: 93,405]
std::map C++ implementation - 0m5.739s real
perl %hash implementation - 0m3.981s real
gnu::hash_map C++ implementation - 0m3.597s real ***

(I did three runs for each and took the best one.)

The hash_map implementation did work on gcc from 3.4.4
through 4.3.2 on Linux and Windows (Cygwin).

Q1: Does anybody else (besides me) like to "hash something"?
How do you do that? With Boost? (I couldn't get that working
*without* their (experimental) /tr1 branch on Cygwin
and older Linuxes.)

Q2: Which "future" can be expected regarding "hashing"?

Thanks & Regards

Mirco

PS.: I'll add the gnu::hash_map variant below. In
order to use std::map here, remove the "namespace hack"
at the beginning and modify the typedef. That's it.


==>


#include <boost/regex.hpp>
#include <ext/hash_map> // Gnu gcc specific, switch to <map>
#include <iostream>
#include <fstream>
#include <string>

// allow the gnu hash_map to work on std::string
namespace __gnu_cxx {
    template<> struct hash< std::string > {
        size_t operator()(const std::string& s) const {
            return hash< const char* >()( s.c_str() );
        }
    }; /* gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html */
}

// this is what we would love:
typedef __gnu_cxx::hash_map<std::string, int> Hash; // change to std::map

char *slurp(const char *fname, size_t* len);
int word_freq(const char *block, size_t len, Hash& hash);

int main()
{
    using namespace std;
    size_t len;

    const char *fn = "fulltext.txt";
    cout << "start slurping" << endl;
    char *block = slurp(fn, &len);

    Hash hash;
    cout << "start hashing" << endl;
    int n = word_freq(block, len, hash);
    delete [] block;

    cout << "done, " << fn << " (" << len/1024
         << "KB) has " << n << " different words" << endl;

    return 0;
}

char *slurp(const char *fname, size_t* len)
{
    std::ifstream fh(fname);        // open
    fh.seekg(0, std::ios::end);     // get to EOF
    *len = fh.tellg();              // read file pointer
    fh.seekg(0, std::ios::beg);     // back to pos 0
    char* data = new char [*len+1];
    fh.read(data, *len);            // slurp the file
    return data;
}

int word_freq(const char *block, size_t len, Hash& hash)
{
    using namespace boost;
    match_flag_type flags = match_default;
    static regex r("\\w\\w*");
    cmatch match;

    const char *from=block, *to=block+len;
    while( regex_search(from, to, match, r, flags) ) {
        hash[ std::string(match[0].first, match[0].second) ]++;
        from = match[0].second;
    }
    return hash.size();
}


<==
 
James Kanze

Q1: Does anybody else (besides me) like to "hash something"?
How do you do that?

It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html---the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)

Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)
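
For readers whose compiler already ships the containers mentioned here, the counting step from the original post might look roughly like the sketch below. It assumes a C++11 standard library and replaces Boost.Regex with a hand-written scan over runs of letters, digits and underscores, which only approximates `\w\w*` for ASCII text. The names `count_words` and `is_word_char` are made up for this illustration and do not come from the thread.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

typedef std::unordered_map<std::string, int> Hash;

// Approximates the regex class \w for ASCII input:
// letters, digits and underscore (an assumption, not Boost's definition).
static bool is_word_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Count every maximal run of word characters in text.
// Returns the number of different words, like word_freq() in the post.
std::size_t count_words(const std::string& text, Hash& hash)
{
    std::size_t i = 0, n = text.size();
    while (i < n) {
        while (i < n && !is_word_char(text[i])) ++i;  // skip separators
        std::size_t start = i;
        while (i < n && is_word_char(text[i])) ++i;   // scan one word
        if (i > start)
            ++hash[text.substr(start, i - start)];    // insert or increment
    }
    return hash.size();
}
```

No `hash<std::string>` specialization is needed in this variant, because the standard library provides one for `std::string`.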
 
Mirco Wahab

Alf said:
Boost only provides a partial implementation of the TR1 hash
functionality, namely the pure hashing functions, <url:
http://www.boost.org/doc/libs/1_35_0/doc/html/hash.html>.

Too bad this doesn't really work on my platforms. According
to the docs, Boost's hash should plug into an existing hash_map/_set
if the underlying library provides one. This doesn't work out of the box,
afaik, in Visual Studio 9 (has a stdext::hash_map) or in
gcc 4.3.2 (has a __gnu_cxx::hash_map/_set). So it's more or less
unusable (at least for me).

Alf said:
For hash table implementation it relies on non-standard Boost
multi-index container library, <url:
http://www.boost.org/doc/libs/1_35_0/libs/multi_index/doc/index.html>.

I don't know whether that can be wrapped in some way to resemble the TR1
classes.

I tried this yesterday (for almost an hour) and threw in the
towel. I couldn't get this to work with my problem from
the original posting.

Thanks & Regards.

M.

PS.:
my working solution for now has the following prologue
on top of the source file:

...
#include <hash_map>
#if defined (_MSC_VER)
    typedef stdext::hash_map<std::string, int> Hash;
#else
    // allow the gnu hash_map to work on std::string
    namespace __gnu_cxx {
        template<> struct hash< std::string > {
            size_t operator()(const std::string& s) const {
                return hash< const char* >()( s.c_str() );
            }
        };
    }
    typedef __gnu_cxx::hash_map<std::string, int> Hash;
#endif
...

and later:


...
Hash hash;
...
hash[ some_string ]++
...


This works fine with all gcc variants I have, plus
the current Visual C++ compilers.
 
Lionel B

On Jul 16, 10:53 pm, Mirco Wahab wrote:

[...]
There will be an std::unordered_set and std::unordered_map in the next
version of the standard, implemented using hash tables, and there will
be standard hash functions for most of the common types.

GNU g++ has supported those for quite a while in tr1, it seems.

(I wonder, however. Is the quality of the hashing function going to be
guaranteed?)

By whom/what? I don't think the standard makes any guarantees. I've only
got a draft here, which says just:

6.3.3 Class template hash [tr.unord.hash]

1 The unordered associative containers defined in this clause use
specializations of hash as the default hash function. This class template
is only required to be instantiable for integer types
([basic.fundamental]), floating point types ([basic.fundamental]),
pointer types ([dcl.ptr]), and std::string and std::wstring.

template <class T>
struct hash : public std::unary_function<T, std::size_t>
{
    std::size_t operator()(T val) const;
};

2 The return value of operator() is unspecified, except that equal
arguments yield the same result. operator() shall not throw exceptions.

Still, you can always roll your own [possibly inappropriate metaphor
alert]
 
Mirco Wahab

James said:
It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html---the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)

Ah, thanks for the links. I'll work through it. I see, you
took relatively small working sets. (I considered my 14MB
setup "small" ;-)

I'd try to use your implementation for comparison, but
I don't know which files are really necessary. Do you
have a .zip of the hash stuff?

James said:
There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)

We'll see - if some usable implementations show up. In the mean time,
the old hash_map seems to be "good enough" for my kind of stuff.
I did additional tests regarding the *reading* speed from the map.

The whole problem would be now:

1) read a big text to memory (14 MB here)
2) tokenize it (by simple regex, this seems to be very fast or fast enough)
3) put the tokens (words) into a hash and/or increment their frequencies
4) sort the hash keys (the words) according to their frequencies into a vector
5) report highest (2) and lowest (1) frequencies

Now I added 4 and 5. The tree-based std::map falls further behind
(as expected). The ext/hash_map keeps its margin.

std::map (1-5) 0m8.227s real
Perl (1-5) 0m4.732s real
ext/hash_map (1-5) 0m4.465s real

Maybe I didn't find the optimal solution for copying the hash keys to the
vector (I'll add the source at the end).
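
One possible way to avoid copying some 93,000 strings is to collect pointers to the map's entries and sort the pointers by count. The sketch below assumes `std::map` for brevity (the hash_map typedef from the post should work the same way), and `Entry`, `less_count` and `sorted_by_count` are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> Hash;   // or the hash_map typedef
typedef Hash::value_type Entry;            // pair<const std::string, int>

// Order two entries by ascending count.
static bool less_count(const Entry* a, const Entry* b)
{
    return a->second < b->second;
}

// Collect a pointer to every entry and sort the pointers by count.
// The pointers are only usable while hash is alive and the entries
// they refer to have not been erased.
std::vector<const Entry*> sorted_by_count(const Hash& hash)
{
    std::vector<const Entry*> entries;
    entries.reserve(hash.size());
    for (Hash::const_iterator p = hash.begin(); p != hash.end(); ++p)
        entries.push_back(&*p);
    std::sort(entries.begin(), entries.end(), less_count);
    return entries;
}
```

With this layout, `entries.front()->first` is the least frequent word and `entries.back()->second` is the highest count, so no second lookup in the map is needed for the report.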

From "visual inspection" of the test runs, it
can be seen that the array handling (copying
from hash to vector) is very efficient in Perl.

Furthermore, I ran into the problem of how to access the hash values
from a sort function. The only solution that (imho) doesn't involve
enormous complexity is to make the hash a module-global. How can I cure that?

Regards

M.

Addendum:

[perl source] ==>
my $fn = 'fulltext.txt';
print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }

my %hash;
print "start hashing\n";
++$hash{$1} while $data =~ /(\w\w*)/g;

print "start sorting (ascending, for frequencies)\n";
my @keys = sort { $hash{$a} <=> $hash{$b} } keys %hash;

print "done, $fn (" . int(length($data)/1024) . " KB) has "
. (scalar keys %hash) . " different words\n";

print "infrequent: $keys[0] = $hash{$keys[0]} times\n"
. "very often: $keys[-2] = $hash{$keys[-2]} times\n"
. "most often: $keys[-1] = $hash{$keys[-1]} times\n"
<==

[hash_map source]==>
#include <boost/regex.hpp>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

// define this to use the tree-based std::map
#ifdef USE_STD_MAP
    #include <map>
    typedef std::map<std::string, int> StdHash;
#else
    #if defined (_MSC_VER)
        #include <hash_map>
        typedef stdext::hash_map<std::string, int> StdHash;
    #else
        #include <ext/hash_map>
        namespace __gnu_cxx {
            template<> struct hash< std::string > {
                size_t operator()(const std::string& s) const {
                    return hash< const char* >()( s.c_str() );
                } // gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html
            }; // allow the gnu hash_map to work on std::string
        }
        typedef __gnu_cxx::hash_map<std::string, int> StdHash;
    #endif
#endif

char *slurp(const char *fname, size_t* len);
size_t word_freq(const char *block, size_t len, StdHash& hash);

// *** ouch, make it a module global? ***
StdHash hash;
// *** how do we better compare on the external hash? ***
struct ExtHashSort { // comparison functor for sort()
    bool operator()(const std::string& a, const std::string& b) const {
        return hash[a] < hash[b];
    }
};

int main()
{
    using namespace std;
    size_t len, nwords;

    const char *fn = "fulltext.txt";        // about 14 MB
    cout << "start slurping" << endl;
    char *block = slurp(fn, &len);          // read file into memory

    // StdHash hash; no more!
    cout << "start hashing" << endl;
    nwords = word_freq(block, len, hash);   // put words into a hash
    delete [] block;                        // no longer needed

    cout << "done, " << fn << " (" << len/1024
         << "KB) has " << nwords << " different words" << endl;

    vector<string> keys;
    keys.reserve(nwords);

    cout << "sorting out the longest and shortest words" << endl;
    StdHash::const_iterator p, end;         // copy keys to vector
    for(p=hash.begin(),end=hash.end(); p!=end; ++p) keys.push_back(p->first);
    sort(keys.begin(), keys.end(), ExtHashSort()); // sort by hashed number value

    cout << "infrequent:" << keys[0] << "=" << hash[keys[0]] << " times\n"
         << "very often:" << keys[nwords-2] << "=" << hash[keys[nwords-2]] << " times\n"
         << "most often:" << keys[nwords-1] << "=" << hash[keys[nwords-1]] << " times\n";

    return 0;
}

char *slurp(const char *fname, size_t* len)
{
    std::ifstream fh(fname);        // open
    fh.seekg(0, std::ios::end);     // get to EOF
    *len = fh.tellg();              // read file pointer
    fh.seekg(0, std::ios::beg);     // back to pos 0
    char* data = new char [*len+1];
    fh.read(data, *len);            // slurp the file
    return data;
}

size_t word_freq(const char *block, size_t len, StdHash& hash)
{
    using namespace boost;
    match_flag_type flags = match_default;
    static regex r("\\w\\w*");
    cmatch match;

    const char *from=block, *to=block+len;
    while( regex_search(from, to, match, r, flags) ) {
        hash[ std::string(match[0].first, match[0].second) ]++;
        from = match[0].second;
    }
    return hash.size();
}
<==
 
Gernot Frisch

char *slurp(const char *fname, size_t* len)
{
    std::ifstream fh(fname);        // open
    fh.seekg(0, std::ios::end);     // get to EOF
    *len = fh.tellg();              // read file pointer
    fh.seekg(0, std::ios::beg);     // back to pos 0
    char* data = new char [*len+1];
    fh.read(data, *len);            // slurp the file
    return data;
}

Try using fopen/fread here. Some implementations of the C++ fstreams are
much, much slower than the CRT.
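
A minimal sketch of what such a replacement could look like, assuming only the standard C stdio functions. The name `slurp_stdio` and the 64 KB chunk size are arbitrary choices for this example, and whether it is actually faster than the ifstream version would have to be measured on the platform in question.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Read a whole file with fopen/fread into data.
// Returns false if the file cannot be opened or a read error occurs.
bool slurp_stdio(const char* fname, std::vector<char>& data)
{
    std::FILE* fp = std::fopen(fname, "rb");  // binary: no newline translation
    if (!fp) return false;

    data.clear();
    char chunk[65536];
    std::size_t got;
    // Reading in chunks avoids asking for the file size up front.
    while ((got = std::fread(chunk, 1, sizeof chunk, fp)) > 0)
        data.insert(data.end(), chunk, chunk + got);

    bool ok = !std::ferror(fp);
    std::fclose(fp);
    return ok;
}
```

Returning the data in a `std::vector<char>` also removes the need for the matching `delete []` in the caller.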
 
James Kanze

[...]
Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map
in the next version of the standard, implemented using hash
tables, and there will be standard hash functions for most
of the common types.

GNU g++ has supported those for quite a while in tr1, it seems.

By whom/what?

By the standard.

I don't think the standard makes any guarantees. I've only got
a draft here, which says just:

6.3.3 Class template hash [tr.unord.hash]

1 The unordered associative containers defined in this clause use
specializations of hash as the default hash function. This class template
is only required to be instantiable for integer types
([basic.fundamental]), floating point types ([basic.fundamental]),
pointer types ([dcl.ptr]), and std::string and std::wstring.

template <class T>
struct hash : public std::unary_function<T, std::size_t>
{
    std::size_t operator()(T val) const;
};

2 The return value of operator() is unspecified, except that equal
arguments yield the same result. operator() shall not throw exceptions.

That's about what I expected, and more or less what I said.
(But I do hope they add a few more types. There's no way you
can write a hash function on std::type_info, for example, yet it
seems quite reasonable to me to want to use it as an index in an
unordered_map.)

Still, you can always roll your own [possibly inappropriate
metaphor alert]

Which, for most people, is likely to be worse than whatever is
in the library; while there are no guarantees, I'm willing to
bet that most implementations will do something which is fairly
good most of the time. (But if you're willing to consider
special data sets, it's relatively trivial to get tons of
collisions with the string hashing functions in g++'s
implementation.)
 
James Kanze

Ah, thanks for the links. I'll work through it. I see, you
took relatively small working sets. (I considered my 14MB
setup "small" ;-)

Basically, I took what I had handy, or could easily generate.
And I intentionally used sets of very different sizes, because
part of my goal was to determine at what point hash tables
started significantly beating std::map. (At the time, there was
no proposal for a standard hash table, and it was a question of
how many entries did one need before going to something
non-standard.)

With regards to data sets, there are at least two others that
I'd like to add: a very big set (more than 10000 entries) of
URL's, and a set of all two character strings. I can generate,
and in fact have generated the latter, but I don't know off hand
where to find the former.

I'd try to use your implementation in comparison but
don't know which files are really necessary. Do you
have a .zip of the hash stuff?

Not of just the hash stuff; you'd have to down-load the entire
library. There aren't too many files in the Hashing component,
however, and it shouldn't be too difficult to remove the
dependencies that it has on other files. (The only one which
comes to mind is that it depends on <gb/stdint.h> for
GB_uint32_t. If your compiler has <stdint.h>, you can use it
and uint32_t instead.)

You can also look at the benchmark code in the Benchmark
sub-system. There are a lot of dependencies there, since it
uses my usual BenchHarness, but it shouldn't be too difficult to
extract the actual hash algorithms to play around with.

We'll see - if some usable implementations show up. In the mean time,
the old hash_map seems to be "good enough" for my kind of stuff.
I did additional tests regarding the *reading* speed from the map.

The whole problem would be now:

1) read a big text to memory (14 MB here)
2) tokenize it (by simple regex, this seems to be very fast or fast enough)
3) put the tokens (words) into a hash and/or increment their frequencies
4) sort the hash keys (the words) according to their frequencies into a vector
5) report highest (2) and lowest (1) frequencies

Now I added 4 and 5. The tree-based std::map falls further behind
(as expected). The ext/hash_map keeps its margin.

std::map (1-5) 0m8.227s real
Perl (1-5) 0m4.732s real
ext/hash_map (1-5) 0m4.465s real

Just curious, but what is the time for just reading the file? I
wouldn't be surprised if that doesn't account for a large part
of the time. In which case, the biggest speed up might be
there: using system level IO or memory mapping the file.
(Neither of those would be portable, though.)

Maybe I didn't find the optimal solution for copying the hash
keys to the vector (I'll add the source at the end).

Since you're reading the entire file into a single block of
contiguous memory (for which you really could use std::vector),
you really don't have to ever copy anything. Just put a pointer
to the start of the word in your data structures, and put a nul
character behind the word. (This should definitely speed things
up compared to using string: no more dynamic allocations at
all.)
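
The idea of keying on pointers into the buffer could be sketched roughly as follows. This is only an illustration under several assumptions: a C++11 `std::unordered_map`, a buffer that ends with one spare `'\0'`, a hand-written ASCII word scan instead of the regex, and FNV-1a (with its 32-bit constants) as a stand-in hash function. `CStrHash`, `CStrEqual`, `PtrHash` and `count_words_in_place` are names invented for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

// FNV-1a over a NUL-terminated string (an assumed choice of hash).
struct CStrHash {
    std::size_t operator()(const char* s) const {
        std::size_t h = 2166136261u;
        for (; *s; ++s) {
            h ^= static_cast<unsigned char>(*s);
            h *= 16777619u;
        }
        return h;
    }
};

// Compare keys by content, not by pointer value.
struct CStrEqual {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) == 0;
    }
};

typedef std::unordered_map<const char*, int, CStrHash, CStrEqual> PtrHash;

static bool is_word_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Tokenize buf in place: each word is terminated by overwriting the
// separator that follows it, and the map stores pointers into buf.
// The keys are valid only as long as buf is neither resized nor destroyed.
std::size_t count_words_in_place(std::vector<char>& buf, PtrHash& hash)
{
    assert(!buf.empty() && buf.back() == '\0');  // spare NUL required
    char* p = buf.data();
    char* end = p + buf.size() - 1;              // points at the spare NUL
    while (p < end) {
        while (p < end && !is_word_char(*p)) ++p;  // skip separators
        char* start = p;
        while (p < end && is_word_char(*p)) ++p;   // scan one word
        if (p > start) {
            *p = '\0';                 // terminate the word in place
            ++hash[start];             // no string copy, no allocation per key
            if (p < end) ++p;          // step past the overwritten separator
        }
    }
    return hash.size();
}
```

The map nodes themselves are still allocated dynamically; what disappears is the per-word `std::string` construction.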

From "visual inspection" of the test runs, it
can be seen that the array handling (copying
from hash to vector) is very efficient in Perl.

Furthermore, I ran into the problem of how to access the hash
values from a sort function. The only solution that (imho)
doesn't involve enormous complexity is to make the hash a
module-global. How can I cure that?

Pass by reference?
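
One way to read that suggestion: give the comparison functor a constructor that takes the map by reference, so the map can stay a local variable in main(). The sketch below assumes `std::map` for brevity (the hash_map typedef should behave the same), and `ByCount` and `keys_by_count` are hypothetical names, not code from the thread.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> StdHash;   // or the hash_map typedef

// Comparison functor that remembers which map to look into.
class ByCount {
public:
    // The map is passed by reference; a pointer is stored internally
    // so that the functor stays copyable and assignable.
    explicit ByCount(const StdHash& hash) : hash_(&hash) {}

    bool operator()(const std::string& a, const std::string& b) const {
        return count(a) < count(b);
    }

private:
    // find() instead of operator[]: works on a const map and never inserts.
    int count(const std::string& key) const {
        StdHash::const_iterator p = hash_->find(key);
        return p == hash_->end() ? 0 : p->second;
    }

    const StdHash* hash_;
};

// Return the keys of hash ordered by ascending frequency.
std::vector<std::string> keys_by_count(const StdHash& hash)
{
    std::vector<std::string> keys;
    keys.reserve(hash.size());
    for (StdHash::const_iterator p = hash.begin(); p != hash.end(); ++p)
        keys.push_back(p->first);
    std::sort(keys.begin(), keys.end(), ByCount(hash));
    return keys;
}
```

In the program above, this would replace the global `hash` and `ExtHashSort` with a local `StdHash hash;` and a call such as `sort(keys.begin(), keys.end(), ByCount(hash));`.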
 
