M
Mirco Wahab
Hi,
recently I tried to mimic a simple word frequency
counter in C++ that uses hashing.
Assumed we have a somehow big text file (14 MB) that
contains >93,000 *different* words. To get word
count and frequencies, one would use:
[Perl, don't boggle]
my $fn = 'fulltext.txt';
print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }
print "start hashing\n";
my %hash;
++$hash{$1} while $data=~/(\w\w*)/g;
print "done, $fn (" . int(length($data)/1024)
. " KB) has " . (scalar keys %hash) . " different words\n"
This is just the "first part". Build a hash with
all different words (simplified) and count their
frequencies at first.
This runs through on a test machine (P4/2666MHz, Linux, Perl 5.8.8)
in ~4s (real). I'd never expect a "complicated looking" C++
solution to beat that one.
After some fiddling with the gnu::hash_map (which works
also as a drop-in replacement for std::map), I'm stunned:
[File size: 14MB; total words: 2,335,569; different words: 93,405]
std::map C++ implementation - 0m5.739s real
perl %hash implementation - 0m3.981s real
gnu::hash_map C++ implementation - 0m3.597s real ***
(I did three runs for each and took the best one.)
The hash_map implementation did work on gcc from 3.4.4
through 4.3.2 on Linux and Windows (Cygwin).
Q1: Does anybody else (besides me) like to "hash something"?
How do you do that? Boost (I didn't get this working
*without* their (experimental) /tr1 branch on Cygwin
and older Linuxes).
Q2: Which "future" can be expected regarding "hashing"?
Thanks & Regards
Mirco
PS.: I'll add the gnu::hash_map variant below. In
order to use std::map here, remove the "namespace hack"
on the beginning and modify the typedef. Thats it.
==>
#include <boost/regex.hpp>
#include <ext/hash_map> // Gnu gcc specific, switch to <map>
#include <iostream>
#include <fstream>
#include <string>
// allow the gnu hash_map to work on std::string
namespace __gnu_cxx {
template<> struct hash< std::string > {
size_t operator()(const std::string& s) const {
return hash< const char* >()( s.c_str() );
}
}; /* gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html */
}
// this is what we would love:
typedef __gnu_cxx::hash_map<std::string, int> Hash; // change to std::map
char *slurp(const char *fname, size_t* len);
int word_freq(const char *block, size_t len, Hash& hash);
int main()
{
using namespace std;
size_t len;
const char *fn = "fulltext.txt";
cout << "start slurping" << endl;
char *block = slurp(fn, &len);
Hash hash;
cout << "start hashing" << endl;
int n = word_freq(block, len, hash);
delete [] block;
cout << "done, " << fn << " (" << len/1024
<< "KB) has " << n << " different words" << endl;
return 0;
}
char *slurp(const char *fname, size_t* len)
{
std::ifstream fh(fname); // open
fh.seekg(0, std::ios::end); // get to EOF
*len = fh.tellg(); // read file pointer
fh.seekg(0, std::ios::beg); // back to pos 0
char* data = new char [*len+1];
fh.read(data, *len); // slurp the file
return data;
}
int word_freq(const char *block, size_t len, Hash& hash)
{
using namespace boost;
match_flag_type flags = match_default;
static regex r("\\w\\w*");
cmatch match;
const char *from=block, *to=block+len;
while( regex_search(from, to, match, r, flags) ) {
hash[ std::string(match[0].first, match[0].second) ]++;
from = match[0].second;
}
return hash.size();
}
<==
recently I tried to mimic a simple word frequency
counter in C++ that uses hashing.
Assumed we have a somehow big text file (14 MB) that
contains >93,000 *different* words. To get word
count and frequencies, one would use:
[Perl, don't boggle]
my $fn = 'fulltext.txt';
print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }
print "start hashing\n";
my %hash;
++$hash{$1} while $data=~/(\w\w*)/g;
print "done, $fn (" . int(length($data)/1024)
. " KB) has " . (scalar keys %hash) . " different words\n"
This is just the "first part". Build a hash with
all different words (simplified) and count their
frequencies at first.
This runs through on a test machine (P4/2666MHz, Linux, Perl 5.8.8)
in ~4s (real). I'd never expect a "complicated looking" C++
solution to beat that one.
After some fiddling with the gnu::hash_map (which works
also as a drop-in replacement for std::map), I'm stunned:
[File size: 14MB; total words: 2,335,569; different words: 93,405]
std::map C++ implementation - 0m5.739s real
perl %hash implementation - 0m3.981s real
gnu::hash_map C++ implementation - 0m3.597s real ***
(I did three runs for each and took the best one.)
The hash_map implementation did work on gcc from 3.4.4
through 4.3.2 on Linux and Windows (Cygwin).
Q1: Does anybody else (besides me) like to "hash something"?
How do you do that? Boost (I didn't get this working
*without* their (experimental) /tr1 branch on Cygwin
and older Linuxes).
Q2: Which "future" can be expected regarding "hashing"?
Thanks & Regards
Mirco
PS.: I'll add the gnu::hash_map variant below. In
order to use std::map here, remove the "namespace hack"
on the beginning and modify the typedef. Thats it.
==>
#include <boost/regex.hpp>
#include <ext/hash_map> // Gnu gcc specific, switch to <map>
#include <iostream>
#include <fstream>
#include <string>
// allow the gnu hash_map to work on std::string
namespace __gnu_cxx {
template<> struct hash< std::string > {
size_t operator()(const std::string& s) const {
return hash< const char* >()( s.c_str() );
}
}; /* gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html */
}
// this is what we would love:
typedef __gnu_cxx::hash_map<std::string, int> Hash; // change to std::map
char *slurp(const char *fname, size_t* len);
int word_freq(const char *block, size_t len, Hash& hash);
int main()
{
using namespace std;
size_t len;
const char *fn = "fulltext.txt";
cout << "start slurping" << endl;
char *block = slurp(fn, &len);
Hash hash;
cout << "start hashing" << endl;
int n = word_freq(block, len, hash);
delete [] block;
cout << "done, " << fn << " (" << len/1024
<< "KB) has " << n << " different words" << endl;
return 0;
}
char *slurp(const char *fname, size_t* len)
{
std::ifstream fh(fname); // open
fh.seekg(0, std::ios::end); // get to EOF
*len = fh.tellg(); // read file pointer
fh.seekg(0, std::ios::beg); // back to pos 0
char* data = new char [*len+1];
fh.read(data, *len); // slurp the file
return data;
}
int word_freq(const char *block, size_t len, Hash& hash)
{
using namespace boost;
match_flag_type flags = match_default;
static regex r("\\w\\w*");
cmatch match;
const char *from=block, *to=block+len;
while( regex_search(from, to, match, r, flags) ) {
hash[ std::string(match[0].first, match[0].second) ]++;
from = match[0].second;
}
return hash.size();
}
<==