Hash table in C++? STL?

J

John Harrison

Thanks for everyone's help so it looks like map is the standard and
should be available for any platform? But hash_map is not standard.

Right.

The reason I am asking for hash table is because I need to write a
function that finds the overlap of two integer array. I am currently
doing:

#include<stdlib.h>
#include<string.h>
#include<sys/int_limits.h>

#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two + INT16_MAX])
printf("%d\n",two);
free(buffer);
}

int main(int argc,char** argv){
int a[] = {1,3,45,6,4,-1,8};
int b[] = {4,2,4,222,8,45,-1};
find(a,7,b,7);
}

this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!


I think the standard answer would be to use a set. This is untested code.

#include <set>
#include <algorithm>
#include <iterator>
#include <iostream>

void find(int* one,int c1,int* two,int c2)
{
std::set<int> common;
common.insert(one, one + c1);
common.insert(two, two + c2);
// output common elements to cout
std::copy(common.begin(), common.end(),
std::eek:stream_iterator<int>(std::cout, "\n"));
}

That is an O(N lg N) algorithm.

john
 
K

Kevin Goodsell

Siemel said:
I did some research, and they said multiplying by the alphabet size (ie.
number of valid chars).

I don't recall that. The particular version I was talking about has been
discussed from time to time in the newsgroups, and I found one such
discussion here:

http://groups.google.com/[email protected]

If I did that right, the link should take you to Richard Heathfield's
reply to the thread's original poster.

They mention a few points. First, K&R2 contains an almost identical
hashing function on page 144 using 31 as the constant. Second, the
values suggested by K&P were apparently 31 and 37. Finally, Chris Torek
suggests 33, a value he got from James Gosling. During widespread
experimentation for the Berkley DB library, it did better on average
than 31.

-Kevin
 
P

pembed2003

John Harrison said:
http://www.dinkumware.com/refxcpp.html

but remember because hasp_map is non standard the description here might
differ from the implementation you have (if you have one at all).

john

I read the documentation but I don't fully understand it. I don't want
to sound like a dump axx but so far, I have tried:

#include<map>
#include<string>

using namespace std;

int main(int argc,char** argv){
std::map<std::string,int>m();
m.insert("hello",1); // doesn't compile? why?
return 0;
}

I am sure I did something wrong maybe even in the constructor. What I
mean to do is:

* m is a map whoes key is a std::string and values will be of type int
* insert the string "hello" into m with value 1.

I am trying to read the man page with 'man map.h' but nothing show up.
Where is the man page and how can I find map.h if I want to look at
it?

Can someone tell me what's wrong? Can you post a short working example
so I can learn from it? Thanks!
 
K

Kevin Goodsell

pembed2003 said:
I read the documentation but I don't fully understand it. I don't want
to sound like a dump axx but so far, I have tried:

#include<map>
#include<string>

using namespace std;

int main(int argc,char** argv){
std::map<std::string,int>m();

I don't think you really wanted to declare a function called 'm' that
takes no arguments and returns a std::map. Lose the parentheses.
m.insert("hello",1); // doesn't compile? why?

I don't believe there is a form of map::insert that takes two arguments.
Try one of these:

m["hello"] = 1;

OR

m.insert(std::map<std::string, int>::value_type("hello", 1));

Obviously I'd recommend the former for this case. When you do need the
insert() function it might help to have a typedef handy (in fact, this
is nice in many cases):

typedef std::map<std::string, int> Map;
Map m;
m.insert(Map::value_type("hello", 1));
return 0;
}

I am sure I did something wrong maybe even in the constructor. What I
mean to do is:

* m is a map whoes key is a std::string and values will be of type int
* insert the string "hello" into m with value 1.

I am trying to read the man page with 'man map.h' but nothing show up.
Where is the man page and how can I find map.h if I want to look at
it?

1) 'man pages' are not defined by the standard. If you have them, the
names, locations, commands, etc. depend on your system.

2) Why would it be map.h? The header is <map>. Typically the file name
is the same as the name in the brackets, i.e. 'map'. This is not
required of course, but it's usually the case.

3) Locations of headers (if they exist as files -- which they typically
do) depend on your implementation. Consult the documentation or a group
that discusses your system and/or compiler.

-Kevin
 
K

Kevin Goodsell

Kevin said:
I don't believe there is a form of map::insert that takes two arguments.

Come to think of it, that's wrong. There is a form that takes a "hint"
argument, for example. But there is not one that takes the key and value
as separate arguments, I'm fairly sure.

-Kevin
 
J

Jerry Coffin

[ ... ]
size_t hash(const char * s, int buckets) {
size_t h = 0;
for ( ; *s; ++s) h += *s;
return h%bucket.
}

But we need hash only the first maxchars characters.

As hash functions go, this is better than many, but less than ideal
for general use. Long strings might produce values larger than a
size_t can represent, producing values that vary from one
implementation to another, while short strings can only produce quite
small values (e.g. with 8-bit characters, a 3-character string can
only produce a value up to 765 and typical English text will only
produce values in a MUCH more restricted range). As one final caveat,
since it uses plain char, which might be signed, it could produce
negative hash values, which can often cause problems.

One way of dealing with this is something like this:

// warning: untested code.

size_t hash(unsigned char const *key, size_t buckets) {

const int bits = 32; // assume 32 bits is enough for the hash
value.
size_t h = 0x55555555; // alternating 0's and 1's.
size_t carry;
size_t shift = bits - CHAR_BIT;
size_t mask = ((1 << CHAR_BIT) -1) << shift;

while (*key) {
h ^= *key++;
carry = (h & mask) >> shift;
h <<= CHAR_BIT;
h |= carry;
}
return h % buckets;
}

The left-shift is to propagate the effects of the input characters
across the entire hash value more quickly. The right-shift (producing
an N-bit rotate) keeps from losing the effects of the characters from
early in the string.

The result is that every bit in the input string has an effect on the
output, and the effects of the characters get distributed throughout
the output fairly quickly.

Results are undoubtedly open to variation, but most of the
"improvements" I've tried to this scheme have taken more time than
they saved.
The hash may increase numbuckets dynamically as it grows, so numbuckets is a
parameter to operator(). But the number of chars to hash is a constant, and
should not change as we increase or decrease the number of buckets, so it is
argument to the constructor and gets stored in the class.

In typical use, you normally want to hash the entire key.
 
P

pembed2003

Siemel Naran said:
pembed2003 said:
#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two + INT16_MAX])
printf("%d\n",two);
free(buffer);
}


It seems you don't use the first INT16_MAX chars of buffer, or maybe I'm
reading something wrong.
int main(int argc,char** argv){
int a[] = {1,3,45,6,4,-1,8};
int b[] = {4,2,4,222,8,45,-1};
find(a,7,b,7);
}

this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!

If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.


Is my solution a O(N^2) solution? I think it's a O(N) right?
Becasue...

Assume array a and b both have N elements, the program loops through
both array exactly once, so we have:

for(i = 0; i < N; i++) // O(N)
array_one = 1;
for(i = 0; i < N; i++) // O(N)
does array_two in array_one

N + N = 2N

which is a O(N)
Either loop over the chars in 'one' or 'two', then over the chars in the
other loop. If you want to use STL, you can use std::count_if and supply a
binary predicate that employs std::find (the binary predicate will have a
constructor that takes the array 'one' as an argument).

Will a binary search beat a hash lookup? I know this has no exact
answer but generally speaking will a binary search be faster than a
hash lookup?
Note you can also sort the elements in 'one' and 'two', then loop over both
containers simultaneously. Running time O(lg(N)) for the sort if using
std::sort which normally uses quicksort, and O(N) for the scan. I don't
think the standard provides such an algorithm though.

But this solution requires:

1. Sort one array with best case O(N*log N) using a quicksort
2. Loop through another array lookup for dups which is a O(N)

I think it goes like:

sort(array_one); // O(N*log N)
for(i = 0; i < N; i++) // O(N)
look for array_two in array_one

which is a O(N*log N) algr. It doesn't look like it will beat my O(N)
algr. Am I right?

After I discovered map, I am actually doing:

void find(int* a,int c1,int* b,int c2){
std::map<int,int> m;
for(int i = 0; i < c1; i++)
m[a] = 1;
for(int j = 0; j < c2; j++)
if(m[b[j]])
std::cout<<b[j]<<std::endl;
}

What do you think about that? Thanks!
 
S

Siemel Naran

I don't believe there is a form of map::insert that takes two arguments.

Come to think of it, that's wrong. There is a form that takes a "hint"
argument, for example. But there is not one that takes the key and value
as separate arguments, I'm fairly sure.

Right. There's also another 2 argument insert, namely map::insert(Iter
begin, Iter end) where Iter is a template argument. It's basically batch
insert. If the range is sorted then the running time of the algorithm is
O(N). If the range is not sorted then the running time is O(N*lg(N)). So
this is a really smart function that detects if your range is sorted! I
think it probably calls the 2 argument insert where the first argument is a
hint.
 
S

Siemel Naran

Prefer to use const size_t instead of #define.
void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two + INT16_MAX])
printf("%d\n",two);
free(buffer);
}
this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!


If the array size 'a' or 'one' is small then the O(N^2) algorithm is
fine.

Is my solution a O(N^2) solution? I think it's a O(N) right?


It is O(N). Sorry for confusing. The O(N^2) algorithm is when you scan all
the chars in 'one', and for each scan all the chars in 'two'.
Becasue...

Assume array a and b both have N elements, the program loops through
both array exactly once, so we have:

for(i = 0; i < N; i++) // O(N)
array_one = 1;
for(i = 0; i < N; i++) // O(N)
does array_two in array_one

N + N = 2N

which is a O(N)
Right.
Either loop over the chars in 'one' or 'two', then over the chars in the
other loop. If you want to use STL, you can use std::count_if and supply a
binary predicate that employs std::find (the binary predicate will have a
constructor that takes the array 'one' as an argument).

Will a binary search beat a hash lookup? I know this has no exact
answer but generally speaking will a binary search be faster than a
hash lookup?


Guessing hash lookup is faster in general. It's worth running an experiment
to see for your case. Who knows, maybe the hash function is too complicated
to compute and the sorted array is faster. It's pretty easy to all methods
given you have an implementation of hash_map available. If you can, post
your results if you do the experiment, because it's nice to know. (State
the size of N, result using: original algorithm, hash_map, std::map,
binary_search, N^2 method, double sort).
Note you can also sort the elements in 'one' and 'two', then loop over both
containers simultaneously. Running time O(lg(N)) for the sort if using
std::sort which normally uses quicksort, and O(N) for the scan. I don't
think the standard provides such an algorithm though.

But this solution requires:

1. Sort one array with best case O(N*log N) using a quicksort
2. Loop through another array lookup for dups which is a O(N)

I think it goes like:

sort(array_one); // O(N*log N)
for(i = 0; i < N; i++) // O(N)
look for array_two in array_one

which is a O(N*log N) algr. It doesn't look like it will beat my O(N)
algr. Am I right?


Yes, I think your O(N) algorithm is probably best, but it consumes lots of
space. Nevertheless, it is possible that the binary search method is faster
than hash_table.

Also, you can sort both arrays, and walk through each simultaneously in
linear time.
After I discovered map, I am actually doing:

void find(int* a,int c1,int* b,int c2){
std::map<int,int> m;
for(int i = 0; i < c1; i++)
m[a] = 1;
for(int j = 0; j < c2; j++)
if(m[b[j]])
std::cout<<b[j]<<std::endl;
}

What do you think about that? Thanks!


Looks right. You can use std::set too, which in principle should be a tad
faster. How fast does it run? Also note that a node in a map is pretty
expensive because you have to allocate space for one parent and two child
pointers, and if sizeof(int)==sizeof(node*), this is 4 times the memory per
node, plus the cost of maintaining the pointers. I'm willing to bet you a
$1 that sorting 'a' with std::sort and using std::binary_search instead is
faster than map and uses less memory.

You can replace std::map with std::hash_map if you are using STL. Don't
know about the other implementations, probably something similar.

original algorithm:
What you originally wrote
Still not sure why array size is double, why the multiply by 2 at the end?
const size_t SIZE = ((sizeof(int)) * (INT16_MAX) * (2))

std::map:
The above algorithm

hash_map (probably largest:
The above algorithm with std::map replaced by hash_map. Maybe there is a
function to set the bucket size that you can try to tweak for optimal
performance?

binary_search:
Similar to the above algorithm.
void find(int* a,int c1,int* b,int c2){
std::sort(a, a+c1);
for(int j = 0; j < c2; j++)
if(std::binary_search(a, a+c1, b[j]))
std::cout<<b[j]<<std::endl;
}

N^2 method:
void find(int* a,int c1,int* b,int c2){
for(int j = 0; j < c2; j++)
if(std::find(a, a+c1, b[j]))
std::cout<<b[j]<<std::endl;
}

double sort:
void find(int* a,int c1,int* b,int c2){
std::sort(a, a+c1);
std::sort(b, b+c1);
int i =0, j = 0;
while (true) {
if (i==c1 || j==c2) break;
const int ai = a;
const int bj = b[j];
if (ai < bj) ++i;
else if (ai > bj) ++j;
else {
std::cout<<bj<<std::endl;
++i;
++j;
}
}
}
It's 1:30am, so I don't claim the above is error free!
 
S

Siemel Naran

void find(int* one,int c1,int* two,int c2)
{
std::set<int> common;
common.insert(one, one + c1);
common.insert(two, two + c2);
// output common elements to cout
std::copy(common.begin(), common.end(),
std::eek:stream_iterator<int>(std::cout, "\n"));
}

That is an O(N lg N) algorithm.

This is a good algorithm!
 
J

Jerry Coffin

[ The original code was: ]
void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two + INT16_MAX])
printf("%d\n",two);
free(buffer);
}


[ ... and you suggested replacing it with: ]
void find(int* one,int c1,int* two,int c2)
{
std::set<int> common;
common.insert(one, one + c1);
common.insert(two, two + c2);
// output common elements to cout
std::copy(common.begin(), common.end(),
std::eek:stream_iterator<int>(std::cout, "\n"));
}

At least as I read things, his code was finding (something similar to)
the intersection between the two sets, where it looks to me like yours
finds the union of them instead. If a true intersection of sets is
what he wants, he can use std::set_intersection to find it:

void find(int const *one, size_t c1, int const *two, size_t c2) {
std::set<int> a, b;

std::copy(one, one+c1, std::inserter(a, a.begin()));
std::copy(two, two+c2, std::inserter(b, b.begin()));
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
std::eek:stream_iterator<int>(std::cout, "\n"));
}

What his code produces isn't exactly the intersection between the sets
though. In a set_intersection, each item in the output is unique,
whereas his code can produce a particular number in the output the
same number of times it occurs in the second array, if it also occurs
in the first array (e.g. with his code and sample data, the output
will contain 4 twice). If that's really needed, something like this
will do the job with roughly O(N * lg N) commplexity, and O(N) extra
storage:

void find(int *one, size_t c1, int *two, size_t c2) {
std::set<int> a;

std::copy(one, one+c1, std::inserter(a, a.begin()));
for (int i=0; i<c2; ++i)
if (std::find(a.begin(), a.end(), two)!=a.end())
std::cout << two << "\n";
}

Theoretically, std::remove_copy_if should probably be used instead of
the loop I have above, but that may be more trouble than it's worth
(though boost::lambda could help reduce the extra trouble somewhat).
Later,
Jerry.
 
P

pembed2003

Siemel Naran said:
Prefer to use const size_t instead of #define.

The only advantage of using a const size_t is:

* We have type checking in compile time
* size_t is more portable

Are there any other advantages?
void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two + INT16_MAX])
printf("%d\n",two);
free(buffer);
}



I am trying to rewrite the malloc(SIZE) using:

int* buffer = new int[MAX_INT * 2];

but the compiler won't let me compile. I guess that's because MAX_INT
* 2 is outside the range of an integer and inside '[]' we can't have
anything bigger than an int. Is that correct? If so, how do you make a
really huge dynamic array:

int* buffer = new int[???];

What to put inside [] to make, say a 100000000 slots. This doesn't
work:

int* buffer = new int[10000000];

Seems like real disadvantage because malloc doesn't have that problem?

[snip]
original algorithm:
What you originally wrote
Still not sure why array size is double, why the multiply by 2 at the end?
const size_t SIZE = ((sizeof(int)) * (INT16_MAX) * (2))

The reason I need to multiple 2 is because if there are negative
number, I need to turn it into a positive number. For example:

-1234 is like:

array[-1234 + MAX_INT] = 1;

if I don't do that, I will end up:

array[-1234] = 1;

which is wrong. :-( Now consider we have:

1234, I need to make it so:

array[1234 + MAX_INT] = 1;

In other word, the range inside array is always:

array[-i + MAX_INT] to array[i + MAX_INT];

Suppose i = -MAX_INT, the whole thing still works because:

array[-MAX_INT + MAX_INT] to array[MAX_INT + MAX_INT];

array[0] to array[MAX_INT * 2];

so I need to multiply by 2. Maybe I am doing it wrong but if I don't
do that, how do I end up handling a negative number?

Thanks for all the help!
 
P

pembed2003

Kevin Goodsell said:
I don't think you really wanted to declare a function called 'm' that
takes no arguments and returns a std::map. Lose the parentheses.

That's it! Thanks for the help.
m.insert("hello",1); // doesn't compile? why?

I don't believe there is a form of map::insert that takes two arguments.
Try one of these:

m["hello"] = 1;

OR

m.insert(std::map<std::string, int>::value_type("hello", 1));

Very fancy! Now if I have a class Person, can I say:

Person p("first name","last name");
m.insert(std::map<Person, int>::value_type(p,1));

What functions does class Person has to define for this to work? Thanks!
 
K

Kevin Goodsell

pembed2003 said:
Very fancy! Now if I have a class Person, can I say:

Person p("first name","last name");
m.insert(std::map<Person, int>::value_type(p,1));

What functions does class Person has to define for this to work? Thanks!

Like all things stored in standard containers, it must meet the
requirements of Assignable and CopyConstructible types. Basically it has
to have normal value semantics.

In addition, to work as a map key, you need to supply a comparison
operation (that determines the ordering in the map). This defaults to
operator<(), so if that's defined it will work. You can also supply your
comparison criteria as a template parameter.

-Kevin
 
S

Siemel Naran

pembed2003 said:
The only advantage of using a const size_t is:

* We have type checking in compile time
* size_t is more portable

Are there any other advantages?

Also a practical one, you can usually see it in a debugger. I think there
are a few other ones.

I am trying to rewrite the malloc(SIZE) using:

int* buffer = new int[MAX_INT * 2];

but the compiler won't let me compile. I guess that's because MAX_INT
* 2 is outside the range of an integer and inside '[]' we can't have
anything bigger than an int. Is that correct? If so, how do you make a
really huge dynamic array:

Don't know how. Even std::malloc has this problem because its signature is

void * malloc(size_t);

but the compiler probably only checks the array size constraints when it
sees square brackets [...].

Just wondering, isn't INT_MAX*2 about the entire memory in the system, about
4GB on 32 bit int systems? Or how do these things work?

The reason I need to multiple 2 is because if there are negative
number, I need to turn it into a positive number. For example:

Got it. You could also convert the int to unsigned and use array size =
UINT_MAX.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,164
Messages
2,570,898
Members
47,440
Latest member
YoungBorel

Latest Threads

Top