performance of HashSet with strings and integers

F

Frederik

Hi,

I thought of replacing strings with integers in some code I wrote,
because I believed it would benefit performance. But before doing so,
I made a small test class:

public class testStringSetVsIntSet {
public static void main(String[] args) {
long time;
boolean b;
Set set;

Integer I1 = new Integer(100), I2 = new Integer(500);

set = new HashSet();
set.add(I1);
set.add(900);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(I1);
b = set.contains(I2);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 1: " + time);

String headStr = "Head";
String agentStr = "Agent";
String qualifStr = "Qualif";

set = new HashSet();
set.add(headStr);
set.add(agentStr);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(headStr);
b = set.contains(qualifStr);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 2: " + time);
}
}

But to my surprise, the second loop with the strings appeared to be
twice as fast as the first one with the integers! (first loop 3
seconds, second 1.5 seconds)
I didn't expect this because calculating the hashcode for a string is
normally slower than for an integer (every string character is taken
into account) and I thought the "equals" method for a string should be
slower than for an Integer as well.
Can anybody explain this to me?
 
M

Mike Schilling

Frederik said:
Hi,

I thought of replacing strings with integers in some code I wrote,
because I believed it would benefit performance. But before doing
so,
I made a small test class:

public class testStringSetVsIntSet {
public static void main(String[] args) {
long time;
boolean b;
Set set;

Integer I1 = new Integer(100), I2 = new Integer(500);

set = new HashSet();
set.add(I1);
set.add(900);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(I1);
b = set.contains(I2);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 1: " + time);

String headStr = "Head";
String agentStr = "Agent";
String qualifStr = "Qualif";

set = new HashSet();
set.add(headStr);
set.add(agentStr);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(headStr);
b = set.contains(qualifStr);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 2: " + time);
}
}

But to my surprise, the second loop with the strings appeared to be
twice as fast as the first one with the integers! (first loop 3
seconds, second 1.5 seconds)
I didn't expect this because calculating the hashcode for a string
is
normally slower than for an integer (every string character is taken
into account) and I thought the "equals" method for a string should
be
slower than for an Integer as well.
Can anybody explain this to me?

1. Since Strings are immutable, their hash code is calculated only
once and then stored in the String object.
2. Since your strings are of different lengths, they're discovered to
be unequal almost immediately (the second thing equals() checks is the
number of characters).
3. When contains() returns true, its parameter is nott just equal to
the one in the set but the same object; thus it's discovered to be
equal even faster (since the first thing equals() checks is "this ==
param").
4.This one is just a guess, but I wouldn't be surprised if the Integer
version became faster if you reversed the order. In general, Java
gets faster as it goes and the JIT can optimize the code.
 
P

Pascal Lecointe

The equals method only gets called if the HashSet contains an element
whose hash code is equal to the probe's hash code but that is not the
same object as the probe. That is very unlikely in a test with so few
objects.

Not really. If I remember correctly, the equals is always called
(because same
hash not imply same object). But in this case, the equals first test
that the
references are the same. As the Strings of the example are all in the
constant pool,
the references are the same, and so the equals test is fast.
I strongly suspect that your real use of the HashSet is significantly
different from your benchmark. The large number of contains calls with
the same key, the small number of distinct strings, and the lack of any
case in which you have two distinct but equal objects may all affect the
results.

+1
 
M

Mike Schilling

Patricia said:
The equals method only gets called if the HashSet contains an
element
whose hash code is equal to the probe's hash code but that is not
the
same object as the probe. That is very unlikely in a test with so
few
objects.

Does HashSet check == rather than letting the equals() method do it?
So it does. You learn something new every day. Though in the case,
as I mentioned earlier, String.equals() would be as fast as
Integer.equals().
 
F

Frederik

4.This one is just a guess, but I wouldn't be surprised if the Integer
version became faster if you reversed the order.  In general, Java
gets faster as it goes and the JIT can optimize the code.

I tried that, but that is not the case, the version with the strings
remains twice as fast.

Thank you both for your help.
 
M

Mike Schilling

Frederik said:
I tried that, but that is not the case, the version with the strings
remains twice as fast.

In that case, I'd guess you picked numbers that hash to the same
bucket. Yup. The initial capacity of a HashSet is 16, and all the
numbers equal 0 mod 16.
 
L

Lew

Pascal Lecointe erroneously claimed:
Not really. If I remember correctly, the equals is always called

You don't remember correctly.
(because same hash not imply same object).

No, but different hash codes do imply different objects. Note that
Patricia pointed out that 'equals()' is only called if the hash codes
are the same. It is not called if the hash codes differ, because in
that case 'equals()' would always return 'false', so there's no point.
But in this case, the equals first test
that the
references are the same. As the Strings of the example are all in the
constant pool,
the references are the same, and so the equals test is fast.

Really, really fast in the OP's example, because 'equals()' is not
called.

It has nothing to do with the Strings being in the constant pool.
 
L

Lew

The match test in HashMap's getEntry is such that either hash code
mismatch or reference equality prevents execution of the equals call.
The implementation of the HashSet contains uses the HashMap containsKey,
which uses getEntry.

Note also that typical implementations of 'equals()' that do not do
mere reference equality do check reference equality first, and only
compare attributes if the references differ. Of course, there's no
guarantee that any particular 'equals()' implementation does this. In
the case of the logic just described, it may well be that reference
equality is checked twice - once by the 'Map' and once by the element
- before value equality is finally invoked.

To review:

HashMap equality checks:
hashCode(): if unequal, no need to continue
reference == : if 'true', no need to continue
value equality: return result
 
R

Roedy Green

Set set;
Integer I1 = new Integer(100), I2 = new Integer(500);
set = new HashSet();

Aside from the main thrust of your question, you should code this with
generics and autoboxing like this:

final Set<Integer> set = new HashSet<Integer>(7);
// give it a size estimate.
Integer i1 = 100; // follow caps naming conventions
Integer i2 = 500; // use autoboxing for brevity
--
Roedy Green Canadian Mind Products
http://mindprod.com

I advocate that super programmers who can juggle vastly more complex balls than average guys can, should be banned, by management, from dragging the average crowd into system complexity zones where the whole team will start to drown.
~ Jan V.
 
K

Kevin McMurtrie

Frederik said:
Hi,

I thought of replacing strings with integers in some code I wrote,
because I believed it would benefit performance. But before doing so,
I made a small test class:

public class testStringSetVsIntSet {
public static void main(String[] args) {
long time;
boolean b;
Set set;

Integer I1 = new Integer(100), I2 = new Integer(500);

set = new HashSet();
set.add(I1);
set.add(900);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(I1);
b = set.contains(I2);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 1: " + time);

String headStr = "Head";
String agentStr = "Agent";
String qualifStr = "Qualif";

set = new HashSet();
set.add(headStr);
set.add(agentStr);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(headStr);
b = set.contains(qualifStr);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 2: " + time);
}
}

But to my surprise, the second loop with the strings appeared to be
twice as fast as the first one with the integers! (first loop 3
seconds, second 1.5 seconds)
I didn't expect this because calculating the hashcode for a string is
normally slower than for an integer (every string character is taken
into account) and I thought the "equals" method for a string should be
slower than for an Integer as well.
Can anybody explain this to me?

A few things:

- Strings cache their hash value in recent JVMs.
- The performance of hashing varies for individual keys based on
collisions and locations within the collision chain.
- All of your map hits can use the fast '==' test rather than equals(),
making collisions an unusually large factor of performance.
- Hotspot was compiling during loop 1. Call the test method twice and
the second results will probably be different.

nitpick: Integer.valueOf(n) and autoboxing returns pooled objects for a
range of values so it's sometimes much more efficient than the Integer
constructor. It's not in the loop so it doesn't matter here.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,150
Members
46,696
Latest member
BarbraOLog

Latest Threads

Top