HashSet keeps all nonidentical equal objects in memory

F

Frederik

Hi,

I've been doing java programming for over 10 years, but now I've
encoutered a phenomenon that I wasn't aware of at all.
I had an application in which I have a HashSet<String>. I added a lot
of different String objects to this HashSet, but many of the String
objects are equal to each other. Now, after a while my application ran
out of memory, even with -Xmx1500M. This happened when there were only
about 7000 different Strings in the set! I didn't understand this,
until I started adding the "intern()" of every String object to the
set instead of the original String object. Now the program needs
virtually no memory anymore.
There is only one explanation: before I used "intern()", ALL the
different String objects, even the ones that are equal, were kept in
memory by the HashSet! No matter how strange it sounds. I was
wondering, does anybody have an explanation as to why this is the case?
 
F

Frederik

Hi,

I've been doing java programming for over 10 years, but now I've
encoutered a phenomenon that I wasn't aware of at all.
I had an application in which I have a HashSet<String>. I added a lot
of different String objects to this HashSet, but many of the String
objects are equal to each other. Now, after a while my application ran
out of memory, even with -Xmx1500M. This happened when there were only
about 7000 different Strings in the set! I didn't understand this,
until I started adding the "intern()" of every String object to the
set instead of the original String object. Now the program needs
virtually no memory anymore.
There is only one explanation: before I used "intern()", ALL the
different String objects, even the ones that are equal, were kept in
memory by the HashSet! No matter how strange it sounds. I was
wondering, does anybody have an explanation as to why this is the case?

I did another experiment in which I generated a large file with the
same string on every line, and then wrote a program that reads in the
file and adds every line String to a HashSet. With that, I cannot
reproduce the problem. So in my other code I must be doing something
very specific which triggers this problem. I don't know what it is,
but never mind.
 
E

Eric Sosman

Hi,

I've been doing java programming for over 10 years, but now I've
encoutered a phenomenon that I wasn't aware of at all.
I had an application in which I have a HashSet<String>. I added a lot
of different String objects to this HashSet, but many of the String
objects are equal to each other. Now, after a while my application ran
out of memory, even with -Xmx1500M. This happened when there were only
about 7000 different Strings in the set! I didn't understand this,
until I started adding the "intern()" of every String object to the
set instead of the original String object. Now the program needs
virtually no memory anymore.
There is only one explanation: before I used "intern()", ALL the
different String objects, even the ones that are equal, were kept in
memory by the HashSet! No matter how strange it sounds. I was
wondering, does anybody have an explanation as to why this is the case?

I'm unable to reproduce your problem (see test program below).
Perhaps you've overlooked another possible explanation: Before you
switched to using intern(), maybe you were retaining your own
references to all those Strings accidentally.

Here's my test program: It inserts twenty thousand distinct but
identical Strings into a HashSet, pausing every now and then to
report how much memory is used (with some heavy-handed attempts to
force garbage collection):

package esosman.misc;
import java.util.HashSet;

public class HashSpace {

public static void main(String[] unused) {
HashSet<String> set = new HashSet<String>();
String value = "x";
for (int n = 0; n < 20; ++n) {
report(n * 1000);
for (int i = 0; i < 1000; ++i) {
value = (value + "x").substring(1);
set.add(value);
}
}
report(20 * 1000);
}

private static void report(int insertions) {
long memUsed = runtime.totalMemory() - runtime.freeMemory();
long memPrev = Long.MAX_VALUE;
for (int gc = 0; (memUsed < memPrev) && gc < 5; ++gc) {
runtime.runFinalization();
runtime.gc();
Thread.yield();
memPrev = memUsed;
memUsed = runtime.totalMemory() - runtime.freeMemory();
}
System.out.printf("After %d insertions, memory used = %d\n",
insertions, memUsed);
}

private static final Runtime runtime = Runtime.getRuntime();
}

.... and here's what I get for output:

After 0 insertions, memory used = 125656
After 1000 insertions, memory used = 133272
After 2000 insertions, memory used = 133664
After 3000 insertions, memory used = 133272
After 4000 insertions, memory used = 133312
After 5000 insertions, memory used = 133272
After 6000 insertions, memory used = 133312
After 7000 insertions, memory used = 133272
After 8000 insertions, memory used = 133312
After 9000 insertions, memory used = 133272
After 10000 insertions, memory used = 133312
After 11000 insertions, memory used = 133272
After 12000 insertions, memory used = 133312
After 13000 insertions, memory used = 133448
After 14000 insertions, memory used = 133840
After 15000 insertions, memory used = 133448
After 16000 insertions, memory used = 133488
After 17000 insertions, memory used = 133272
After 18000 insertions, memory used = 133312
After 19000 insertions, memory used = 133272
After 20000 insertions, memory used = 133312

I see no evidence that all those String instances are being
retained anywhere: They need ~24 bytes apiece, which would come
to about half a megabyte.
 
M

markspace

With that, I cannot
reproduce the problem. So in my other code I must be doing something
very specific which triggers this problem. I don't know what it is,
but never mind.


Ayup. As usual it's in the code we are not shown. Good job working out
the answer yourself though.

Hmm, looking at your problem description again, can you verify that when
the problem occurs, there are much more than 7000 strings in the set?
I.e., the set is really retaining all of the objects, or is there a
memory leak somewhere else?

It's possible to make strings such that they hold memory and cannot
release it. Are you calling substring() a lot to create the strings?
If so they could be holding memory from a larger string. If the
substrings are equal and not retained, but the large string is held in
memory because the smaller string will contain a pointer to it, this
could occur.

If so, try calling "new String( substr )" on the sub-strings to make an
entirely new object instead of intern(). If this also works, that was
likely your problem.

Try running the code under a profiler, with memory profiling enabled
(and the problem verify to occur first), and post the results.



Other ideas that are probably not as workable:

Make sure it's really HashSet and not some derived (and broken)
implementation. There are also things called IdentityHashSet out there
in the wild that will give you the behavior are seeing.


Make sure it's really String and not something else. I don't know how
this could happen but it is pretty common when putting user defined
objects to mess up the hashcode() and equals() methods, which will break
containers like HashSet and HashMap.
 
R

Robert Klemme

I've been doing java programming for over 10 years, but now I've
encoutered a phenomenon that I wasn't aware of at all.

Apparently you didn't - as you found out in the meantime. :)
I had an application in which I have a HashSet<String>. I added a lot
of different String objects to this HashSet, but many of the String
objects are equal to each other. Now, after a while my application ran
out of memory, even with -Xmx1500M. This happened when there were only
about 7000 different Strings in the set! I didn't understand this,
until I started adding the "intern()" of every String object to the
set instead of the original String object. Now the program needs
virtually no memory anymore.
There is only one explanation: before I used "intern()", ALL the
different String objects, even the ones that are equal, were kept in
memory by the HashSet! No matter how strange it sounds. I was
wondering, does anybody have an explanation as to why this is the case?

No, that conclusion is not warranted by the facts. You only know that
*something* kept hold of a lot of memory (String instances). Since we
do neither know all the code nor do we know the application
architecture we can only speculate but it seems a realistic assumption
that those String instances are not only kept by the HashSet but
somewhere else.

An easy way you can create such a situation is that you are reading
from some external source (file) repeated content and create an object
which - among other things - holds the String. Now you have 1,000,000
objects holding on to 1,000,000 String instances but there are only
7,000 different character sequences. In such a situation it may be
better to have a HashMap<String,String> where you store the String
only once and reuse that first instance. Basically this is what
happened when you used String.intern() only that you do not have
control over this storage any more which - depending on application
type - can still create a serious memory leak, e.g. long running app
which over time reads multiple files with different sets of repeated
strings.

Kind regards

robert
 
L

lewbloch

Apparently you didn't - as you found out in the meantime. :)


No, that conclusion is not warranted by the facts.  You only know that
*something* kept hold of a lot of memory (String instances).  Since we
do neither know all the code nor do we know the application
architecture we can only speculate but it seems a realistic assumption
that those String instances are not only kept by the HashSet but
somewhere else.

An easy way you can create such a situation is that you are reading
from some external source (file) repeated content and create an object
which - among other things - holds the String.  Now you have 1,000,000
objects holding on to 1,000,000 String instances but there are only
7,000 different character sequences.  In such a situation it may be
better to have a HashMap<String,String> where you store the String
only once and reuse that first instance.  Basically this is what
happened when you used String.intern() only that you do not have
control over this storage any more which - depending on application
type - can still create a serious memory leak, e.g. long running app
which over time reads multiple files with different sets of repeated
strings.

To highlight one of Robert's points much more specifically,
undisciplined use of 'intern()' can create memory pressure itself.
It's not really a good idea to intern every single 'String' because
that uses up the intern space and disables GC to clean up dead
strings.

Sweeping dirt under the carpet makes dirt less visible, not the floor
clean.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,982
Messages
2,570,190
Members
46,740
Latest member
AdolphBig6

Latest Threads

Top