Mike Thompson
The interning of strings has me puzzled. It seems to happen sometimes,
but not others. I can't discern the pattern and I can't seem to find
documentation regarding it.
I can find documentation of a builtin called 'intern' but its use seems
frowned upon these days.
For example, using py2.3.3, I find that string interning does seem to
happen sometimes ...
True
And it even happens in this case ...
>>> s = "aa"
>>> s1 = s[:1]
>>> s2 = s[-1:]
>>> s1, s2
('a', 'a')
>>> s1 is s2
True
But not in what appears an almost identical case ...
>>> s = "the the"
>>> s1 = s[:3]
>>> s2 = s[-3:]
>>> s1, s2
('the', 'the')
>>> s1 is s2
False
BUT, oddly, it does seem to happen here ...
>>> class X:
...     pass
...
>>> x = X()
>>> y = "the"
>>> x.the = 42
>>> x.__dict__
{'the': 42}
>>> y is x.__dict__.keys()[0]
True
Are there any language rules regarding when strings are interned and
when they are not? Should I be ignoring the apparent poor status of
'intern' and using it anyway? At worst, are there any CPython 'accidents
of implementation' that I can take advantage of?
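For reference, a minimal sketch of what explicit interning guarantees, whatever the interpreter happens to do automatically. It assumes a Python 3 interpreter, where the builtin lives at `sys.intern`; on Python 2 the builtin `intern` plays the same role.

```python
import sys

s = "the the"

# Explicitly intern each slice that was built at runtime.
s1 = sys.intern(s[:3])
s2 = sys.intern(s[-3:])

# Equal strings passed through intern come back as one shared object,
# so the identity test no longer depends on implementation accidents.
same_object = s1 is s2
print(same_object)
```

Without the explicit calls, whether the two slices share an object is an implementation detail and may differ between versions.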
Why do I need this? Well, I have to read in a very large XML document
and convert it into objects, and within the document many attributes
have common string values. To reduce the memory footprint, I'd like to
intern these commonly referenced strings AND I'm wondering how much work
I need to do, and how much will happen automatically.
Any insights appreciated.
BTW, I'm aware that I can do string interning myself using a dict cache
(which is what ElementTree does internally). But this whole subject
has got me curious now, and I'd like to understand it a bit better. Would,
for example, using the builtin 'intern' give a better result than my
hand-coded interning?
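To make the comparison concrete, here is a minimal sketch of the kind of hand-coded dict cache meant above. The names `_cache` and `intern_value` are hypothetical, chosen only for illustration, and this is not a claim about how ElementTree implements it.

```python
# Hypothetical hand-rolled interning table: maps each string to the first
# object seen with that value.
_cache = {}

def intern_value(value):
    # setdefault returns the already-stored object when an equal key exists;
    # otherwise it stores `value` and returns it.
    return _cache.setdefault(value, value)

# Build equal strings at runtime so they start out as separate candidates.
a = intern_value("".join(["t", "h", "e"]))
b = intern_value("the the"[-3:])

shared = a is b          # both names now refer to the cached object
entries = len(_cache)    # only one entry was stored
print(shared, entries)
```

One design point in favor of the dict: the table is an ordinary object, so it can be cleared or dropped once the document has been read.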