interning strings

Mike Thompson

The interning of strings has me puzzled. It seems to happen sometimes,
but not others. I can't discern the pattern and I can't seem to find
documentation regarding it.

I can find documentation of a builtin called 'intern' but its use seems
frowned upon these days.

For example, using py2.3.3, I find that string interning does seem to
happen sometimes ...
True

And it even happens in this case ...
>>> s = "aa"
>>> s1 = s[:1]
>>> s2 = s[-1:]
>>> s1, s2
('a', 'a')
>>> s1 is s2
True

But not in what appears an almost identical case ...
>>> s = "the the"
>>> s1 = s[:3]
>>> s2 = s[-3:]
>>> s1, s2
('the', 'the')
>>> s1 is s2
False

BUT, oddly, it does seem to happen here ...
>>> class X:
...     pass
...
>>> x = X()
>>> y = "the"
>>> x.the = 42
>>> x.__dict__
{'the': 42}
>>> y is x.__dict__.keys()[0]
True


Are there any language rules regarding when strings are interned and
when they are not? Should I be ignoring the apparent poor status of
'intern' and using it anyway? At worst, are there any CPython 'accidents
of implementation' that I can take advantage of?

Why do I need this? Well, I have to read in a very large XML document
and convert it into objects, and within the document many attributes
have common string values. To reduce the memory footprint, I'd like to
intern these commonly referenced strings, and I'm wondering how much work
I need to do and how much will happen automatically.

Any insights appreciated.

BTW, I'm aware that I can do string interning myself using a dict cache
(which is what ElementTree does internally). But, this whole subject
has got me curious now, and I'd like to understand a bit better. Would,
for example, using the builtin 'intern' give a better result than my
hand coded interning?
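
For readers following along, the hand-coded dict cache mentioned above can be sketched roughly as follows. This is only an illustration, not ElementTree's actual code; the class name `StringCache` and the method `canonical` are made up for this example.

```python
class StringCache:
    """Hand-rolled interning: map each distinct string to one canonical object.

    Hypothetical helper for illustration; not taken from any library.
    """

    def __init__(self):
        self._seen = {}

    def canonical(self, s):
        # setdefault stores s the first time it is seen and returns the
        # stored object on every later call with an equal string.
        return self._seen.setdefault(s, s)


cache = StringCache()
# Build two equal strings at run time rather than writing them as literals.
first = "".join(["pend", "ing"])
second = "".join(["pen", "ding"])
a = cache.canonical(first)
b = cache.canonical(second)
print(a == b, a is b)
```

By construction both calls return the object stored on the first call, so `a is b` holds regardless of what the interpreter interns on its own. The cache keeps every string it has seen alive for as long as the cache itself lives.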
 
Peter Otten

Mike Thompson said:
The interning of strings has me puzzled. It seems to happen sometimes,
but not others. I can't discern the pattern and I can't seem to find
documentation regarding it.

Strings of length < 2 are always interned:
>>> a = ""
>>> a is ""
True
>>> a = " "
>>> a is " "
True
>>> "aname"[1] is "n"
True

String constants that are potential attribute names are also interned:
False

...although the algorithm to determine whether a string constant could be a
name is simplistic (see all_name_chars() in compile.c, or believe that it
does what its name suggests):
True

Strings that are otherwise created are not interned:
False

By the way, why would you want to mess with these implementation details?
Use the == operator to compare strings and be happy ever after :)

Peter
 
Martin v. Löwis

Peter said:
String constants that are potential attribute names are also interned:

Peter has explained all this correctly, but this aspect needs some
stressing perhaps: string *literals* that are potential attribute
names are also interned. This interning is done in the compiler,
when the code object is created, so strings not created by the compiler
are not interned.

[all strings are "constant", i.e. immutable, so the statement
above might have been confusing]
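
The distinction between compiler-created literals and strings built at run time can be illustrated with a small sketch. It assumes a Python 3 interpreter, where the builtin discussed in this thread lives at `sys.intern` (in the 2.3 series it is the plain builtin `intern`). The identity of the un-interned slices is an implementation detail, so no particular result is claimed for it.

```python
import sys

# Two equal strings produced by slicing at run time. The compiler never
# sees them as literals, so it has no chance to intern them.
s = " ".join(["the", "the"])
s1 = s[:3]
s2 = s[-3:]

print(s1 == s2)   # equal by value
print(s1 is s2)   # implementation detail; do not rely on either answer

# Explicit interning is the only way to get a guaranteed shared object.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
print(i1 is i2)
```

The last comparison is the only identity test here that the language documentation backs: interning equal strings yields the same object.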

Regards,
Martin
 
Mike Thompson

[snip very useful explanation]
By the way, why would you want to mess with these implementation details?
Use the == operator to compare strings and be happy ever after :)

'==' won't help me, I'm afraid.

I need to improve the speed and memory footprint of an application which
reads in a very large XML document.

Some elements in the incoming documents can be filtered out, so I've
written my own SAX handler to extract just what I want. All the same,
the content being read in is substantial.

So, to further reduce memory footprint, my SAX handler tries to manually
intern (using dicts of strings) a lot of the duplicated content and
attributes coming from the XML documents. Also, I use the SAX feature
'feature_string_interning' to hopefully intern the strings used for
attribute names etc.

Which is all working fine, except that now, as a final process, I'd like
to understand interning a bit more.

From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

However, I still think I'm going to try using the builtin 'intern'
rather than my own dict cache. That may provide an advantage, even if it
doesn't work with unicode.
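
A rough sketch of the setup described above, for anyone wanting to try it. It assumes Python 3 (`sys.intern` rather than the old builtin) and the standard-library expat-based SAX parser. The element name `item`, the attribute name `status`, and the names `InterningHandler` and `parse_items` are invented for the example and are not from the original application.

```python
import io
import sys
import xml.sax
from xml.sax.handler import ContentHandler, feature_string_interning


class InterningHandler(ContentHandler):
    """Collect attributes of <item> elements, interning each value."""

    def __init__(self):
        super().__init__()
        self.records = []

    def startElement(self, name, attrs):
        if name != "item":
            # Filter out elements we are not interested in.
            return
        # Intern every attribute value so equal values share one object.
        self.records.append(
            {key: sys.intern(value) for key, value in attrs.items()}
        )


def parse_items(xml_bytes):
    parser = xml.sax.make_parser()
    # Ask the parser to intern element and attribute names where it can.
    parser.setFeature(feature_string_interning, True)
    handler = InterningHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.records


doc = b'<log><item status="pending"/><skip/><item status="pending"/></log>'
records = parse_items(doc)
print(records[0]["status"] is records[1]["status"])
```

The SAX feature only concerns names handed out by the parser; the attribute values still have to be interned by the handler itself, which is what the dict comprehension does.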
 
Jean Brouwers

A while ago, we faced a similar issue, trying to reduce total memory
usage and runtime of one of our Python applications which parses very
large log files (100+ MB).

One particular class is instantiated many times, and changing just that
class to use __slots__ helped quite a bit. More details are here:

<http://mail.python.org/pipermail/python-list/2004-May/220513.html>
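
As a minimal sketch of the __slots__ approach (not the class from the linked post; the class name `LogRecord` and its field names are made up here), assuming Python 3. In the 2.x series the class would also need to derive from `object` for __slots__ to take effect.

```python
class LogRecord:
    """Hypothetical record type; the field names are illustrative only."""

    # With __slots__, each instance stores these fields in fixed slots
    # instead of carrying its own per-instance __dict__.
    __slots__ = ("timestamp", "level", "message")

    def __init__(self, timestamp, level, message):
        self.timestamp = timestamp
        self.level = level
        self.message = message


rec = LogRecord("t0", "INFO", "started")
print(hasattr(rec, "__dict__"))
```

The saving comes from dropping the per-instance dictionary, so it matters most for classes with very many small instances. The trade-off is that attributes not listed in __slots__ can no longer be assigned.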

/Jean Brouwers
ProphICy Semiconductor, Inc.



 
Tim Peters

[Mike Thompson]
...
From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

String interning is purely an optimization. Python added the concept
to speed its own name lookups, and the rules it uses for
auto-interning are effective for that. It wasn't necessary to expose
the interning facilities to users to meet its goal, and, especially
since interned strings were originally immortal, it would have been a
horrible idea to intern all strings. The machinery was exposed just
because it's Pythonic to expose internals when reasonably possible.
There wasn't, and shouldn't be, an expectation that exposed internals
will be perfectly suited as-is to arbitrary applications.

However, I still think I'm going to try using the builtin 'intern' rather than my
own dict cache.

That's fine -- that's why it got exposed. Don't assume that any
string is interned unless you explicitly intern() it, and you'll be
happy (and it doesn't hurt to intern() a string that's already
interned -- you just get back a reference to the already-interned copy
then).
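
A small sketch of that rule, assuming Python 3 where the function is spelled `sys.intern` (the builtin `intern` in the 2.x series behaves the same way for this purpose):

```python
import sys

# Built at run time, so nothing has interned this particular object for us.
raw = "".join(["sta", "tus"])

once = sys.intern(raw)
# Interning an already-interned string is harmless: the same object comes back.
twice = sys.intern(once)

# A second, separately built equal string maps to the same interned object.
other = sys.intern("".join(["stat", "us"]))
print(once is twice, once is other)
```

Note that `once` is not guaranteed to be the `raw` object itself: if an equal string was interned earlier, that earlier object is returned instead, which is why the result of the call should always be the one kept.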

[earlier]
I can find documentation of a builtin called 'intern' but its use seems
frowned upon these days.

Not by me, but it's never been useful to *most* apps, apart from the
indirect benefits they get from Python's internal uses of string
interning. It's rare that an app really wants some strings stored
uniquely, and possibly never that an app wants all strings stored
uniquely. Most apps that use explicit string interning appear to be
looking for no more than a partial workalike for Lisp symbols.
 
Peter Otten

Mike Thompson said:
'==' won't help me, I'm afraid.

I need to improve the speed and memory footprint of an application which
reads in a very large XML document.

Yes, I should have read your post carefully. But I was preoccupied with
speed...

From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

With arbitrary strings the likelihood of a cache hit decreases fast. Using
your own dictionary and checking the refcounts could give you interesting
insights. Unfortunately there is no WeakDictionary with both keys and
values as weakrefs, so you have to do some work, or you will actually
_increase_ memory footprint.
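
One way to get a feel for the cache-hit question is to count hits and misses in a dict-based cache. The sketch below is an assumption about how that could be done, not code from this thread; `CountingCache` and its methods are hypothetical names, and the sample values are made up.

```python
class CountingCache:
    """Dict-based interning cache that records hits and misses."""

    def __init__(self):
        self._seen = {}
        self.hits = 0
        self.misses = 0

    def canonical(self, s):
        try:
            found = self._seen[s]
        except KeyError:
            # First sighting: remember this object as the canonical copy.
            self._seen[s] = s
            self.misses += 1
            return s
        self.hits += 1
        return found

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = CountingCache()
values = ["pending", "done", "pending", "pending", "failed", "done"]
canon = [cache.canonical(v) for v in values]
print(cache.hits, cache.misses, cache.hit_rate())
```

A low hit rate on real data suggests the cache costs more than it saves, because the dict holds a strong reference to every string it has ever seen. That is the footprint risk described above.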

However, I still think I'm going to try using the builtin 'intern'
rather than my own dict cache. That may provide an advantage, even if it
doesn't work with unicode.

You might at least choose an alias

my_intern = intern

then, lest you later regret that limitation.

Peter
 
Peter Otten

Martin v. Löwis said:
Peter said:
String constants that are potential attribute names are also interned:

Peter has explained all this correctly, but this aspect needs some
stressing perhaps: string *literals* that are potential attribute
names are also interned. This interning is done in the compiler,
when the code object is created, so strings not created by the compiler
are not interned.

[all strings are "constant", i.e. immutable, so the statement
above might have been confusing]

Yes, string "literal", not "constant" is the appropriate term for what I
meant.
For completeness here is an example demonstrating that names appearing as
"bare words" in the code are interned:
...     def __getattr__(self, name):
...         return name
...
True

Peter
 
