On Friday, 3 January 2014 12:14:41 UTC+1, Robin Becker wrote:
indeed
No, previously we had a default of utf8-encoded strings in the lower levels of the
code and we accepted either unicode or utf8 string literals as inputs to text
functions. As part of the port process we made the decision to change from
default utf8 str (bytes) to default unicode.
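i.e. text inputs were funnelled through something like this (the helper name is made up here, not our actual code):

def to_unicode(value, encoding='utf8'):
    # accept either utf8-encoded bytes or unicode text, return unicode
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value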
It's made no real difference to what we are able to produce or accept, since utf8
or unicode can encode anything in the input, and what can be produced depends
mainly on fonts.
I'm not sure if we have any non-BMP characters in the tests. Simple CJK etc etc
for the most part. I'm fairly certain we don't have any ability to handle
composed glyphs (multi-codepoint) etc etc
....
----
To Robin Becker
I know nothing about ReportLab except its existence.
Your story is very interesting. As I pointed out, I know
nothing about the internals of ReportLab or its technical
aspects (the "Python part", the API used for PDF creation).
I do, however, have some experience with the unicode TeX engine,
XeTeX, so I understand a little bit of what is
happening behind the scenes.
The very interesting aspect is the way you are holding
unicode (strings). By comparing Python 2 with Python 3.3,
you are comparing utf-8 with the internal "representation"
of Python 3.3 (the flexible string representation).
In one sense, that is more than comparing Py2 with Py3.
It would be much more interesting to compare utf-8/Python
internals in the light of Python 3.2 versus Python 3.3. Python
3.2 has a decent unicode handling; Python 3.3 has an absurd
(in the mathematical sense) unicode handling. This really
shows with utf-8, where the flexible string representation
does just the opposite of what a correct unicode
implementation does!
On the memory side, it is easy to see.
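For instance, a quick look with sys.getsizeof() shows the pattern (a minimal sketch; exact byte counts vary by CPython build and platform):

import sys

ascii_s = 'a' * 10000                 # pure ASCII: stored as 1 byte per character
latin_s = 'a' * 10000 + '\xe9'        # adds 'é' (latin-1): still 1 byte per character
bmp_s   = 'a' * 10000 + '\u20ac'      # adds '€' (BMP): whole string goes to 2 bytes/char
wide_s  = 'a' * 10000 + '\U0001f600'  # adds one non-BMP character: 4 bytes/char

for s in (ascii_s, latin_s, bmp_s, wide_s):
    print(len(s), sys.getsizeof(s))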
On the performance side, it is much more complex,
but qualitatively you may expect the same results.
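The kind of micro-benchmark under discussion looks like this (again only a sketch; the numbers, and even the direction of the difference, depend on the interpreter version and build):

import timeit

# concatenation that stays in the 1-byte internal representation...
print(timeit.timeit("'a' * 1000 + 'z'"))
# ...versus one that forces the result into a wider representation
print(timeit.timeit("'a' * 1000 + '\u20ac'"))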
The funny aspect is that by working with utf-8 in that
case, one forces Python to work
properly, but one pays for it in performance.
And if one wishes to save memory, one again has to pay
in performance.
In other words, attempting to do what Python is
not able to do natively is just impossible!
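To make that trade-off concrete (a sketch with an arbitrary test string; exact figures vary):

import sys
import timeit

text = 'a' * 10000 + '\U0001f600'  # one non-BMP char: the str uses 4 bytes/char
data = text.encode('utf8')         # utf-8: ~1 byte/char for the ASCII part
print(sys.getsizeof(text), sys.getsizeof(data))  # the str is roughly 4x larger
# ...but working from utf-8 means paying for a decode every time:
print(timeit.timeit(lambda: data.decode('utf8'), number=10000))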
I'm skipping the very interesting subject of composed glyphs
(unicode normalization, ...), but I wish to point out that
with the flexible string representation one reaches
the top level of surrealism, for a tool which is supposed
to handle these very specific unicode tasks...
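Just to illustrate what "composed" means here (a minimal sketch using the standard unicodedata module):

import unicodedata

s = 'e\u0301'                          # 'e' plus a combining acute accent
nfc = unicodedata.normalize('NFC', s)  # composed form: the single code point 'é'
print(len(s), len(nfc))                # 2 1
print(nfc == '\xe9')                   # True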
jmf