regenerating unicodedata for py2.7 using py3 makeunicodedata.py?

V

Vlastimil Brom

Hi all,
I'd like to ask about a surprising possibility I found while
investigating the new unicode 6.0 standard for use in python.
As python 2 series won't be updated in this regard
( http://bugs.python.org/issue10400 ),
I tried my "poor man's approach" of compiling the needed pyd file with
the recent unicode data (cf. the older post
http://mail.python.org/pipermail/python-list/2010-March/1240002.html )
While checking the changed format, i found to my big surprise, that it
is possible to generate the header files using the py3
makeunicodedata.py
which has already been updated for Unicode 6.0; this is even much more
comfortable than the previous versions, as the needed data are
downloaded automatically.
http://svn.python.org/view/python/b.../makeunicodedata.py?view=markup&pathrev=85371
It turned out, that the resulting headers are accepted by MS Visual
C++ Express along with the py2.7 source files
and that the generated unicodedata.pyd seems to be working work at
least in the cases I tested sofar.

Is this intended or even guaranteed for these generated files to be
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?

The newly added ranges and characters are available, only in the CJK
Unified Ideographs Extension D the character names are not present
(while categories are), but this appears to be the same in the
original unicodedadata with 5.2 on CJK Unified Ideographs Extension C.
Traceback (most recent call last):
File said:
###########################
Traceback (most recent call last):

Could please anybody confirm, whether this way of updating the
unicodedata for 2.7 is generaly viable or point out possible problem
this may lead to?
Many thanks in advance,
Vlastimil Brom
 
M

Martin v. Loewis

Is this intended or even guaranteed for these generated files to be
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?

It works because the generated files are just arrays of structures,
and these structures are the same in 2.7 and 3.2. However, there is
no guarantee about this property: you will need to check for changes
to unicodedata.c to see whether they may affect compatibility.

Regards,
Martin
 
V

Vlastimil Brom

2010/11/13 Martin v. Loewis said:
It works because the generated files are just arrays of structures,
and these structures are the same in 2.7 and 3.2. However, there is
no guarantee about this property: you will need to check for changes
to unicodedata.c to see whether they may affect compatibility.

Regards,
Martin

Thanks for the confirmation Martin!

Do you think, it the mentioned omission of the character names of some
CJK ranges in unicodedata intended, or should it be reported to the
tracker?

Regards,
Vlastimil Brom
 
M

Martin v. Loewis

Thanks for the confirmation Martin!

Do you think, it the mentioned omission of the character names of some
CJK ranges in unicodedata intended, or should it be reported to the
tracker?

It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions),
or at least have a safe-guard to detect that the data file is getting
out of sync with the C implementation.

Regards,
Martin
 
M

Martin v. Loewis

Thanks for the confirmation Martin!

Do you think, it the mentioned omission of the character names of some
CJK ranges in unicodedata intended, or should it be reported to the
tracker?

It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions),
or at least have a safe-guard to detect that the data file is getting
out of sync with the C implementation.

Regards,
Martin
 
V

Vlastimil Brom

2010/11/18 Martin v. Loewis said:
It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions),
or at least have a safe-guard to detect that the data file is getting
out of sync with the C implementation.

Regards,
Martin
Thanks,
I just created a bug ticket:
http://bugs.python.org/issue10459

The omissions of character names seem to be:

é¾¼ (0x9fbc) - é¿‹ (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

ð«€ (0x2b740) - ð«  (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

(Also the unprintable ASCII controls, Surrogates and Private use area,
where the missing names are probably ok.)

Unfortunately, I am not able to provide a patch, mainly because of
unicodadate being C code.
A while ago I considered writing some unicodedata enhancements in
python, which would support the ranges and script names, full category
names etc., but sofar the direct programatic lookups in the online
unicode docs and with some simple processing also do work
sufficiently...

Regards,
Vlastimil Brom
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top