F
Francis Girard
Hi,
For the first time in my programmer life, I have to take care of character
encoding. I have a question about the BOM marks.
If I understand well, into the UTF-8 unicode binary representation, some
systems add at the beginning of the file a BOM mark (Windows?), some don't.
(Linux?). Therefore, the exact same text encoded in the same UTF-8 will
result in two different binary files, and of a slightly different length.
Right ?
I guess that this leading BOM mark are special marking bytes that can't be, in
no way, decoded as valid text.
Right ?
(I really really hope the answer is yes otherwise we're in hell when moving
file from one platform to another, even with the same Unicode encoding).
I also guess that this leading BOM mark is silently ignored by any unicode
aware file stream reader to which we already indicated that the file follows
the UTF-8 encoding standard.
Right ?
If so, is it the case with the python codecs decoder ?
In python documentation, I see theseconstants. The documentation is not clear
to which encoding these constants apply. Here's my understanding :
BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only
Why should I need these constants if codecs decoder can handle them without my
help, only specifying the encoding ?
Thank you
Francis Girard
Python tells me to use an encoding declaration at the top of my files (the
message is referring to http://www.python.org/peps/pep-0263.html).
I expected to see there a list of acceptable
For the first time in my programmer life, I have to take care of character
encoding. I have a question about the BOM marks.
If I understand well, into the UTF-8 unicode binary representation, some
systems add at the beginning of the file a BOM mark (Windows?), some don't.
(Linux?). Therefore, the exact same text encoded in the same UTF-8 will
result in two different binary files, and of a slightly different length.
Right ?
I guess that this leading BOM mark are special marking bytes that can't be, in
no way, decoded as valid text.
Right ?
(I really really hope the answer is yes otherwise we're in hell when moving
file from one platform to another, even with the same Unicode encoding).
I also guess that this leading BOM mark is silently ignored by any unicode
aware file stream reader to which we already indicated that the file follows
the UTF-8 encoding standard.
Right ?
If so, is it the case with the python codecs decoder ?
In python documentation, I see theseconstants. The documentation is not clear
to which encoding these constants apply. Here's my understanding :
BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only
Why should I need these constants if codecs decoder can handle them without my
help, only specifying the encoding ?
Thank you
Francis Girard
Python tells me to use an encoding declaration at the top of my files (the
message is referring to http://www.python.org/peps/pep-0263.html).
I expected to see there a list of acceptable