Alf P. Steinbach
After about a year of non-blogging I just posted about this: Unicode
console programs.
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
It's interesting that there is not much whining about how Windows
consoles do not properly support international programs. Perhaps console
programs are not as popular as they once were? Perhaps students nowadays
start directly with GUI programs, in some other language?
Anyway, here's the summary in the posting:
<summary>
Above I introduced two approaches to Unicode handling in small Windows
console programs:
* The all UTF-8 approach where everything is encoded as UTF-8, and
where there are no BOM encoding markers.
* The wide string approach where all external text (including the
C++ source code) is encoded as UTF-8, and all internal text is encoded
as UTF-16.
The all UTF-8 approach is the approach used in a typical Linux
installation. With this approach a novice can remain unaware that he is
writing code that handles Unicode: it Just Works™ – in Linux. However,
we saw that it mass-failed in Windows:
* Input with active codepage 65001 (UTF-8) failed due to various bugs.
* Console output with Visual C++ produced gibberish due to the
runtime library’s attempt to help by using direct console output.
* I mentioned how wide string literals with non-ASCII characters
are incorrectly translated to UTF-16 by Visual C++, due to the necessity
of lying to Visual C++ about the source code encoding (which is
accomplished by not having a BOM at the start of the source code file).
The wide string approach, on the other hand, was shown to have special
support in Visual C++, via the _O_U8TEXT file mode, which I called a
UTF-8 stream mode. But I mentioned that as of Visual C++ 10 this special
file mode is not fully implemented and/or it has some bugs: it cannot be
used directly but needs some scaffolding and fixing. That’s what part 2
is about.
</summary>
Cheers,
- Alf