C++, wchar_t, Unicode and all that stuff

gamehack · Dec 23, 2005

Hi all,

I was doing a bit of research about writing yet another build tool, but
that's not the point of my mail. I'm going to ask a few questions about
how to resolve a few internationalization problems, and I'm sorry if
this is not the right mailing list; I couldn't find any other that was
suitable (since my goal is to resolve the problems in a
platform-independent way). The goal is to be able to deal with
different encodings on different platforms in a portable fashion.
After reading a few articles on the net I realized that everything
boils down to the character size. The problem splits into how you
manage chars/strings internally and externally.

Internally (the way they are put in the source code files and what types
they are stored in):

Using wchar_t:
Basically, using wchar_t as the fundamental character type (AFAIK it is
2-4 bytes depending on the platform) and using all the corresponding w
functions and streams. The problem is what to do if there is no OS
function that accepts wchar_t. Then I would need to write my own
library to handle the proper conversions (not sure if simple type casts
would do the job). Also, wchar_t is not specified to use any particular
encoding, so I'm a bit confused about that. If I write
wchar_t* st = "something"; in a source file, what encoding would it be
stored as? And what about wchar_t* st = L"something"; UTF-8?

Using UTF-8:
I've not seen any articles on how to do this (except suggestions to use
long unsigned to store the chars, but what about conversions and
passing strings to OS functions?)

Externally (OS interfaces):
I have no idea at all how to handle this. When you write e.g.
main(int argc, char** argv), what happens if the arguments are passed
as UTF-8 strings? How do you handle that? How do you handle conversion
to and from the internal representation (writing your own library?) Is
there actually a portable way of doing it?

I'm sorry if this is not the right place to ask these questions, but I'm
completely puzzled and thought you guys would be able to point me in
the right direction. As I said, the only thing I need is to be able to
communicate with the OS in a transparent manner without worrying about
the encoding, and to be able to use the future program in pure UTF-8
environments, so that any valid UTF-8 could be passed, etc. Any
comments/directions/remarks are greatly appreciated.

Regards,
gamehack

Guest · Dec 23, 2005

gamehack said:
If I write in a source file
wchar_t* st = "something"; what encoding would it be stored as? And
what about wchar_t* st = L"something"; UTF-8?

Let me quote a post by Ulrich Eckhardt (from
microsoft.public.windowsce.embedded.vc). The complete thread, which
gives a better overview of the problem I asked about, is here:
http://tinyurl.com/dbhyj

"It is invalid C or C++ to embed these characters*** into sourcecode.
You are relying on compiler-specific support.
That said, there is a #pragma to tell MSC which codepage you're using."

*** - here Ulrich is talking about the Polish characters I embedded in my code
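
A small sketch of one way around the compiler-specific behaviour Ulrich
describes: spelling the non-ASCII characters as universal character
names (\uXXXX) keeps the source file pure ASCII, so the result should
not depend on which codepage the compiler assumes for the file. The
function name polish_literal_length is made up for this illustration.
Note also that wchar_t* st = "something"; should not compile at all,
because a narrow literal is an array of const char. The L prefix gives
a wide literal whose encoding is implementation-defined (typically
UTF-16 on Windows and UTF-32 on most Unix-like systems), not UTF-8.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper for illustration: returns the number of wchar_t
// elements in a wide literal written with universal character names.
std::size_t polish_literal_length() {
    // "zażółć" spelled without any non-ASCII bytes in the source file:
    // U+017C (ż), U+00F3 (ó), U+0142 (ł), U+0107 (ć).
    const wchar_t* text = L"za\u017C\u00F3\u0142\u0107";
    return std::wcslen(text);
}
```

All four characters are in the Basic Multilingual Plane, so each one
fits in a single wchar_t whether wchar_t is 2 or 4 bytes wide.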

Using UTF-8:
I've not seen any articles on how do this(except suggestions to use
long unsigned to store the chars but what about conversions and passing
strings to OS functions?)

"Chapter 2 - An Introduction to Unicode" from the following book may be
helpful: http://www.charlespetzold.com/pw5/index.html
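
To make the "long unsigned to store the chars" idea concrete, here is a
minimal sketch of turning one code point held in an unsigned long into
its UTF-8 byte sequence. The function name encode_utf8 is an assumption
for illustration, not something from the book or from any library, and
returning an empty string for invalid input is just one possible error
convention.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return out;  // surrogate halves are not valid code points
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The resulting std::string is an ordinary char sequence, so it can be
handed to any OS function that takes char*, provided that function
actually expects UTF-8.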

Finally, I have seen many posts on Usenet about how to handle
Unicode and non-Unicode text in the same program, etc., and what I can
say is that there seems to be no single best solution.
I mainly develop for the Windows CE platform and I try to follow the
suggestions Charles Petzold presents in his book, and that works (but I
don't know if it would work on Unix, because on Unix I hardly ever use
Unicode).

Cheers

Axter · Dec 23, 2005

gamehack said:
Externally(OS interfaces):
I've completely no idea how to handle this. When you write e.g.
main(int argc, char** argv) what happens if they pass the arguments as
UTF-8 strings? How do you handle that? How do you handle conversion
back/from the internal representation(writing your own library?) Is
there actually a portable way of doing it?

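Both the C and C++ standards provide portable functions for converting
between multibyte (narrow) and wide-character strings; the conversion
follows the current C locale.
Check your man pages for wcstombs and mbstowcs.
Example code:
// Needs <cstdlib>, <fstream>, <string> and using namespace std.
// FileWithWideChar and FileWithAnsiChar are the input/output file names.
wifstream wide_file(FileWithWideChar);
wstring TmpLineData;
string CmpFileData_InAnsi, AnsiTmpLine;
while (getline(wide_file, TmpLineData))
{
    // One wide character can need up to MB_CUR_MAX bytes; +1 for the '\0'.
    AnsiTmpLine.resize(TmpLineData.size() * MB_CUR_MAX + 1);
    // wcstombs takes raw pointers, not iterators.
    size_t n = wcstombs(&AnsiTmpLine[0], TmpLineData.c_str(),
                        AnsiTmpLine.size());
    if (n == static_cast<size_t>(-1))
        n = 0; // a character could not be converted in this locale
    AnsiTmpLine.resize(n); // keep only the bytes actually written
    CmpFileData_InAnsi += AnsiTmpLine + "\n";
}

ofstream ansi_file(FileWithAnsiChar);
ansi_file.write(CmpFileData_InAnsi.data(),
                CmpFileData_InAnsi.size());
ansi_file << endl;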
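
The same two calls can be wrapped once and reused. The following is
only a sketch: the names to_wide and to_narrow are invented here, the
choice to throw on failure is an assumption, and both helpers depend on
whatever C locale is active when they are called. Strings containing
embedded '\0' characters are cut at the first one.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: multibyte (narrow) -> wide, per the current C locale.
std::wstring to_wide(const std::string& narrow) {
    // The output never has more wide characters than the input has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for this locale");
    }
    return std::wstring(&buf[0], n);
}

// Hypothetical helper: wide -> multibyte (narrow), per the current C locale.
std::string to_narrow(const std::wstring& wide) {
    // Each wide character needs at most MB_CUR_MAX bytes, plus the '\0'.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in this locale");
    }
    return std::string(&buf[0], n);
}
```

In the default "C" locale only plain ASCII is guaranteed to round-trip;
anything else depends on the locale selected with setlocale.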

gamehack · Dec 23, 2005

But how do I know what encoding the input from the OS is?
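
As far as I know, standard C and C++ offer no portable call that
returns the name of the encoding. The usual convention is to call
setlocale(LC_ALL, "") at startup, which adopts the locale from the
user's environment, and then let mbstowcs interpret argv in whatever
encoding that locale implies. A sketch under that assumption follows;
the helper names adopt_user_locale and decode_arguments are made up,
and turning undecodable arguments into empty strings is just one
possible policy.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: switch to the user's environment locale.
// Returns the locale name, or an empty string if the request failed.
std::string adopt_user_locale() {
    const char* name = std::setlocale(LC_ALL, "");
    return name ? name : "";
}

// Hypothetical helper: decode argv using the current C locale.
std::vector<std::wstring> decode_arguments(int argc, char** argv) {
    std::vector<std::wstring> out;
    for (int i = 0; i < argc; ++i) {
        // Never more wide characters than input bytes, plus the terminator.
        std::vector<wchar_t> buf(std::strlen(argv[i]) + 1);
        std::size_t n = std::mbstowcs(&buf[0], argv[i], buf.size());
        if (n == static_cast<std::size_t>(-1)) {
            n = 0;  // bytes are not valid in this locale's encoding
        }
        out.push_back(std::wstring(&buf[0], n));
    }
    return out;
}
```

A program would call adopt_user_locale() first thing in main and then
decode_arguments(argc, argv). On Windows the char** argv is delivered in
the ANSI code page, so full Unicode arguments need a Windows-specific
route such as wmain, which is outside the standard.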
