Alexander Adam
Hi,
I am a bit lost in encoding-related stuff. Let me explain what I am
doing (yes, it's C++):
I am getting some input content through the Expat XML parser. I've set
up Expat to use wchar_t.
First question is this -- what is the difference between unsigned
short, wchar_t and char?
Okay, wchar_t is a built-in type in C++ and it's two bytes in size,
whereas char is always one byte.
But what's the real difference when storing text in those types, i.e.
ASCII, UTF-8, UTF-16 or UTF-32 encoded text?
AFAIK, UTF-8 is 2 bytes, UTF-16 is 2 bytes and UTF-32 is up to four
bytes? Well anyway, my issue is how to work with those types
correctly. Internally I am using wchar_t for all my representations,
but depending on the encoding I need to shift the current char value
bitwise, right?
Okay, next one -- I am storing my whole wchar_t array in a stream of
type char, using a simple memcpy. Now how can I read it back in? Say I
have a char* buffer in which my wchar_t string is saved. I could
surely do a simple memcpy(myWcharVar, buffer, sizeof(wchar_t)) to get
two bytes, but this doesn't seem very efficient, as I'd like to read
it character by character (like wchar_t nx = buffer.next(), know what
I mean?).
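To make the buffer.next() idea concrete, here is a rough sketch of the
kind of reader I have in mind. The names WideReader and has_next are
made up for illustration, and it assumes the buffer is read back on
the same machine that wrote it, so sizeof(wchar_t) and the byte order
are unchanged:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of Expat or the standard library): walks a
// char buffer that was filled by memcpy'ing a wchar_t array into it.
class WideReader {
public:
    WideReader(const char* data, std::size_t size_in_bytes)
        : data_(data), size_(size_in_bytes), pos_(0) {
        // The buffer must hold a whole number of wchar_t units.
        assert(size_in_bytes % sizeof(wchar_t) == 0);
    }

    // True while at least one more wchar_t can be read.
    bool has_next() const { return pos_ + sizeof(wchar_t) <= size_; }

    // Copies the next sizeof(wchar_t) bytes into a wchar_t and advances.
    // memcpy is used instead of casting char* to wchar_t* so that an
    // unaligned buffer position is not a problem.
    wchar_t next() {
        assert(has_next());
        wchar_t value;
        std::memcpy(&value, data_ + pos_, sizeof(wchar_t));
        pos_ += sizeof(wchar_t);
        return value;
    }

private:
    const char* data_;
    std::size_t size_;
    std::size_t pos_;
};
```

Usage would be something like WideReader r(buffer, length); followed
by while (r.has_next()) { wchar_t nx = r.next(); ... }.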
And then, after having read such a character, I must be able to encode
it correctly. I know the encoding, whether it's ASCII, UTF-8, UTF-16
or anything else, but how would I go about it *without* using any big
libraries?
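For the UTF-8 case, I imagine the hand-written bit shifting would look
roughly like the sketch below. The name append_utf8 is made up, it
assumes the input is already a full Unicode code point (not a UTF-16
surrogate half), and it uses char32_t, so it needs C++11:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one Unicode code point
// to `out`. Returns false (and appends nothing) for values above U+10FFFF
// and for surrogate code points, which are not valid on their own.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

So one character ends up as anywhere from one to four bytes in UTF-8,
depending on its value.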
Thanks for *any* clarifications you could help out with on this topic,
Alex