support for UTF-8 in C language standard?

David Mathog · Nov 22, 2006

Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.
For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes. So even though the number of characters is the same
they couldn't just be swapped byte for byte.

Thanks,

David Mathog

Mathias Gaunard · Nov 23, 2006

David said:
Does any standard C function support reading or writing UTF-8?

No.
UTF-8 is pretty simple though, and C code is available everywhere.
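
For illustration, here is a minimal sketch of the kind of decoder being described: it reads 1 to 4 byte UTF-8 sequences and stores the code points in an array of 32-bit integers, as the original question asks. The names utf8_decode_one and utf8_decode are hypothetical, not taken from any library, and the error handling (stop at the first malformed sequence) is just one possible choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available).
 * Stores the code point in *out and returns the bytes consumed,
 * or 0 if the sequence is truncated or malformed.
 * Hypothetical helper, for illustration only. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* truncated input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min_for_len[len]) return 0;          /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate code point */

    *out = cp;
    return len;
}

/* Decode n bytes of UTF-8 into out[] (capacity cap).
 * Returns the number of code points, or (size_t)-1 on malformed
 * input or if out[] is too small. */
size_t utf8_decode(const char *src, size_t n, uint32_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t count = 0;

    while (n > 0) {
        uint32_t cp;
        size_t used = utf8_decode_one(s, n, &cp);
        if (used == 0 || count == cap) return (size_t)-1;
        out[count++] = cp;
        s += used;
        n -= used;
    }
    return count;
}
```

Once the text is in a uint32_t array, character N is simply element N, which is what makes operations such as swapping two ranges straightforward.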

For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes.

And how would having a standard function change that?

J. J. Farrell · Nov 23, 2006

David said:
Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.
For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes. So even though the number of characters is the same
they couldn't just be swapped byte for byte.

Yes. Assuming your environment has a locale which supports UTF-8 and
whatever format you want the result in (UCS-4, presumably), then the
multibyte and wide character functions should do what you want; see
mbtowc() and mbstowcs() for starters.
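
A rough sketch of that approach follows. The locale names tried here are assumptions: they are commonly seen on Unix-like systems, but nothing in the C standard guarantees any of them exists. The helper names select_utf8_locale and utf8_to_wide are hypothetical. Whether wchar_t actually holds UCS-4 values is also implementation-defined (an implementation that defines __STDC_ISO_10646__ promises it does).

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Try a few UTF-8 locale names for LC_CTYPE.
 * Returns 1 if one was accepted, 0 if none is available.
 * The names are guesses; they are not standardized. */
int select_utf8_locale(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "en_US.utf8"
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Convert a multibyte string to wide characters using the current
 * LC_CTYPE locale. Returns the number of wide characters stored
 * (excluding the terminator), or (size_t)-1 on an invalid sequence.
 * If the return value equals cap, dst is not null-terminated. */
size_t utf8_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    return mbstowcs(dst, src, cap);
}
```

A caller would check select_utf8_locale() first and fall back to some other strategy if it returns 0, since without a UTF-8 locale mbstowcs() will interpret the bytes in whatever encoding the current locale uses.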

Stephen Sprunk · Nov 23, 2006

David Mathog said:
Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.

C's locale support partly addresses this; unfortunately, locale names
are not standardized, so your program still won't be portable in practice
even if the code is technically portable. However, if you can find the
right locale on your system, it's possible to use C's standard functions
to turn an input stream into an array of wchar_t values, manipulate them
as desired, and output them again as UTF-8.

<OT>There are a number of third-party libraries that provide a specific
set of conversions including UTF-8, such as libiconv. However, those
libraries are not part of the C Standard itself and thus not portable
either.</OT>

S
