UTF-8 and strings

Miles Bader · Jun 11, 2011

MikeP said:
I guess I have a hard time seeing how anything multi-byte is a boon. But,
and it's a big but (not to be confused with a phat azz!), if one doesn't
need "internationalization" (I mean other than English), it's a waste of
effort. Yes?

But that's the thing: if you're just doing things casually, but,
e.g., want to use a few special chars here and there, or allow users
more freedom in what filenames they're allowed to use, then UTF-8
_doesn't_ require much effort; it's a fairly easy tweak to ASCII-only
code. If the bulk of strings are English, then UTF-8 is also very
space-efficient, since every ASCII character still takes one byte.
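
To make the "easy tweak" point concrete, here is a minimal sketch, assuming the input is already valid UTF-8. The helper names (count_code_points, base_name, run_examples) are made up for illustration and come from no library. Byte-oriented code that searches for an ASCII separator keeps working unchanged, and counting characters only needs to skip continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is valid UTF-8 (no validation is done here).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Plain ASCII-era code: find the last '/' and return what follows.
// It stays correct on UTF-8 because bytes below 0x80 never appear
// inside a multi-byte sequence.
std::string base_name(const std::string& path) {
    std::size_t pos = path.rfind('/');
    return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Hypothetical usage: a file name with two accented letters.
void run_examples() {
    const std::string path = "docs/r\xC3\xA9sum\xC3\xA9.txt"; // docs/résumé.txt
    const std::string name = base_name(path);
    assert(name == "r\xC3\xA9sum\xC3\xA9.txt");
    assert(name.size() == 12);              // bytes
    assert(count_code_points(name) == 10);  // characters
    assert(count_code_points("hello") == 5); // pure ASCII: 1 byte each
}
```

Note that counting code points is not the same as counting what a user sees as characters (combining marks are separate code points), so this is only a first step.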

It's UTF-16, which requires even the most trivial parts of
string-handling paths to be completely replaced, that's a pain in the
butt -- and then really offers almost no advantage to offset the
various disadvantages!

The only reasons I can see to use UTF-16 are: (1) you're writing
Windows-only code, never expect to port it, and want to fit better
with Windows library functions that expect UTF-16 strings, or (2)
you're writing an app to handle absolutely _massive_ amounts of CJK
text, and the space savings for CJK text in UTF-16 compared to UTF-8
(2 bytes instead of 3 for most CJK characters) are critical for you.

Very, very few people are doing (2), so basically that leaves (1).
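
The size trade-off in (2) can be checked with a small sketch. The function names below are hypothetical, and the code assumes every input value is a valid Unicode code point (no surrogates, nothing above U+10FFFF). It only computes encoded sizes; it does not perform any conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;    // e.g. Latin letters with accents
    if (cp < 0x10000) return 3;  // rest of the BMP, including most CJK
    return 4;                    // supplementary planes
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP,
// a surrogate pair (two units) outside it.
std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

std::size_t total_utf8(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) n += utf8_bytes(cp);
    return n;
}

std::size_t total_utf16(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) n += utf16_bytes(cp);
    return n;
}

// Hypothetical usage: three CJK characters versus five ASCII letters.
void run_examples() {
    const std::u32string cjk = U"\u65E5\u672C\u8A9E";
    assert(total_utf8(cjk) == 9);
    assert(total_utf16(cjk) == 6);

    const std::u32string english = U"hello";
    assert(total_utf8(english) == 5);
    assert(total_utf16(english) == 10);
}
```

So UTF-16 saves a third on pure CJK text and doubles the size of pure ASCII text; mixed content such as markup around CJK text lands somewhere in between.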

-Miles

John M. Dlugosz · Jun 13, 2011

MikeP said:
I guess I have a hard time seeing how anything multi-byte is a boon. But,
and it's a big but (not to be confused with a phat azz!), if one doesn't
need "internationalization" (I mean other than English), it's a waste of
effort. Yes?

Since ASCII is a proper subset of UTF-8, you can write plain English
and get one byte per character. So there is no special effort on your
part.
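
A minimal sketch of the subset property, assuming the input is a valid Unicode scalar value (the function name encode_utf8 is made up here and is not a standard library call): code points below 0x80 come out as the same single byte they have in ASCII, and only larger values grow to multiple bytes.

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-8.
// Assumption: cp is a valid scalar value (not a surrogate, <= 0x10FFFF);
// no error handling is attempted in this sketch.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII range: the byte is the code point itself.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical usage: every ASCII value encodes to itself.
void run_examples() {
    for (char32_t cp = 0; cp < 0x80; ++cp) {
        const std::string s = encode_utf8(cp);
        assert(s.size() == 1);
        assert(static_cast<unsigned char>(s[0]) == cp);
    }
    assert(encode_utf8(U'A') == "A");
    assert(encode_utf8(0xE9) == "\xC3\xA9");       // U+00E9, e with acute
    assert(encode_utf8(0x65E5) == "\xE6\x97\xA5"); // U+65E5, a CJK character
}
```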

It's rare that you would not care about internationalization. Even if
you don't plan to translate your displayed UI into other languages,
people will try using file names and entering strings in their own
language.

John M. Dlugosz · Jun 13, 2011

But if you KNOW that all you need is what's in the BMP, why not exploit
that, right?

Sure, the project is specified to be nationalized into 7 languages,
and they all happen to be serviced by the Latin-1 character set. So
you decide to use 8-bit chars and assume the Windows program is
running on a system that uses code page 1252 as the default for a
process.

Then one day the boss comes in and says that the next version will be
marketed to China as well.

It is my experience that software projects only get more complex over
time. Plan for it, unless you are planning to be unsuccessful.

—John

Asger-P · Jun 13, 2011

Hi ruben

which is why it took 40 plus years to even think about it...

BTW - what you wrote is actually incorrect. I'm not an expert on utf-8
but god knows I've followed enough arguing about it, especially on Rik
Moens conspire mailing list, to understand this basic fact.

What part is incorrect ?

Have a look at:
http://en.wikipedia.org/wiki/UTF-8
and You will see that the first 128 characters (code points 0 to 127)
are actually ASCII

Best regards
Asger-P

Asger-P · Jun 13, 2011

Hi ruben

Strangely enough, this is a specific problem for a specific kind of app,
like a word processor.

I think You are narrowing it a bit too much.
Most applications that interact with the user and their keyboard
need to consider codepages at some level, if they want to be used
outside the region where they were designed.

A simple thing like comparing two strings case-insensitively will often
not work on non-ASCII characters if You use the standard C functions.
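
As an illustration of that failure, here is a small sketch, assuming UTF-8 input and the default "C" locale (the function name naive_iequals is hypothetical). It folds case one byte at a time with std::tolower, which is what a typical ASCII-era comparison does, and it gets the Danish word for apple wrong.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Byte-by-byte case-insensitive comparison using std::tolower.
// In the default "C" locale only A-Z are lowered; bytes >= 0x80,
// such as the pieces of a UTF-8 sequence, pass through unchanged.
bool naive_iequals(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behaviour.
        int ca = std::tolower(static_cast<unsigned char>(a[i]));
        int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return false;
    }
    return true;
}

// Hypothetical usage: ASCII works, a non-ASCII letter does not.
void run_examples() {
    assert(naive_iequals("HELLO", "hello"));

    // U+00C6 (capital AE) is C3 86 in UTF-8, U+00E6 (small ae) is C3 A6.
    const std::string upper = "\xC3\x86" "BLE";
    const std::string lower = "\xC3\xA6" "ble";
    assert(!naive_iequals(upper, lower)); // a user would expect a match
}
```

Getting this right needs case folding on whole code points with Unicode-aware tables, which is the kind of thing a dedicated localization library provides.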

This is an interesting page to read:
http://cppcms.sourceforge.net/boost_locale/html/appendix.html

If You live in a country where English is the language, You
probably haven't seen the errors Yourself, but fortunately
I live in Denmark, so I have had to deal with this issue from
day one.

Best regards
Asger-P

