What to prefer - TCHAR arrays, std::string, or std::wstring?

rohitpatel9999

Hi

While developing any software, developers need to think about its possible
enhancement for international usage, taking UNICODE into consideration.

I have read many nice articles/items in advanced C++ books (Effective
C++, More Effective C++, Exceptional C++, More Exceptional C++, C++
FAQs, Addison Wesley 2nd Edition)

The authors of these books have not considered UNICODE, so many of their
suggestions/guidelines leave developers unsure what to use for
character-string members of a class (considering exception safety,
reusability, and maintainability of code).

Many books have stated that:
Instead of using character arrays, always prefer using std::string.

My question is:

While developing a generic Win32 app in C++ for Windows
(98/NT/2000/2003/XP), considering Unicode for Windows NT/2000/2003/XP,
what to prefer - TCHAR arrays, std::string, or std::wstring - for
character-string members (name, address, city, state, country, etc.)
of classes like Address, Customer, Vendor, Employee?

What to prefer - TCHAR arrays, std::string, or std::wstring?

I truly appreciate any help or guideline.
Anand
 

Marcus Kwok

rohitpatel9999 said:
My question is:

While developing a generic Win32 app in C++ for Windows
(98/NT/2000/2003/XP), considering Unicode for Windows NT/2000/2003/XP,
what to prefer - TCHAR arrays, std::string, or std::wstring - for
character-string members (name, address, city, state, country, etc.)
of classes like Address, Customer, Vendor, Employee?

What to prefer - TCHAR arrays, std::string, or std::wstring?

I truly appreciate any help or guideline.

Standard C++ does not know about the TCHAR type (I know what it
represents, but it is not a standard language feature), and formally
also does not know about Unicode (std::wstring isn't quite Unicode).
Handling Unicode can be a complex topic, and one in which I cannot claim
to be well versed.

Your question is probably better suited for a Windows newsgroup.
 

Phlip

rohitpatel9999 said:
While developing any software, developers need to think about its possible
enhancement for international usage, taking UNICODE into consideration.

Negative. Programmers must prepare for _anything_. The requirement for
Unicode may or may not come next.

Prepare for anything by writing copious unit tests, and by folding as much
duplication as possible. If you duplicate the word "the" in two strings,
fold them into one.

If you then need to localize, read this:

http://flea.sourceforge.net/TFUI_localization.doc

Then incrementally move your strings into a pluggable resource, and
incrementally widen or convert your string variables. "Incrementally" means
one at a time, passing all tests after each small edit.

The myth that some important decisions must be made early, to avoid the cost
of a late change, is a self-fulfilling prophecy of defeat.

rohitpatel9999 said:
The authors of these books have not considered UNICODE, so many of their
suggestions/guidelines leave developers unsure what to use for
character-string members of a class (considering exception safety,
reusability, and maintainability of code).

Right. They all use std::string, because many programmers learned C first,
where a character array is still the simplest and most robust way to
represent a fixed-length string. So std::string should be the default,
without a real reason to use anything else. Such a reason could then switch
you to TCHAR, or to std::wstring, or to something else.

rohitpatel9999 said:
My question is:

While developing a generic Win32 app in C++ for Windows
(98/NT/2000/2003/XP), considering Unicode for Windows NT/2000/2003/XP,
what to prefer - TCHAR arrays, std::string, or std::wstring - for
character-string members (name, address, city, state, country, etc.)

Ask your "customer liaison", the person authorized to request features,
whether you should spend 9 days working on their next feature, or 18 days
working on that feature plus internationalization.

If they need only English, then use std::string everywhere you possibly can,
and something like CString for the remainder.

When they schedule a port to another language, you obtain a glossary for
that language _first_. Then you refactor your code to use something like
std::basic_string<TCHAR>.
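
Something along these lines, as a sketch only (the tstring alias and the
Customer member are just examples, assuming the usual <tchar.h> setup):

// Sketch: a string member whose character type follows the build setting.
#include <windows.h>   // TCHAR
#include <tchar.h>     // _T() literal macro
#include <string>

typedef std::basic_string<TCHAR> tstring;

struct Customer
{
    tstring name;   // narrow in an MBCS build, wide in a _UNICODE build
};

int main()
{
    Customer c;
    c.name = _T("Example");   // _T() adds the L prefix only when _UNICODE is defined
    return 0;
}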

If you truly need TCHAR in its WCHAR mode, then you must configure your
tests to run (and pass) with the _UNICODE version of your binary. You should
always pass all such tests, each time you change anything. Otherwise you
might make an innocent change that works in one mode, but breaks in another.

Further, not all code-pages need WCHAR or wchar_t. Spanish, for example,
is the same code-page as English. Greek is a different code-page, but it
still uses 8-bit bytes. So you should only enable the few features you need
to support another language, and not all those languages need Unicode. Some
versions of Chinese don't need it.

If you truly need "one binary that presents all languages, mixed together",
then you need Unicode. And if you need a rare language like Sanskrit or
Inuit, that has no independent 8-bit code-page, then you will need Unicode.
Otherwise you probably don't.

From here, you must read a book on internationalization. Yet you don't do
_any_ of that research until your business side has selected a target
language. Otherwise you will just be writing speculative features that
_might_ work with any language.

So default to std::string, and keep your programming velocity high. That
helps ensure that your clients will be _able_ to eventually target the
international markets...
 

rohitpatel9999

Thank you for the helpful suggestions.
The suggestion to use std::basic_string<TCHAR> is also good.

The client is sure that they will need UNICODE for a few languages (e.g.
Japanese).
The client requirements document did specify that the C++ code should be
kept generic with UNICODE in mind (but should not use the MFC-specific
CString).

So (in Microsoft Visual C++):
the application build for Win98/ME will have MBCS defined;
the application build for Win2000/NT/2003/XP will have UNICODE and _UNICODE
defined.

Please guide me (considering exception safety, reusability, and
maintainability of code).

What to prefer - TCHAR arrays, std::string, or std::wstring?

Or which of the following three class designs is preferable?

e.g.

/* Option 1 */
#include <tchar.h>   /* _TCHAR */

class Address
{
    _TCHAR name[30];
    _TCHAR addressline1[30];
    _TCHAR addressline2[30];
    _TCHAR city[30];
};


/* Option 2 */
#include <windows.h>   /* TCHAR */
#include <string>

class Address
{
    std::basic_string<TCHAR> name;
    std::basic_string<TCHAR> addressline1;
    std::basic_string<TCHAR> addressline2;
    std::basic_string<TCHAR> city;
};


/* Option 3 */
#include <string>

#ifdef UNICODE
typedef std::wstring tstring;
#else
typedef std::string tstring;
#endif

class Address
{
    tstring name;
    tstring addressline1;
    tstring addressline2;
    tstring city;
};
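
For comparison, a rough sketch of how a member gets filled in Option 1 versus
Option 2 (the struct names and the city text are made up; _tcsncpy is the
<tchar.h> mapping of strncpy/wcsncpy):

/* Sketch: Option 1 needs manual copy/truncation, Option 2 just assigns. */
#include <string>
#include <tchar.h>

struct AddressFixed              /* Option 1 style */
{
    _TCHAR city[30];
};

struct AddressDynamic            /* Option 2 style */
{
    std::basic_string<_TCHAR> city;
};

int main()
{
    AddressFixed   f;
    AddressDynamic d;

    /* Caller must guard the length and terminate by hand; long input is silently cut. */
    _tcsncpy(f.city, _T("A city name far too long to fit in thirty characters"), 29);
    f.city[29] = _T('\0');

    /* The basic_string member sizes itself; nothing to truncate. */
    d.city = _T("A city name far too long to fit in thirty characters");
    return 0;
}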

Thanks again.
Anand (Rohit)
 

Kirit Sælensminde

rohitpatel9999 said:
Hi

While developing any software, developers need to think about its possible
enhancement for international usage, taking UNICODE into consideration.

I have read many nice articles/items in advanced C++ books (Effective
C++, More Effective C++, Exceptional C++, More Exceptional C++, C++
FAQs, Addison Wesley 2nd Edition)

The authors of these books have not considered UNICODE, so many of their
suggestions/guidelines leave developers unsure what to use for
character-string members of a class (considering exception safety,
reusability, and maintainability of code).

Many books have stated that:
Instead of using character arrays, always prefer using std::string.

My question is:

While developing a generic Win32 app in C++ for Windows
(98/NT/2000/2003/XP), considering Unicode for Windows NT/2000/2003/XP,
what to prefer - TCHAR arrays, std::string, or std::wstring - for
character-string members (name, address, city, state, country, etc.)
of classes like Address, Customer, Vendor, Employee?

What to prefer - TCHAR arrays, std::string, or std::wstring?

I truly appreciate any help or guideline.
Anand

I don't use TCHAR as it's a horrid kludge and has problems of its own.
Although it pretends to support both wchar_t and char, it's slightly
broken. The _T macro that may or may not put the L in front of string
literals is even more broken.

As you're developing on Windows, just use wchar_t (and tell MSVC to define
it as a built-in type, not a typedef for unsigned short). You will get exactly
zero benefit from trying to compile the same program with and without
Unicode support.

It is normally much better to just use Unicode internally and then
convert to eight bit in whatever localised form you need when you have
to do so. You will find that you have to do all of this anyway for any
non-trivial program.
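
A rough sketch of that boundary conversion, assuming the Win32
WideCharToMultiByte call and UTF-8 as the eight-bit target (error handling
stripped to almost nothing; the helper name is made up):

// Sketch: keep std::wstring internally; convert to an 8-bit form only at the edges.
#include <windows.h>
#include <string>

std::string to_utf8(const std::wstring& wide)
{
    if (wide.empty())
        return std::string();

    // First call: ask how many bytes the UTF-8 form will need.
    int bytes = ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                                      static_cast<int>(wide.size()), 0, 0, 0, 0);
    std::string narrow(bytes, '\0');

    // Second call: write the converted bytes into the buffer.
    ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                          static_cast<int>(wide.size()), &narrow[0], bytes, 0, 0);
    return narrow;
}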


K
 

Phlip

rohitpatel9999 said:
The client is sure that they will need UNICODE for a few languages (e.g.
Japanese).

There are requirements and then there are requirements.

I once ported an application to Greek. The original author had added lots of
calls to convert between code-pages. Then the program never converted to any
code pages - it all worked in Western Europe with just one code-page.

I had a lot of fun diagnosing and fixing each bug, the first time any of
these conversion functions ever got called. Oh, and I was implicitly blamed
for the slow velocity, not the original programmer.

So, has this client arranged to provide a real Japanese locale, with a
glossary, for you to port the app to _now_?

Without the critical step of actually using this speculative code, the
client will instead order you to waste time twice, now when you proactively
code for Unicode, and later when you actually provide a new locale.

rohitpatel9999 said:
The client requirements document did specify that the C++ code should be
kept generic with UNICODE in mind (but should not use the MFC-specific
CString).

So (in Microsoft Visual C++):
the application build for Win98/ME will have MBCS defined;
the application build for Win2000/NT/2003/XP will have UNICODE and _UNICODE
defined.

Please guide me (considering exception safety, reusability, and
maintainability of code).

From here on, I can't. The question is now only on-topic for, roughly, a
Windows newsgroup, or possibly a localization forum thereof. However, MBCS
might provide for as much Japanese as UNICODE would.
You need to ask your client for a real Japanese locale, and then you need to
match your work to it. (And don't get me started about UCS.)

If they give you a glossary in the JIS201 code-page, then an 8-bit non-MBCS
would work for both the Win95s and the WinNTs. If you first enabled UNICODE,
and only then discovered that your glossary is in JIS201, then you would have
wasted that effort.

(You could use iconv to convert the glossary to UNICODE or back. The goal is
to match which code-page Japanese customers will accept. Has your client
actually researched this?)

rohitpatel9999 said:
What to prefer - TCHAR arrays, std::string, or std::wstring?

Joel Spolsky sez "there's no such thing as raw text". The rejoinder is that
wchar_t does not a localized application make.

If you need UNICODE, and if you truly need to pack all kinds of text into
any string, then you need a kind of UTF to encode it. UNICODE is a character
set, not an encoding. And if you can go with UTF-8, even on a Win95 machine,
then you don't need std::wstring.
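
To make the character-set-versus-encoding point concrete, here is a toy
sketch (the helper name is made up, and it only handles code points below
U+10000) that spells one Unicode code point as UTF-8 bytes inside a plain
std::string:

// Sketch: one code point (a character-set value) becomes different byte
// sequences depending on the encoding; this produces the UTF-8 form.
#include <string>

std::string code_point_to_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes, e.g. most Japanese characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;                            // code points above U+FFFF omitted for brevity
}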

rohitpatel9999 said:
_TCHAR name[30];

Never. The fixed-length string itself will cause untold horror.

rohitpatel9999 said:
std::basic_string<TCHAR> name;

Only if you actually test both modes, as you program.

And please introduce a typedef:

typedef std::basic_string<TCHAR> tstring;

rohitpatel9999 said:
/* Option 3 */
#ifdef UNICODE
typedef std::wstring tstring;

This is a clumsy version of Option 2.

The next complaint is that neither wchar_t nor WCHAR is "UNICODE". Sometimes
they are UTF-16. (And on some compilers wchar_t is UTF-32.)

The more you seek a simple answer, the harder this problem will get. The
answer would be simple if you had enough evidence to back up your decision.
Always get as much evidence as possible - preferably from live deployed
code - before making hard and irreversible decisions. Your client clearly
has experience with source code that created problems when it localized.
They _cannot_ fix this by just guessing you will need the _UNICODE flag
turned on. You must work with them to either defer the requirement, and
write clean code, or promote the requirement, targeting a real release
candidate that a real international user will accept.
 

Phlip

Kirit said:
As you're developing on Windows then just use wchar_t (and tell MSVC to
define it as a base type, not a typedef to short). You will get exactly
zero benefit from trying to compile the same program with and without
Unicode support.

Except that turning on _UNICODE will automagically make the compiler and
program interpret your RC file in UTF-16 instead of a code-paged 8-bit
encoding.

Kirit said:
It is normally much better to just use Unicode internally and then
convert to eight bit in whatever localised form you need when you have
to do so. You will find that you have to do all of this anyway for any
non-trivial program.

The OP also has the requirement to target the Win95s, which can't run in
Wide mode.

Aren't there strap-on DLL sets that provide a kind of Wide mode for the
Win95s? If so, the OP could deploy these with the application, build
everything for UNICODE, and safely neglect to enable any other code-pages.
 

loufoque

rohitpatel9999 said:
What to prefer - TCHAR arrays, std::string, or std::wstring?

Just make everything Unicode-aware without using any specific stupidity
from the Win32 API.
However, if you rely heavily on that API it may be annoying to interface
with it if you don't follow its internationalization concepts.
But anyway if you rely that much on it you're coding something so
specific that you should ask in another group.

std::wstring will allow UCS-2 (on Win32) and UCS-4 (on most Unices).
You can use std::string for 'unsafe' UTF-8, which is enough in most
cases.

Or you could use ICU or glibmm for advanced Unicode support.
 

Bo Persson

Phlip said:
Except that turning on _UNICODE will automagically make the compiler
and program interpret your RC file in UTF-16 instead of a code-paged
8-bit encoding.

You can turn that option on as well, if it has any advantage. Using
wchar_t and std::wstring in your application makes it independent of
those settings.

Phlip said:
The OP also has the requirement to target the Win95s, which can't
run in Wide mode.

Windows 95, 98, and NT are officially unsupported both as OSs and as
targets for the present compiler. All currently supported Windows
versions use wchar_t internally. New applications could do that as
well.

Using TCHAR to optionally compile a new application for a dead OS
doesn't seem very useful to me. :)

Phlip said:
Aren't there strap-on DLL sets that provide a kind of Wide mode for
the Win95s? If so, the OP could deploy these with the application,
build everything for UNICODE, and safely neglect to enable any other
code-pages.

Except that these are as dead as their OSs. Can't be distributed after
their end-of-life.


Bo Persson
 

loufoque

Phlip wrote:
The OP also has the requirement to target the Win95s, which can't run in
Wide mode.

Actually, you can probably do it with MSLU (the Microsoft Layer for
Unicode on Windows 95, 98, and Me systems).
 

Phlip

Bo said:
Windows 95, 98, and NT are officially unsupported both as OSs and as
targets for the present compiler. All currently supported Windows versions
use wchar_t internally. New applications could do that as well.

Nice to know, but I use "Win95s" to refer to the lineage, up to ME, and
WinNTs for versions up to Win2005 or whatever.

Bo said:
Using TCHAR to optionally compile a new application for a dead OS doesn't
seem very useful to me. :)

The OP seems to have a requirements bottleneck. Sometimes a client will
over-specify everything, hoping to keep their options open. Narrow
requirements and clean code will do that better than guessing that the
program must someday port to a Win95-derived platform.

Is WinME officially dead?

Bo said:
Except that these are as dead as their OSs. Can't be distributed after
their end-of-life.

You mean MS makes packaging an unsupported DLL illegal? They retract its
license or something? Don't they know the 17th Rule of Acquisition is "A
contract is a contract"?

Regardless, if the client actually needs to target the home market, they
must start with MS's official definition of that market.

Turning on UNICODE will make all OS strings wide, and will turn on UTF-16.
Hence, go with std::wstring, hard-coded, everywhere.
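
In sketch form, "hard-coded wide everywhere" looks roughly like this
(MessageBoxW is only a stand-in for any wide ...W entry point; the Customer
member is made up):

// Sketch: std::wstring members feed the ...W entry points of the Win32 API directly.
#include <windows.h>
#include <string>

struct Customer
{
    std::wstring name;
};

int main()
{
    Customer c;
    c.name = L"Example name";     // wide literal, no TCHAR or _T() involved
    ::MessageBoxW(0, c.name.c_str(), L"Customer", MB_OK);
    return 0;
}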
 

Bo Persson

Phlip said:
Nice to know, but I use "Win95s" to refer to the lineage, up to ME,
and WinNTs for versions up to Win2005 or whatever.


The OP seems to have a requirements bottleneck. Sometimes a client
will over-specify everything, hoping to keep their options open.
Narrow requirements and clean code will do that better than guessing
that the program must someday port to a Win95-derived platform.

Is WinME officially dead?

It is still supported I guess, but it never worked very well. Was sort
of a downgrade from Windows 98 - nothing much new, just more unstable.
:)

Phlip said:
You mean MS makes packaging an unsupported DLL illegal? They retract
its license or something? Don't they know the 17th Rule of
Acquisition is "A contract is a contract"?

From what I know, MS has removed it from their servers so you cannot
get it legitimately anymore. If you already use it and continue to
distribute it, they will probably not sue. If you have a problem
though, what happens?

Phlip said:
Regardless, if the client actually needs to target the home market,
they must start with MS's official definition of that market.

Turning on UNICODE will make all OS strings wide, and will turn on
UTF-16. Hence, go with std::wstring, hard-coded, everywhere.

Right.


Bo Persson
 

Phlip

Bo said:
Right.

Then, per my lecture on requirements, neither compile for nor use any 8-bit
mode, or std::string. Never leave a "flavor" of a program that's full of
bugs and nasty surprises, expecting that it "might be useful someday".
 
