string vs. basic_string


George2

Hello everyone,


I would like to hear about your experiences: when should we use
std::basic_string and when should we use std::string?

I have worked through some Hello World level samples and now want to
hear about your practical experiences. :)


thanks in advance,
George
 

Victor Bazarov

George2 said:
I would like to hear about your experiences: when should we use
std::basic_string and when should we use std::string?

I have worked through some Hello World level samples and now want to
hear about your practical experiences. :)

I almost never had to use 'basic_string'. 'std::string' is a typedef
for 'std::basic_string<char ...>' (where ... represents some stuff
related to 'char'). If you deal with 'char', 'std::string' is enough.
If you deal with wide char ('wchar_t'), 'std::wstring' is the class.
I heard that Unicode is never well served by either of those, so folks
have their own custom classes for Unicode, I guess.
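The typedef relationship Victor describes can be spelled out directly (these are the standard default template arguments):

```cpp
#include <string>
#include <type_traits>

// std::string and std::wstring are simply typedefs for basic_string
// instantiated for char and wchar_t with the default traits/allocator:
static_assert(std::is_same<std::string,
                  std::basic_string<char,
                                    std::char_traits<char>,
                                    std::allocator<char> > >::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring,
                  std::basic_string<wchar_t,
                                    std::char_traits<wchar_t>,
                                    std::allocator<wchar_t> > >::value,
              "std::wstring is basic_string<wchar_t>");
```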

An example of when I did have to use 'std::basic_string' is a library
where some functions were specialised for 'char' and 'wchar_t' (they
had to do that to use some non-overloaded functions); just for
completeness' sake, I added specialisations for 'unsigned char' and
'signed char' (since those are types distinct from 'char') and used
some casts internally.
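A sketch of what such an instantiation needs (not Victor's actual code; 'uchar_traits' and 'ustring' are our own names). The standard only guarantees char_traits for the built-in character types, so a portable basic_string of unsigned char wants its own traits class:

```cpp
#include <cstring>   // memcmp, memchr, memcpy, memmove, memset
#include <cwchar>    // std::mbstate_t
#include <ios>       // std::streamoff, std::streampos
#include <string>

// Minimal traits so basic_string can hold unsigned char:
struct uchar_traits {
    typedef unsigned char  char_type;
    typedef int            int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& d, const char_type& s) { d = s; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }
    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        return std::memcmp(a, b, n);
    }
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c) {
        return static_cast<const char_type*>(std::memchr(s, c, n));
    }
    static char_type* move(char_type* d, const char_type* s, std::size_t n) {
        return static_cast<char_type*>(std::memmove(d, s, n));
    }
    static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
        return static_cast<char_type*>(std::memcpy(d, s, n));
    }
    static char_type* assign(char_type* d, std::size_t n, char_type c) {
        return static_cast<char_type*>(std::memset(d, c, n));
    }
    static int_type to_int_type(char_type c) { return c; }
    static char_type to_char_type(int_type i) {
        return static_cast<char_type>(i);
    }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return -1; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};

typedef std::basic_string<unsigned char, uchar_traits> ustring;
```

With that in place, ustring behaves like std::string for raw byte data (push_back, comparison, substrings, etc.).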

V
 

Massimo

George2 said:
Hello everyone,


I would like to hear about your experiences: when should we use
std::basic_string and when should we use std::string?

I have worked through some Hello World level samples and now want to
hear about your practical experiences. :)


thanks in advance,
George
In most cases, we use std::string. I've never used std::basic_string.
 

Johannes Bauer

Victor said:
I heard that Unicode is never well served by either of those, so folks
have their own custom classes for Unicode, I guess.

Is anything planned for the future of C++?

Handling UTF-8 *correctly* is awkwardly difficult, especially since
different characters (not in the C sense of the word) may differ greatly
in length, depending on the type of character.

Some std::utf8string would be really nice.

Greetings,
Johannes
 

James Kanze

Johannes Bauer schrieb:
Is anything planned for the future of C++?

There will be new types, guaranteed to be UTF-16 and UTF-32.
Handling UTF-8 *correctly* is awkwardly difficult, especially
since different characters (not in the C sense of the word)
may differ greatly in length, depending on the type of
character.

That seems to be a widespread myth. In my experience, it's
rarely a problem. And the same thing is true for UTF-16, and
even UTF-32.
 

Johannes Bauer

James said:
There will be new types, guaranteed to be UTF-16 and UTF-32.

Well, those two are easy because of their fixed character lengths.
That seems to be a widespread myth. In my experience, it's
rarely a problem. And the same thing is true for UTF-16, and
even UTF-32.

It's really not something I simply assumed; it's something I've already
*implemented* myself. And it was a pain. Especially if you want halfway
decent recognition of whitespace (Unicode provides "hard" spaces etc.) or
quotation marks (the regular " is easy, but there's „ (and its upper
counterpart) and the French »« ones). Saying "I want the third character"
isn't trivial any more, since there is no direct mapping of characters to
bytes. Finding the n-th character in a long text is O(n), compared to
O(1) with ISO8859-1/UTF-16/UTF-32 strings. Therefore one has to implement
this cleverly so that access into large texts doesn't take too much time.
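A sketch of the scan being described (offset_of_codepoint is our own hypothetical helper, not from the thread): to find the n-th code point in a UTF-8 string you must walk the bytes, because characters are 1 to 4 bytes wide.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset of the n-th code point (counting from zero).
// Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80,
// which is how we recognise the start of each character.
std::size_t offset_of_codepoint(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;                        // step past the lead byte
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;                    // skip continuation bytes
        --n;
    }
    return i;                       // byte offset of code point n
}
```

For example, in "a" + "ö" (2 bytes in UTF-8) + "x", the third code point (n == 2) starts at byte offset 3 — a full scan, where a fixed-width encoding needs only a multiplication.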

Just thinking of it upsets my stomach a little, seriously. It ain't pretty.

Greetings,
Johannes
 

Erik Wikström

Well, those two are easy because of their fixed character lengths.

Actually, only UTF-32 has a fixed character length, though I think that
most characters commonly in use in the world can be represented in one
16-bit code unit (combining characters aside).
It's really not something I simply assumed; it's something I've already
*implemented* myself. And it was a pain. Especially if you want halfway
decent recognition of whitespace (Unicode provides "hard" spaces etc.) or
quotation marks (the regular " is easy, but there's „ (and its upper
counterpart) and the French »« ones). Saying "I want the third character"
isn't trivial any more, since there is no direct mapping of characters to
bytes. Finding the n-th character in a long text is O(n), compared to
O(1) with ISO8859-1/UTF-16/UTF-32 strings. Therefore one has to implement
this cleverly so that access into large texts doesn't take too much time.

Just thinking of it upsets my stomach a little, seriously. It ain't pretty.

There is a reason that text is often stored internally as UTF-16 instead
of UTF-8. Examples of such systems are Windows, Java, .NET, and OS X.
 

Cholo Lennon

Hello everyone,

I would like to hear about your experiences: when should we use
std::basic_string and when should we use std::string?

I have worked through some Hello World level samples and now want to
hear about your practical experiences. :)

In general, as Victor said, you almost never need to use basic_string
directly. BTW, as the guys in the Microsoft VC newsgroup told you, at
least in the Windows world it is common practice to use basic_string
with TCHAR to maintain backward compatibility with older versions of
Windows that don't support Unicode strings.
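A portable sketch of that idiom: on Windows, <tchar.h> defines TCHAR as wchar_t when _UNICODE is set and as char otherwise; we emulate that here so the sketch compiles anywhere. The name 'tstring' is a common convention, not part of Windows or the standard.

```cpp
#include <string>

// Stand-in for the <tchar.h> definition (assumption: non-Windows
// build, so we define TCHAR ourselves):
#if defined(_UNICODE) || defined(UNICODE)
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif

// One string type that follows TCHAR: std::string in an ANSI build,
// std::wstring in a Unicode build, from the same source code.
typedef std::basic_string<TCHAR> tstring;
```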


Regards
 

James Kanze

Actually, only UTF-32 has a fixed character length, though I think that
most characters commonly in use in the world can be represented in one
16-bit code unit (combining characters aside).

Even UTF-32 doesn't have fixed character length. And unless
you've verified that you have one of the canonical
representations, the same character may have two different
representations, one a single character, and the other a
multi-word character. Thus, for example, a small Latin letter o
with circumflex accent may be represented either as a single
word (L"\u00F4") or as two (L"\u006F\u0302"). Depending on
which normalized form is used, only one or the other would be
allowed, and if you're not using normalized forms, your code
must be prepared to handle both, and treat them as the same
character. (Regardless of the normalized form, some characters
will require a composite representation. Such characters are
rare in NFC, however.)

The result is that you have to handle variable length characters
anyway.
 

James Kanze

Johannes Bauer schrieb:
Well, those two are easy because of their fixed character lengths.

No they're not.
It's really not something I simply assumed; it's something I've already
*implemented* myself. And it was a pain. Especially if you want halfway
decent recognition of whitespace (Unicode provides "hard" spaces etc.) or
quotation marks (the regular " is easy, but there's „ (and its upper
counterpart) and the French »« ones). Saying "I want the third character"
isn't trivial any more, since there is no direct mapping of characters to
bytes.

But exactly the same problems affect UTF-16 and UTF-32. Some
characters require more than one element; some characters may
have multiple representations, etc., etc.

Implementing good character handling, anytime you go beyond the
basic character set, is difficult. I agree. But using UTF-8
doesn't make it substantially any more difficult, and generally
results in much faster code because of better locality. In
fact, I switched from UTF-32 to UTF-8 because it ended up
simpler.
Finding the n-th character in a long text is O(n), compared
to O(1) with ISO8859-1/UTF-16/UTF-32 strings. Therefore one
has to implement this cleverly so that access into large
texts doesn't take too much time.
Just thinking of it upsets my stomach a little, seriously. It
ain't pretty.

Having implemented complicated character handling code in both
UTF-32 and UTF-8, I can assure you that if you do it correctly,
UTF-8 is no more difficult (and maybe even a little bit easier)
than UTF-32. The only real difference is that if you do it
incorrectly, you'll probably hit the problem immediately (even
with purely English text) with UTF-8, whereas you'll only hit
it in exceptional cases with UTF-32. (And of course, it really
depends on what you are doing. If you limit yourself to NFC and
European languages, UTF-16/UTF-32 is probably simpler for
something like an editor, but UTF-8 would still be simpler for
parsing.)
 

Alf P. Steinbach

* James Kanze:
No they're not.



But exactly the same problems affect UTF-16 and UTF-32. Some
characters require more than one element; some characters may
have multiple representations, etc., etc.

Could you please elaborate on the problem with more than one encoding
element (32-bit value) per character in UTF-32? As far as I know, with
composite characters in UTF-32 the individual 32-bit values can still be
treated as individual characters in the cases where they can't be
reduced to a single code point. A counter-example would be nice.

Cheers,

- Alf
 
