Unicode and stream


Basil

Hello.

I have the Borland C++ Builder 6.0 compiler.

I have an example:

#include <strstrea.h>

int main () {
wchar_t ff [10] = {' s','d ', 'f', 'g', 't'};
istrstream b1 (ff);
return 0;
}

This example gives a compile error.
Error message: Could not find a match for 'istrstream::istrstream(wchar_t *)'.

Questions:

1. Can I have a Unicode stream?
2. If it is impossible, can I work with Unicode without OS tools?
I want to work with Unicode using only language facilities.
3. Are there other compilers that support Unicode streams?
4. What does the standard say about Unicode streams?

Thanks, Basil
 

Bob Hairgrove

Hello.

I have the Borland C++ Builder 6.0 compiler.

I have an example:

#include <strstrea.h>

strstream is deprecated, although many people find it useful in some
circumstances. But you seem to want to use variable-length buffers
with normal (wide) text. Therefore, use wstringstream (or stringstream
for narrow strings) instead.

Also, standard headers do not use the ".h" extension anymore. To
include stringstreams, you need:

#include <sstream>

Basil said:
int main () {
wchar_t ff [10] = {' s','d ', 'f', 'g', 't'};

You should prefix your literal char elements with L, i.e.:
wchar_t ff [10] = {L's', L'd', L'f', L'g', L't'};

I am assuming that the spurious spaces are typos ... ??
istrstream b1 (ff);

std::wistringstream b1 (ff);
return 0;
}

This example gives a compile error.
Error message: Could not find a match for 'istrstream::istrstream(wchar_t *)'.

Questions:

1. Can I have a Unicode stream?

You can have a stream of wide characters, i.e. wchar_t. Since Unicode
is not part of the C++ language, whether or not you can store ANY
Unicode string in a wide stream depends on the kind of Unicode
encoding you have chosen and whether or not each of the Unicode
characters fits into a wchar_t. Some don't and require multi-character
(3 or 4 byte) representations.
2. If it is impossible, can I work with Unicode without OS tools?
I want to work with Unicode using only language facilities.

Here again, you can work with wide streams using standard C++. But
Unicode is a special set of encodings which can be implemented using
wide streams. Actually, there is one Unicode encoding (UTF-8) which
can be implemented using narrow streams and strings.
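As an editorial illustration of that last point: UTF-8 can be produced
with nothing but standard narrow strings. A minimal sketch (the function
name utf8_encode is my own, and real code would also reject surrogates
and values above U+10FFFF):

```cpp
#include <string>

// Encode a single Unicode code point as UTF-8 in a narrow std::string,
// using only standard C++ -- no OS support required.
std::string utf8_encode(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes, up to U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, utf8_encode(0x20AC) yields the three bytes 0xE2 0x82 0xAC
for the euro sign, all held in an ordinary std::string.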

Since Unicode is not part of the C++ standard, you will need some
additional support for working with Unicode. If you want to use ONLY
standard C++, you would have to write your own encoding libraries.
Otherwise, if your OS supports it (and Windows does offer a great deal
of Unicode support), I would take advantage of that. There are several
API functions such as WideCharToMultiByte, etc. you can use ... you
aren't stuck with VCL objects such as AnsiString, etc.
3. Are there other compilers that support Unicode streams?

All standards-compliant compilers support wide streams.
4. What does the standard say about Unicode streams?

Nothing, AFAIK.
 

Dietmar Kuehl

Basil said:
#include <strstrea.h>

The above header is not part of the C++ standard. However, the header
<strstream>
with its deprecated strstream classes is. I assume you just mistyped
its name and
forgot an appropriate namespace 'std' qualification (or using
directive).
int main () {
wchar_t ff [10] = {' s','d ', 'f', 'g', 't'};
istrstream b1 (ff);
return 0;
}
Error message: Could not find a match for 'istrstream::istrstream(wchar_t *)'.

'istrstream' is for narrow characters and there is no wide character
version as
'istrstream' is not a class template like its replacement
'basic_istringstream'
(at least conceptually using 'std::basic_string' as its
representation). This
should work:

std::wistringstream b1(ff);

Note that 'ff' has to be null terminated for this to work. Since the
array 'ff'
has more elements than mentioned in the initializer list, it is filled
up by
null characters. I would consider this to be a pure accident and would
rather
write it like this:

wchar_t ff[] = L"sdfgt";

String literals, whether using narrow or wide characters, are always
automatically
null terminated.
1. Can I have a Unicode stream?

You can have a wide character stream. Unicode is an external encoding
and it
does not make much sense to talk of a Unicode stream (*). You can have
Unicode
encoding of the stuff written externally, if the implementation ships
with an
appropriate code conversion facet ('std::codecvt') or if you have a
suitable
implementation thereof (e.g. Dinkumware, <www.dinkumware.com> offers a
library
doing things like this; you can implement it yourself if you want to).

(*) At least conceptually this should be true. Unfortunately, the
Unicode people
messed up Unicode entirely and a program processing Unicode cannot
really be
completely Unicode agnostic: special treatment of combining
characters is
necessary at least.
2. If it is impossible, can I work with Unicode without OS tools?


I wouldn't call it impossible. Inconvenient may be a better term.
However,
processing of Unicode is always inconvenient. This was apparently a
major design
goal of Unicode although the stated goals were somewhat different...
3. Are there other compilers that support Unicode streams?

Standard conforming implementations at least allow processing of
Unicode by means
of the code conversion facets. However, the C++ standard does not
define which
external codes need to be supported. Internally, C++ is guaranteed to
process
wide characters. However, these may be - and on some platforms normally
are - 16-bit entities, which are not sufficient to represent every
Unicode character in one entity (Unicode code points need up to 21
bits). Of course, even 32-bit
entities would be
insufficient due to stuff messed up by the Unicode people (notably
combining
characters).

The C++ view of character processing is that each character entity
(i.e. each
'char' or 'wchar_t') represents a complete character. Possible multi
width
encodings (UTF-8, UTF-16) are transformed to or from the internal
representation
during reading or writing using the 'std::codecvt' facet (with
appropriate
template parameters). Since 'wchar_t' is often 16 bits rather than the
required 21 bits, this is somewhat hampered. Processing can still be
done using
the C++
mechanisms, e.g. using 'std::basic_string<wchar_t>' (aka
'std::wstring'), but
it becomes much more complex. Of course, the same complexity you will
find with
other processing systems, too.
4. What does the standard say about Unicode streams?

As mentioned above, there are wide character streams but the standard
does not
specifically address Unicode streams.
 

Pete Becker

Bob said:
You can have a stream of wide characters, i.e. wchar_t. Since Unicode
is not part of the C++ language, whether or not you can store ANY
Unicode string in a wide stream depends on the kind of Unicode
encoding you have chosen and whether or not each of the Unicode
characters fits into a wchar_t. Some don't and require multi-character
(3 or 4 byte) representations.

More precisely, some compilers provide a wchar_t that isn't wide enough
for 32-bit Unicode. Unlike Java, C and C++ did not fixate on 16-bit wide
characters -- the size of wchar_t is up to the implementation.
Unfortunately, some compiler writers have fixated on 16-bit wide
characters, and in that case you have to use one of the UTF-16 encodings
for Unicode.
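To make the consequence concrete: with a 16-bit wchar_t, any code point
above U+FFFF must be stored as two code units, a UTF-16 surrogate pair.
A sketch of the standard formula (the function name to_surrogates is my
own):

```cpp
#include <utility>

// Split a code point above U+FFFF into its UTF-16 surrogate pair:
// subtract 0x10000, then spread the remaining 20 bits over a high
// surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF).
std::pair<unsigned, unsigned> to_surrogates(unsigned long cp) {
    unsigned long v = cp - 0x10000;                  // 20 payload bits
    unsigned high = 0xD800 + static_cast<unsigned>(v >> 10);
    unsigned low  = 0xDC00 + static_cast<unsigned>(v & 0x3FF);
    return std::make_pair(high, low);
}
```

For example, U+10437 becomes the pair 0xD801, 0xDC37 -- two wchar_t
units for a single character.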
 

Pete Becker

Dietmar said:
The C++ view of character processing is that each character entity
(i.e. each
'char' or 'wchar_t') represents a complete character.

That's a bit too strong. char arrays are explicitly allowed to hold
multi-byte character sequences, as well as state-dependent encodings,
and the C library provides functions for manipulating those character
sequences. wchar_t, however, assumes fixed-width, stateless characters.
 

Dietmar Kuehl

Pete said:
That's a bit too strong.

I don't think it really is: OK, there are some 'char's which do not
represent a character but still there are no *character* processing
functions operating on multi-byte characters. There are a few *byte*
processing functions in the C library ('mblen()', 'mbtowc()',
'wctomb()', 'mbstowcs()', and 'wcstombs()'), and a class template in the
C++ library ('std::codecvt') for conversion between characters and
multi-byte representations thereof. I would not call the individual
bytes they process "characters" although they have the type 'char'.
Maybe this is hairsplitting but this view saved my sanity (the little
bit still remaining, that is :) when dealing with code conversions.
char arrays are explicitly allowed to hold
multi-byte character sequences, as well as state-dependent encodings,
and the C library provides functions for manipulating those character
sequences. wchar_t, however, assumes fixed-width, stateless
characters.

I think "multi-byte character sequences" pretty much gives the two
roles
of the involved sequences: on the one hand a "multi-byte" sequence
which
can be transformed to or from a "character" sequence. Calling the
individual bytes on the multi-byte side "characters" is misleading
- even if the transformation between the bytes and the characters is
the identity function.

In any case, the string, file, or character functions and/or classes
always process characters as units, at least conceptually. If a user
starts using e.g. a 'std::wstring' to hold Unicode characters, he is
probably in for a few surprises, even if 'wchar_t' is large enough to
accommodate UCS-32! For example, the 'size()' function no longer
counts the number of "glyphs" (what is normally considered to be a
character) because e.g. a u-umlaut (the second character of my last
name) is not necessarily represented by one character but possibly
encoded as the "u" character followed by the umlaut combining
character.
It is effectively an error to rip such character combinations apart, to
replace just one of them but not the other, etc. The character
functions
(e.g. character classification and manipulation) and the string
functions in the C and the C++ standard library ignore such issues
entirely. A user can take care to obey the Unicode rules but the
library
does not enforce the Unicode rules in any way.
 

Pete Becker

Dietmar said:
If a user
starts using e.g. a 'std::wstring' to hold Unicode characters, he is
probably in for a few surprises, even if 'wchar_t' is large enough to
accommodate UCS-32! For example, the 'size()' function no longer
counts the number of "glyphs" (what is normally considered to be a
character) because e.g. a u-umlaut (the second character of my last
name) is not necessarily represented by one character but possibly
encoded as the "u" character followed by the umlaut combining
character.

Unicode does not deal with glyphs. Just ask 'em! A 32 bit wide character
is large enough to hold all Unicode characters. All implementations of
Unicode have to deal with combining characters. This isn't a C++ issue.
 

Dietmar Kuehl

Pete said:
Unicode does not deal with glyphs. Just ask 'em!

Effectively, a glyph is what a user wants to see at some point and in the
description of combining characters (Unicode 4.0, section 2.10) they
definitely talk about glyphs. Also, whether they deal with them or not
is not really that relevant at all: for example, if you count the
"characters" in my name (correctly written; since enough programs get
it
wrong I use a common transformation in most electronic conversation)
you want to get four, independently of whether the "u-umlaut" Unicode
character or a "u" character and a "umlaut" combining character is
used.
If you used a 'std::wstring' to represent the Unicode characters, you
would get four or five depending on what some software chose to
represent the "u-umlaut" with.
A 32 bit wide character is large enough to hold all Unicode
characters.

I didn't dispute this. However, some Unicode sequences don't make any
sense if you rip apart certain characters, notably the combination of
a Unicode character and a following combining character (which are two
Unicode characters if I got things right).
All implementations of
Unicode have to deal with combining characters. This isn't a C++
issue.

I didn't claim that it is an issue specific to C++. I just pointed out
that the C and C++ libraries do not provide any help in processing
Unicode. In particular, the view taken by these libraries with
respect to character processing (which does not include the code
conversion facilities, IMO, as these operate on bytes rather than on
characters) is that each character is a fixed sized unit, e.g. of
type 'char' or 'wchar_t' (these two character types are directly
supported; users might choose to use e.g. 'long' if their implementation
has chosen to use a 16-bit entity for 'wchar_t', but this would imply
that they provide a whole bunch of stuff, e.g. suitable facets) and
Unicode does not exactly fit this description, not even UCS-4
(I erroneously labeled UCS-4 "UCS-32" in an earlier article). ... and
I think it *is* a C++ issue that C++ has no real Unicode support. Of
course, this *is* also an issue for various other languages - despite
the claims of some proponents of such other languages that the language
has proper Unicode support.
 

Pete Becker

Dietmar said:
I didn't dispute this. However, some Unicode sequences don't make any
sense if you rip apart certain characters, notably the combination of
a Unicode character and a following combining character (which are two
Unicode characters if I got things right).

No, that makes perfect sense: it's two Unicode characters, the first
being, say, LATIN SMALL LETTER U (0x0075), and the second being
COMBINING DIAERESIS (0x0308). If you're concerned about keeping those
two Unicode characters together, replace them with the single character
LATIN SMALL LETTER U WITH DIAERESIS (0x00fc).

The point is that in Unicode every code point (i.e. valid numeric value
in a 32-bit representation) always means the same thing; you don't have
to look at context to figure out what it means. That's the basic
requirement for wchar_t, as well. It's not the case for char, though,
because the meaning of a single code point can depend on what comes
after it (first byte in a multi-byte character) or what came before it
(with shift encodings and with the second or subsequent bytes in a
multi-byte character).

As to glyphs, they involve a great deal more than what we might call a
"letter". From the Unicode standard:

The difference between identifying a code value and rendering it
on screen or paper is crucial to understanding the Unicode
Standard's role in text processing. The character identified by
a Unicode value is an abstract entity, such as "LATIN CAPITAL
LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper,
called a glyph, is a visual representation of the character.
 

Pete Becker

Pete said:
As to glyphs, they involve a great deal more than what we might call a
"letter". From the Unicode standard:

The difference between identifying a code value and rendering it
on screen or paper is crucial to understanding the Unicode
Standard's role in text processing. The character identified by
a Unicode value is an abstract entity, such as "LATIN CAPITAL
LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper,
called a glyph, is a visual representation of the character.

Sorry, thinking too slowly today. I was trying to suggest that we use
different terminology, because "glyph" really isn't what you're talking
about. That's why I said "letter". I think it gets at what we're talking
about: 'u-umlaut', whether it's represented by two Unicode characters or
one, is a single letter, and it's not 'u'. At least, most of the time
it's not. <g>
 
