On Thu, Sep 12, 2013 at 10:25 AM, Mark Janssen
Well now, this is an area that is not actually well-defined. I would
say 16-bit Unicode is binary data if you're encoding in base 65,536,
just as 8-bit ASCII is binary data if you're encoding in base 256.
Which is to say: there is no intervening data to suggest a TYPE.
Unicode is not 16-bit any more than ASCII is 8-bit. And you used the
word "encod[e]", which is the standard way to turn Unicode into bytes
anyway. No, a Unicode string is a series of codepoints - it's more
similar to a list of ints than to a stream of bytes.
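To make that concrete, here's a small Python 3 illustration (the sample
string is arbitrary): iterating over a str gives you code points, and
bytes only appear once you encode.

s = "Σπάμ"
print([ord(c) for c in s])       # [931, 960, 940, 956] -- code points, like ints
print(s.encode("utf-8"))         # b'\xce\xa3\xcf\x80\xce\xac\xce\xbc' -- bytes
print(len(s), len(s.encode("utf-8")))   # 4 code points, 8 bytes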
And not necessarily ints, for that matter.
Let's be clear: the most obvious, simple, hardware-efficient way to
implement a Unicode string holding arbitrary characters is as an array of
32-bit signed integers restricted to the range 0x0 - 0x10FFFF. That gives
you a one-to-one mapping of int <-> code point.
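Here's a rough sketch of that view in Python, using the array module
(the sample string is made up; type code "I" is an unsigned int,
typically 32 bits on current platforms):

from array import array

s = "A\u0376\U0010FFFF"
buf = array("I", (ord(c) for c in s))       # one integer per code point
print(list(buf))                            # [65, 886, 1114111]
print("".join(chr(cp) for cp in buf) == s)  # True: the mapping is reversible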
But it's not the only way. One could implement Unicode strings using any
similar one-to-one mapping. Taking a leaf out of the lambda calculus, I
might implement each code point like this:
NULL pointer <=> Code point 0
^NULL <=> Code point 1
^^NULL <=> Code point 2
^^^NULL <=> Code point 3
and so on, where ^ means "pointer to".
Obviously this is mathematically neat, but hopelessly impractical: code
point U+10FFFF would require a chain of 1,114,111 pointers-to-pointers
before the NULL. But it would work. Or alternatively, I might
choose to use floats, mapping (say) 0.25 <=> U+0376. Or whatever.
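Just to show the pointer-chain idea really would work, here's a toy
Python sketch (the function names are invented for illustration), with
None playing the part of NULL and a one-element list playing "pointer to":

def to_chain(codepoint):
    node = None
    for _ in range(codepoint):
        node = [node]              # one level of ^ per unit
    return node

def from_chain(node):
    count = 0
    while node is not None:
        node = node[0]
        count += 1
    return count

assert from_chain(to_chain(0x0376)) == 0x0376
# ...but to_chain(0x10FFFF) builds 1,114,111 nested lists, hence "impractical".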
What we can say, though, is that to represent the full Unicode charset
requires 21 bits per code-point, although you can get away with fewer
bits if you have some out-of-band mechanism for recognising restricted
subsets of the charset. (E.g. you could use just 7 bits if you only
handled the characters in ASCII, or just 4 bits if you only cared about
the decimal digits.) In practice, computers tend to be much faster when
working with multiples of 8 bits, so we use 32 bits instead of 21. In
that sense, Unicode is a 32-bit character set.
But Unicode is absolutely not a 16-bit character set.
And of course you can use *more* bits than 21, or 32. If you had a
computer where the native word-size was (say) 50 bits, it would make
sense to use 50 bits per character.
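The arithmetic is easy to check in Python:

print((0x10FFFF).bit_length())   # 21 -- bits needed for the full Unicode range
print((127).bit_length())        # 7  -- enough for ASCII
print((9).bit_length())          # 4  -- enough for the ten decimal digits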
As for the question of "binary data versus text", well, that's a thorny
one, because really *everything* in a computer is binary data, since it's
stored using bits. But we can choose to *interpret* some binary data as
text, just as we interpret some binary data as pictures, sound files,
video, PowerPoint presentations, and so forth. A reasonable way of
defining a text file might be:
If you decode the bytes making up an alleged text file into
code-points, using the correct encoding (which needs to be
known a priori, or stored out of band somehow), then provided
that none of the code-points have Unicode General Category Cc,
Cf, Cs, Co or Cn (control, format, surrogate, private-use,
non-character/reserved), you can claim that it is at least
plausible that the file contains text.
Whether that text is meaningful is another story.
You might wish to allow Cf and possibly even Co (format and private-use),
depending on the application.
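Here's a rough Python sketch of that test (the function name and the
UTF-8 default are just illustrative; the encoding still has to be known
out of band). I've also whitelisted newline, carriage return and tab,
which are technically category Cc but obviously belong in text files:

import unicodedata

REJECT = {"Cc", "Cf", "Cs", "Co", "Cn"}   # drop Cf and/or Co if you allow them
WHITELIST = {"\n", "\r", "\t"}            # Cc, but clearly text

def plausibly_text(raw, encoding="utf-8"):
    try:
        decoded = raw.decode(encoding)
    except UnicodeDecodeError:
        return False                      # not text in that encoding at all
    return all(ch in WHITELIST or unicodedata.category(ch) not in REJECT
               for ch in decoded)

print(plausibly_text(b"Hello, world!\n"))          # True
print(plausibly_text(b"\x00\x01\x02 not text"))    # False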