[Q] Text vs Binary Files

Richard Tobin

Don't want to be seen to be supporting XML here, but doesn't the
UTF-16 standard define byte ordering?

No. There are names for the encodings corresponding to
big-endian-UTF-16 and little-endian-UTF-16, but UTF-16 itself can be
stored in either order.

XML processors can distinguish between them easily because any XML
document not in UTF-8 must begin with a less-than or a byte-order mark
(unless some external indication of encoding is given).

-- Richard
 
Richard Tobin

Malcolm Dew-Jones said:
You can only have byte order issues when you store the UTF-16 as 8 bit
bytes.

Which is to say, always in practice.

-- Richard
 
Jeff Brooks

Malcolm said:
Jeff Brooks wrote:
: Rolf Magnus wrote:
: > Arthur J. O'Dwyer wrote:
: >
: >>On Thu, 27 May 2004, Eric wrote:
: >>
: >>>Assume that disk space is not an issue [...]
: >>>Assume that transportation to another OS may never occur.
: >>>Are there any solid reasons to prefer text files over binary files?
: >>>
: >>>Some of the reasons I can think of are:
: >>>
: >>>-- should transportation to another OS become useful or needed,
: >>> the text files would be far easier to work with
: >>
: >> I would guess this is wrong, in general. Think of the difference
: >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
: >>file (hint: linefeeds and carriage returns).
: >
: > Linefeeds and carriage returns don't matter in XML. The other
: > differences are ruled out by specifying the encoding. Any XML parser
: > should understand utf-8.

: Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
: has byte ordering issues.

You can only have byte order issues when you store the UTF-16 as 8 bit
bytes. But a stream of 8 bit bytes is _not_ UTF-16, which by definition
is a stream of 16 bit entities, so it is not the UTF-16 that has byte
order issues.

http://www.unicode.org/unicode/faq/utf_bom.html#37

Jeff Brooks
 
Ben Measures

Jeff said:
Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
has byte ordering issues. Writing a UTF-16 file on different CPUs can
result in text files that are different. This can be resolved because of
the encoding the UTF standards use, but it means that any true XML
parser must deal with big-endian and little-endian issues.

"All XML processors MUST accept the UTF-8 and UTF-16 encodings of
Unicode 3.1"
- http://www.w3.org/TR/REC-xml/#charsets

"Entities encoded in UTF-16 MUST [snip] begin with the Byte Order Mark
described by section 2.7 of [Unicode3]"
- http://www.w3.org/TR/REC-xml/#charencoding

This makes it trivial to overcome any endian issues, and since endian
issues are so fundamental I don't see it as making XML any less portable.
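
As a rough illustration of why the byte order mark makes this easy, here is a minimal C sketch that inspects the first two bytes of an entity and then assembles 16-bit code units in the detected order. The names utf16_order, detect_utf16_order and read_utf16_unit are hypothetical and not taken from any XML library, and the sketch assumes the caller has already ruled out UTF-32 (whose little-endian mark also starts with FF FE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not part of any real XML library. */
enum utf16_order { UTF16_UNKNOWN, UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Look for the byte order mark U+FEFF in the first two bytes.
   Assumption: UTF-32 has been ruled out by the caller. */
enum utf16_order detect_utf16_order(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return UTF16_UNKNOWN;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return UTF16_BIG_ENDIAN;      /* bytes FE FF */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return UTF16_LITTLE_ENDIAN;   /* bytes FF FE */
    return UTF16_UNKNOWN;
}

/* Assemble one 16-bit code unit from two bytes in the given order.
   Anything other than little-endian is read as big-endian. */
unsigned int read_utf16_unit(const unsigned char *p, enum utf16_order order)
{
    assert(p != NULL);
    if (order == UTF16_LITTLE_ENDIAN)
        return ((unsigned int)p[1] << 8) | p[0];
    return ((unsigned int)p[0] << 8) | p[1];
}
```

Because the code only ever looks at individual bytes, it behaves the same regardless of the endianness of the machine running it.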
 
Michael Wojcik

[Followups restricted to comp.programming.]

"All XML processors MUST accept the UTF-8 and UTF-16 encodings of
Unicode 3.1"
- http://www.w3.org/TR/REC-xml/#charsets

"The primary feature of Unicode 3.1 is the addition of 44,946 new
encoded characters. ...

For the first time, characters are encoded beyond the original 16-bit
codespace or Basic Multilingual Plane (BMP or Plane 0). These new
characters, encoded at code positions of U+10000 or higher, are
synchronized with the forthcoming standard ISO/IEC 10646-2."
- http://www.unicode.org/reports/tr27/

The majority of XML parsers only use 16-bit characters. This means that
the majority of XML parsers can't actually read XML.

I don't believe this is correct. UTF-16 encodes characters in U+10000
- U+10FFFF as surrogate pairs. None of the surrogate code points match
any of the scalar code points, so there's no ambiguity - all surrogate
pairs are composed of 16-bit values that can't be mistaken for scalar
UTF-16 characters.
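
To make the surrogate ranges concrete, here is a small C sketch of how a pair of 16-bit units maps to a code point at or above U+10000. The helper names are hypothetical; a parser that merely passes the two units through untouched never needs the combining step.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only. */

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
int is_high_surrogate(unsigned int u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
int is_low_surrogate(unsigned int u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid surrogate pair into a code point in U+10000..U+10FFFF.
   Each surrogate carries 10 bits of the value (code point - 0x10000). */
unsigned long combine_surrogates(unsigned int high, unsigned int low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000UL
         + (((unsigned long)(high - 0xD800) << 10)
            | (unsigned long)(low - 0xDC00));
}
```

For example, the pair 0xD834 0xDD1E combines to U+1D11E. Since the two ranges do not overlap each other or any non-surrogate value, a lone 16-bit unit can always be classified without looking at its neighbours.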

As long as the parser processes the surrogate pair without altering
it and recognizes it unambiguously, the parser would seem to be
complying with the XML specification. None of those characters (in
their surrogate-pair UTF-16 representation or any other) has any
special meaning in XML, so a parser that treated the surrogate pair
as a pair of 16-bit characters should do just fine.

In other words, the parser doesn't have to recognize that characters
from U+10000 and up (in their surrogate-pair encoding) are special,
because to it they aren't special.

The only case that immediately comes to mind where the distinction
would matter is if the parser had an API that returned data character-
by-character, which should have special provisions for surrogate
pairs (or be documented as returning them in halves). However, I've
not seen such a parser, AFAIK, and I don't know why one would provide
such an API.

Or, I suppose, if the parser offered to transform the document data
among various supported encodings. In that case, not handling UTF-16
surrogate pairs would indeed be a bug. On the other hand, I'm not
sure such transformations are necessarily the job of an XML parser;
that could be considered a bug in a set of additional utilities
provided alongside the parser.
 
Donald Roby

*Again* I urge the consultation of the RFCs defining any standard
binary file format, and the notice of the complete lack of regard
for big-endian/little-endian/19-bit-int/37-bit-int issues. At the
byte level, these things simply never come up.

Try (for example) RFC 1314.

These things certainly do come up, and they're handled by encoding the
rules in a header of the format.
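
A sketch of what "encoding the rules in a header" looks like for TIFF, assuming the usual TIFF header layout: two identical bytes, 0x49 0x49 ("II") for little-endian or 0x4D 0x4D ("MM") for big-endian, followed by the 16-bit value 42 stored in that order. The function name is hypothetical, and the byte values are written numerically so the check does not depend on the execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the header declares little-endian
   data, 0 if it declares big-endian data, and -1 if the first four
   bytes do not look like a TIFF header. */
int tiff_is_little_endian(const unsigned char *hdr, size_t len)
{
    int little;
    unsigned int magic;

    assert(hdr != NULL || len == 0);
    if (len < 4)
        return -1;

    if (hdr[0] == 0x49 && hdr[1] == 0x49)        /* "II" */
        little = 1;
    else if (hdr[0] == 0x4D && hdr[1] == 0x4D)   /* "MM" */
        little = 0;
    else
        return -1;

    /* The magic number 42 is stored in the byte order just declared. */
    magic = little ? (((unsigned int)hdr[3] << 8) | hdr[2])
                   : (((unsigned int)hdr[2] << 8) | hdr[3]);
    return magic == 42 ? little : -1;
}
```

The reader decides the byte order from the file alone; the byte order of the machine running the reader never enters into it.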
 
Arthur J. O'Dwyer

Try (for example) RFC 1314.

[RFC defining among other things a subset(?) of the TIFF image
file format]
These things certainly do come up, and they're handled by
encoding the rules in a header of the format.

Not really. TIFF /is/ weird in that it explicitly provides
both a "big-endian" format and a "little-endian" format, and TIFF
readers have to provide routines to read both formats. But the
endianness/word size of the machine never comes up. If it did,
we wouldn't be able to write TIFF writers or readers that worked
on platforms with different endiannesses. (IIRC, this whole thread
was started way back in the mists of time with the idea that

fputs("42000\n", fp);

produces different results on different machines (because of the
embedded newline, which produces different bytes on different
systems; not to mention the possibility of EBCDIC!), while

/* unsigned long is at least 32 bits, so shifting by 24 is well defined */
unsigned long result = 42000UL;
unsigned char buffer[4];
buffer[0] = (result>>24)&0xFF;
buffer[1] = (result>>16)&0xFF;
buffer[2] = (result>>8)&0xFF;
buffer[3] = (result>>0)&0xFF;
fwrite(buffer, 1, 4, fp);

produces the exact same bytes on every platform. Thus "binary
is better than text" if you care about portability more than
human-readability.)

But since we already had that discussion (several months ago,
IIRC), I'm not going to get back into it.

-Arthur,
signing off
 
