How to know the encoding of an XML file?

davisjoseph · Sep 13, 2005

Hi All,

I'm a newbie to the XML world. My problem is identifying the encoding
of an XML file at runtime. Currently I check whether the XML has a BOM
and identify the encoding from it. The problem is that some UTF-8
encoded files don't have a BOM at the start, so I identify such a file
as ISO-8859-1 when it is actually encoded in UTF-8.

I don't know much about encoding technology either.

Is there any way to identify the encoding of an XML file
programmatically? I can use the Xerces C++ library or any other free
library to identify the correct encoding. Any other workaround is also
welcome.

Thanks & Regards

Shmuel (Seymour J.) Metz · Sep 13, 2005

In <[email protected]>, on 09/13/2005 at 04:01 AM, (e-mail address removed) said:

Here is the problem, some type of UTF-8 encoded file
doesn't have BOM in the starting.

Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
bytes, such as UTF-16. UTF-8 uses 8-bit bytes.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to (e-mail address removed)

Martin Honnen · Sep 13, 2005

I'm newbie to this XML world. My problem is to identify the encoding
type of XML at runtime. What currently I'm doing is checking whether
BOM is available in the XML; based on the BOM I'm identifying the
encoding type. Here is the problem, some type of UTF-8 encoded file
doesn't have BOM in the starting. So I'm identifying the file as
iso-8859-1 encoded which is actually encoded in UTF-8.

Well, for XML there are clear rules: if there is no XML declaration
specifying the encoding, then the document can only be UTF-8 or UTF-16
encoded, and you can decide between those two by the presence or
absence of the BOM (UTF-16 always needs one; the UTF-8 BOM is optional).
So look at the BOM and the XML declaration (that is, <?xml
version="version.number" encoding="encoding-is-here"?>) to find the
encoding for XML:
<http://www.w3.org/TR/REC-xml/#charencoding>
Of course, what you really detect this way is the encoding the XML
document is supposed to be in. An XML parser then has to check that
the whole document complies with that encoding. For example, if the
XML declaration says encoding="ISO-8859-1", the XML is supposed to be
in that encoding, and a parser then checks whether it encounters any
byte sequences that can't be decoded properly using that encoding.
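
As a rough illustration of the "BOM first, then XML declaration, else
UTF-8" rule described above, here is a minimal Python sketch. The
function name sniff_xml_encoding and the returned labels are
assumptions made for this example, not part of Xerces or any other
library. It only reports the encoding the document claims to be in;
it ignores external information such as HTTP headers, and it does not
handle UTF-16 without a BOM or EBCDIC.

```python
import codecs
import re

# BOM signatures. The UTF-32 entries come first because the UTF-32LE BOM
# (FF FE 00 00) begins with the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(head: bytes) -> str:
    """Return the encoding an XML document claims, judging by its first bytes."""
    # 1. A BOM, if present, settles the question.
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # 2. Without a BOM, an ASCII-compatible document can state its
    #    encoding in the XML declaration.
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    # 3. No BOM and no declared encoding: the XML default is UTF-8.
    return "UTF-8"


if __name__ == "__main__":
    print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
    print(sniff_xml_encoding(b"<a/>"))
```

Reading the first hundred bytes or so of the file is normally enough
input for such a function. A real parser would still have to decode
the whole document with the detected encoding to verify it.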

In general there needs to be a declaration of the encoding associated
with a document (e.g. in XML in the XML declaration, in HTML in a <meta>
element, or for resources accessed via HTTP in the response header), as
there is no general algorithm to detect any encoding that exists. For
instance, you cannot detect whether a document is meant to be ISO-8859-1
encoded or ISO-8859-15 encoded: the same bytes are just interpreted as
different characters, so the document author has to declare the
encoding.
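
To make the ISO-8859-1 versus ISO-8859-15 point concrete, here is a
small Python sketch (the variable names are just for illustration).
The single byte 0xA4 decodes without error under both encodings, but
to different characters, so nothing in the byte itself tells you which
one the author meant.

```python
raw = bytes([0xA4])

# Same byte, two legal interpretations.
as_latin1 = raw.decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")   # U+20AC EURO SIGN

if __name__ == "__main__":
    print(f"ISO-8859-1:  U+{ord(as_latin1):04X}")
    print(f"ISO-8859-15: U+{ord(as_latin9):04X}")
```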

Manuel Collado · Sep 13, 2005

Shmuel (Seymour J.) Metz wrote:

In <[email protected]>, on 09/13/2005 at 04:01 AM, (e-mail address removed) said:

Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
bytes, such as UTF-16. UTF-8 uses 8-bit bytes.

In mixed Unicode/non-unicode environments the BOM helps to discriminate
between Unicode/UTF-8 files and simpler ASCII/ISO-8859-x/... text files.

Alan J. Flavell · Sep 13, 2005

Why would any UTF-8 file have a BOM?

FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

That's for encodings with 16-bit bytes, such as UTF-16.

Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
units (I'd avoid using the term "bytes"), but don't need a BOM,
because their endian-ness is specified by the name of the encoding
scheme.

Shmuel (Seymour J.) Metz · Sep 13, 2005

on 09/13/2005 said:
: > Why would any UTF-8 file have a BOM?
: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

Note that the file doesn't contain a BOM, but rather the UTF-8
encoding of a BOM. An actual BOM would not be valid UTF-8.

(I'm still waiting for hardware that increases character sizes.

For most hardware, character size is irrelevant. Some devices deal
with large blocks of data. Some deal with graphical data rather than
text. Some deal with individual bits. Keyboards deal with scan codes
rather than conventional character representations. The only common PC
peripherals that I can think of that actually deal with characters as
characters are a display adapter or printer in text mode, and those
are essentially obsolete.


Malcolm Dew-Jones · Sep 13, 2005

Alan J. Flavell ([email protected]) wrote:
: On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:

: > Why would any UTF-8 file have a BOM?

: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

: > That's for encodings with 16-bit bytes, such as UTF-16.

: Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
: units (I'd avoid using the term "bytes"), but don't need a BOM,
: because their endian-ness is specified by the name of the encoding
: scheme.

utf-16BE and utf-16LE must be using 8 bit bytes, because if they were
using true 16-bit code units then there would be no endian-ness to
consider.

(I'm still waiting for hardware that increases character sizes. They've
done it for all other elementary units on the computer, integers, memory
pointers, etc, but for some reason not this one.)

Alan J. Flavell · Sep 13, 2005

utf-16BE and utf-16LE must be using 8 bit bytes,

That's the distinction (as set out in recent Unicode terminologies)
between the Character Encoding Form (which in all these three cases is
designated utf-16, consisting of 16-bit code units), and its Character
Encoding Schemes (of which there are the three: utf-16 with BOM,
utf-16LE, and utf-16BE) for representing the 16-bit code units as an
octet stream.

See chapter 2, sections 2.5 and 2.6, e.g.
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
as well as the previously cited FAQs.

because if they were using true 16-bit code units then there would
be no endian-ness to consider.

It's unfortunate that when one reads "utf-16", without context, it is
unclear whether it's meant to refer to the CEF (and thus to comprise
all three CESs), or only to the one CES. Perhaps it's a pity
they didn't devise different designations for the CEF and for the CES
(maybe "utf-16BOM" for the CES).

(This isn't a problem for utf-8, since there is only one CES for
that particular CEF, with the BOM being optional.)
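
The three UTF-16 encoding schemes, and the optional BOM in utf-8, can
be seen directly in the bytes. The following is only a sketch using
Python's built-in codecs; note that Python's "utf-16" codec writes a
BOM followed by the platform's native byte order, and "utf-8-sig" is
Python's own name for utf-8 with a BOM, not a Unicode designation.

```python
import codecs

text = "A"  # U+0041, one 16-bit code unit in the utf-16 encoding form

# Three encoding schemes for the same 16-bit code unit 0x0041:
utf16_be = text.encode("utf-16-be")   # 00 41, no BOM, order given by the name
utf16_le = text.encode("utf-16-le")   # 41 00, no BOM, order given by the name
utf16_bom = text.encode("utf-16")     # BOM, then the code unit in native order

# utf-8 has a single byte order; the BOM is optional.
utf8_plain = text.encode("utf-8")     # 41
utf8_bom = text.encode("utf-8-sig")   # EF BB BF 41

if __name__ == "__main__":
    for label, data in [
        ("utf-16BE", utf16_be),
        ("utf-16LE", utf16_le),
        ("utf-16 with BOM", utf16_bom),
        ("utf-8", utf8_plain),
        ("utf-8 with BOM", utf8_bom),
    ]:
        print(f"{label:16} {data.hex(' ')}")
```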

(I'm still waiting for hardware that increases character sizes.

Historically, there has been at least one machine with 36-bit words
that could be used as four 9-bit units; but that's past rather than
future!

They've done it for all other elementary units on the computer,
integers, memory pointers, etc, but for some reason not this one.)

I suspect you're more interested in raising it to 16 bits (or 32) than
to some non-multiple of 8, though.

best

Alan J. Flavell · Sep 13, 2005

Note that the file doesn't contain a BOM, but rather the UTF-8
encoding of a BOM.

*No* data stream ever literally "contains" a BOM, any more than it
"contains" a copyright sign, or the letter "A" (the BOM, just like any
Unicode character, is an abstract concept): what a data stream
contains is the BOM encoded according to the appropriate "Character
Encoding Scheme". That's the whole point of the BOM, so that the
character encoding scheme can be recognised by inspecting the
encoding. So there were no surprises there.

An actual BOM would not be valid UTF-8.

An "actual BOM" is an abstract concept!

The idea of dumping the hexadecimal number x'FEFF' into a utf-8 data
stream - if that was what you had in mind - would make no sense, any
more than dumping x'00A9' into it would make any sense to represent
the copyright sign. Isn't that obvious?

Let's cut them some slack: when they say that it "contains a BOM",
they are taking it for granted that it means "appropriately encoded".
You can't put an abstract concept into a data stream *without* an
appropriate encoding, after all.
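
For what it's worth, the point is easy to check with a few lines of
Python (a sketch only; is_valid_utf8 is a helper name made up for this
example). The abstract character U+FEFF comes out as different bytes
under each encoding scheme, and the raw bytes FE FF are indeed not
valid utf-8.

```python
BOM = "\ufeff"        # the abstract character U+FEFF
COPYRIGHT = "\u00a9"  # the abstract character U+00A9

# The same abstract character, encoded per scheme:
bom_utf8 = BOM.encode("utf-8")          # EF BB BF
bom_utf16_be = BOM.encode("utf-16-be")  # FE FF
bom_utf16_le = BOM.encode("utf-16-le")  # FF FE
copyright_utf8 = COPYRIGHT.encode("utf-8")  # C2 A9, not 00 A9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(bom_utf8.hex(" "), bom_utf16_be.hex(" "), bom_utf16_le.hex(" "))
    print(is_valid_utf8(b"\xfe\xff"))  # the utf-16BE bytes dumped into utf-8
```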
