String is ASCII or UTF-8?

C. Benson Manica · Mar 9, 2010

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

Alf P. Steinbach · Mar 9, 2010

* C. Benson Manica:

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

Generally, if you need 100% certainty then you can't tell the encoding from a
sequence of byte values.

However, if you know that it's EITHER ascii or utf-8 then the presence of any
value above 127 (or, for signed byte values, any negative values), tells you
that it can't be ascii, hence, must be utf-8. And since utf-8 is an extension of
ascii nothing is lost by assuming ascii in the other case. So, problem solved.
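
A minimal sketch of that check, assuming the string is available as raw bytes. It is written for Python 3 rather than the 2.4.3 in the question, and the helper name `looks_ascii` is made up for illustration:

```python
def looks_ascii(data: bytes) -> bool:
    """Return True if every byte value is in the ASCII range 0..127."""
    return all(b < 128 for b in data)


# "caf\u00e9" contains one non-ASCII character, so its UTF-8 form
# has byte values above 127.
print(looks_ascii(b"hello"))
print(looks_ascii("caf\u00e9".encode("utf-8")))
```

A True result only says the bytes are legal ASCII; the same bytes are also legal UTF-8.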

If the string represents the contents of a file then you may also look for a
UTF-8 representation of the Unicode BOM (Byte Order Mark) at the beginning. If
found then it almost surely indicates utf-8, and more expensive searching can
be avoided. It's just three bytes to check.
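
The BOM test might look like the following sketch. It assumes Python 3 and raw bytes read from the file; `has_utf8_bom` is a hypothetical helper name, while `codecs.BOM_UTF8` is the standard library constant for the three bytes EF BB BF:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the three-byte UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(b"\xef\xbb\xbfhello"))
print(has_utf8_bom(b"hello"))
```

Many UTF-8 files carry no BOM at all, so a False result says nothing about the encoding.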

Cheers & hth.,

- Alf

Tim Golden · Mar 9, 2010

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.

Obviously, you can test whether all the bytes are less than 128 which
suggests that the text is legal ASCII. But then it's also legal UTF8.
Or you can just attempt to decode and catch the exception:

try:
    unicode(text, "ascii")
except UnicodeDecodeError:
    print "Not ASCII"
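
The same idea can be extended to a second strict decode. This is only a sketch for Python 3 (where the data would be a `bytes` object rather than a 2.x `str`), and `classify` is a name invented here:

```python
def classify(data: bytes) -> str:
    """Label data 'ascii', 'utf-8', or 'unknown' by trying strict decodes.

    ASCII is tried first because every ASCII byte string is also
    valid UTF-8.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return "unknown"


print(classify(b"abc"))
print(classify("caf\u00e9".encode("utf-8")))
print(classify("caf\u00e9".encode("latin-1")))
```

A result of 'utf-8' means only that the bytes are well-formed UTF-8, not that UTF-8 is what the producer intended.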

TJG

Stef Mientki · Mar 9, 2010

* Alf P. Steinbach:

Generally, if you need 100% certainty then you can't tell the encoding
from a sequence of byte values.

However, if you know that it's EITHER ascii or utf-8 then the presence
of any value above 127 (or, for signed byte values, any negative
values), tells you that it can't be ascii,

AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.

cheers,
Stef

C. Benson Manica · Mar 9, 2010

You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.

Hm, well that's very unfortunate. I'm using a database library which
seems to assume that all strings passed to it are ASCII, and I'm
attempting to use it on two different systems - one where all strings
are ASCII, and one where they seem to be UTF-8. The strings come from
the same place, i.e. they're exclusively normal ASCII characters.
What I would want is to check once for whether the strings passed to
function foo() are ASCII or UTF-8, and if they are UTF-8 to assume that all
strings need to be decoded. So that's not possible?

Richard Brodie · Mar 9, 2010

The strings come from the same place, i.e. they're exclusively
normal ASCII characters.

In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.

C. Benson Manica · Mar 9, 2010

In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.

Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...

Robert Kern · Mar 9, 2010

AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.

No, you can't. ASCII strings only have characters in the range 0..127. You could
create Latin-1 (or any number of the 8-bit encodings out there) strings with
characters 0..255, yes, but not ASCII.
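
One way to see the difference is to try every single byte value against each codec. A small Python 3 sketch, with `decodable_byte_values` being a hypothetical helper:

```python
def decodable_byte_values(encoding: str) -> list:
    """Return the byte values 0..255 that decode on their own in encoding."""
    ok = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            continue
        ok.append(value)
    return ok


print(len(decodable_byte_values("ascii")))
print(len(decodable_byte_values("latin-1")))
```

The ASCII codec rejects every value from 128 upward, while Latin-1 accepts all 256.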

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Terry Reedy · Mar 9, 2010

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?

Utf-8 is an encoding that uses 1 to 4 bytes per character.
So it is not clear what you are asking. Alf answered one of the possible
questions.

Roel Schroeven · Mar 9, 2010

On 2010-03-09 18:31, C. Benson Manica wrote:

Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...

In that case I think you mean UTF-16 or UCS-2 instead of UTF-8. UTF-16
uses 2 or more bytes per character, UCS-2 always uses 2 bytes per
character. UTF-8 uses 1 or more bytes per character.
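
To make the byte counts concrete, here is a short Python 3 sketch. The function name `encoded_sizes` is invented for this example, and "utf-16-le" is chosen so that no BOM is added to the count:

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text for UTF-8 and UTF-16."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}


print(encoded_sizes("ABCDEFGH"))    # 8 ASCII characters
print(encoded_sizes("\u20ac"))      # euro sign
print(encoded_sizes("\U0001F600"))  # a character outside the BMP
```

An 8-character ASCII string takes 8 bytes in UTF-8 but 16 in UTF-16, which matches the doubling described above.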

If your texts are in a Western language, the second byte will be zero in
most characters; you could check for that (but note that the second byte
might be the first one in the byte stream, depending on the byte ordering).
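
A rough sketch of that zero-byte check, in Python 3. The function name and the "more than half" threshold are assumptions made for illustration, and like any heuristic it can be fooled:

```python
def guess_utf16_byte_order(data: bytes):
    """Guess 'utf-16-le' or 'utf-16-be' from where zero bytes fall, else None."""
    if len(data) < 2 or len(data) % 2:
        return None
    pairs = len(data) // 2
    even_zeros = data[0::2].count(0)  # first byte of each 2-byte unit
    odd_zeros = data[1::2].count(0)   # second byte of each 2-byte unit
    # Western text in UTF-16 has a zero high byte in most units: it is
    # the second byte in little-endian order and the first in big-endian.
    if odd_zeros > pairs // 2 and even_zeros == 0:
        return "utf-16-le"
    if even_zeros > pairs // 2 and odd_zeros == 0:
        return "utf-16-be"
    return None


print(guess_utf16_byte_order("hello".encode("utf-16-le")))
print(guess_utf16_byte_order("hello".encode("utf-16-be")))
print(guess_utf16_byte_order(b"hello!"))
```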

HTH,
Roel

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven

Martin v. Loewis · Mar 9, 2010

I can create ASCII strings containing byte values between 127 and 255.

No, you can't - or what you create wouldn't be an ASCII string, by
definition of ASCII.

Regards,
Martin

Emile van Sebille · Mar 9, 2010

On 3/9/2010 1:36 PM Stef Mientki said...

On 09-03-2010 18:36, Robert Kern wrote:

Probably, and according to wikipedia you're right.

I too looked at wikipedia, and it seems historically incomplete to me.
In particular, I looked for 'high order ascii', which, when I was
working with Basic Four in the '70's, is what they used. Essentially,
the high order bit was set for all characters to make 8A a line feed,
etc. Still the same 0..127 characters, but not really an extended ascii
which is where wikipedia forwards you to.

I remember having to strap the eighth bit high when I reused the older
line printers to get them to work.

Emile
