String is ASCII or UTF-8?

  • Thread starter C. Benson Manica

C. Benson Manica

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.
 

Alf P. Steinbach

* C. Benson Manica:
Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

Generally, if you need 100% certainty then you can't tell the encoding from a
sequence of byte values.

However, if you know that it's EITHER ascii or utf-8 then the presence of any
value above 127 (or, for signed byte values, any negative values), tells you
that it can't be ascii, hence, must be utf-8. And since utf-8 is an extension of
ascii nothing is lost by assuming ascii in the other case. So, problem solved.
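
A minimal sketch of that check, written for Python 3 where the data is a bytes object (on Python 2.4 a plain str plays that role and each element would need ord()). The function names are made up for illustration, and the result only means something under the assumption stated above, namely that the data is either ascii or utf-8:

```python
def has_high_byte(data: bytes) -> bool:
    """Return True if any byte value is above 127, so the data cannot be ASCII."""
    return any(byte > 127 for byte in data)


def guess_encoding(data: bytes) -> str:
    # Hypothetical helper: valid only if the data is known to be
    # EITHER ascii OR utf-8, as assumed in the paragraph above.
    return "utf-8" if has_high_byte(data) else "ascii"
```

For pure ASCII input the answer "ascii" is also a correct utf-8 answer, which is why nothing is lost in that case.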

If the string represents the contents of a file then you may also look for a
UTF-8 representation of the Unicode BOM (Byte Order Mark) at the beginning. If
found, it almost certainly indicates utf-8, and more expensive searching can
be avoided. It's just three bytes to check.
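
A short sketch of the BOM check in Python 3 syntax. The standard library exposes the three bytes as codecs.BOM_UTF8; the two function names are invented for this example. A missing BOM proves nothing, since most utf-8 files do not carry one:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    # Drop the marker if present so it does not end up in the decoded text.
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```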


Cheers & hth.,

- Alf
 

Tim Golden

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.

Obviously, you can test whether all the bytes are less than 128 which
suggests that the text is legal ASCII. But then it's also legal UTF8.
Or you can just attempt to decode and catch the exception:

try:
    unicode(text, "ascii")
except UnicodeDecodeError:
    print "Not ASCII"
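
The same decode-and-catch idea can be extended into a three-way heuristic. This is only a sketch in Python 3 syntax, with a made-up function name; "ascii" here means the bytes are also valid utf-8, and "unknown" just means neither decode succeeded:

```python
def classify(data: bytes) -> str:
    """Heuristic: try the stricter encoding first, then the broader one."""
    try:
        data.decode("ascii")
        return "ascii"  # every byte is below 128, so this is also valid utf-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"  # e.g. Latin-1 or some other 8-bit encoding
```

A successful utf-8 decode is strong evidence but not proof, since other encodings can occasionally produce byte sequences that happen to be valid utf-8.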


TJG
 

Stef Mientki

* C. Benson Manica:

Generally, if you need 100% certainty then you can't tell the encoding
from a sequence of byte values.

However, if you know that it's EITHER ascii or utf-8 then the presence
of any value above 127 (or, for signed byte values, any negative
values), tells you that it can't be ascii,

AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.

cheers,
Stef
 

C. Benson Manica

You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF8-encoded text since UTF-8 isn't "two bytes per char" but a variable
number of bytes per char.

Hm, well that's very unfortunate. I'm using a database library which
seems to assume that all strings passed to it are ASCII, and I'm
attempting to use it on two different systems - one where all strings
are ASCII, and one where they seem to be UTF-8. The strings come from
the same place, i.e. they're exclusively normal ASCII characters.
What I would want is to check once for whether the strings passed to
function foo() are ASCII or UTF-8, and if they are to assume that all
strings need to be decoded. So that's not possible?
 

Richard Brodie

The strings come from the same place, i.e. they're exclusively
normal ASCII characters.

In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.
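
To illustrate the no-op, here is a small Python 3 check (the helper name is hypothetical): for text made only of ASCII characters, encoding as ascii and as utf-8 gives identical bytes.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    # Raises UnicodeEncodeError if text contains non-ASCII characters,
    # in which case the comparison is meaningless anyway.
    return text.encode("ascii") == text.encode("utf-8")
```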
 

C. Benson Manica

In this case then converting them to/from UTF-8 is a no-op, so
it makes no difference at all.

Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...
 

Robert Kern

AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / character.
I can create ASCII strings containing byte values between 127 and 255.

No, you can't. ASCII strings only have characters in the range 0..127. You could
create Latin-1 (or any number of the 8-bit encodings out there) strings with
characters 0..255, yes, but not ASCII.
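
A quick way to see the difference, sketched in Python 3 with an invented helper name: all 256 byte values decode as Latin-1, but only the first 128 decode as ASCII.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


all_bytes = bytes(range(256))            # every possible byte value once
as_latin1 = all_bytes.decode("latin-1")  # always succeeds: one char per byte
```

With these definitions, decodes_as(all_bytes, "ascii") is False while decodes_as(all_bytes[:128], "ascii") is True.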

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Terry Reedy

Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?

UTF-8 is an encoding that uses 1 to 4 bytes per character.
So it is not clear what you are asking. Alf answered one of the possible
questions.
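
The variable width is easy to see in Python 3. The four sample characters below are chosen for illustration only (a Latin letter, an accented letter, the euro sign and an emoji):

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as utf-8.
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)  # [1, 2, 3, 4]
```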
 

Roel Schroeven

On 2010-03-09 18:31, C. Benson Manica wrote:
Except to the database library, which seems perfectly happy to send an
8-character UTF-8 string to the database as 16 raw characters...

In that case I think you mean UTF-16 or UCS-2 instead of UTF-8. UTF-16
uses 2 or 4 bytes per character, UCS-2 always uses 2 bytes per
character. UTF-8 uses 1 to 4 bytes per character.

If your texts are in a Western language, the second byte will be zero in
most characters; you could check for that (but note that the second byte
might be the first one in the byte stream, depending on the byte ordering).
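
A rough sketch of that zero-byte check in Python 3. The function name is made up, and the test is stricter than "most characters": it requires every character to fit in one byte, so it only suits text known to be in the ASCII or Latin-1 range.

```python
def guess_utf16_byte_order(data: bytes):
    """Return 'little', 'big', or None if data does not look like such UTF-16."""
    if len(data) < 2 or len(data) % 2 != 0:
        return None
    if all(b == 0 for b in data[1::2]):  # zero is the second byte of each pair
        return "little"
    if all(b == 0 for b in data[0::2]):  # zero is the first byte of each pair
        return "big"
    return None


eight_chars = "database"
raw = eight_chars.encode("utf-16-le")  # 8 characters become 16 bytes
```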

HTH,
Roel

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 

Martin v. Loewis

I can create ASCII strings containing byte values between 127 and 255.

No, you can't - or what you create wouldn't be an ASCII string, by
definition of ASCII.

Regards,
Martin
 

Emile van Sebille

On 3/9/2010 1:36 PM Stef Mientki said...
On 09-03-2010 18:36, Robert Kern wrote:

Probably, and according to wikipedia you're right.

I too looked at wikipedia, and it seems historically incomplete to me.
In particular, I looked for 'high order ascii', which, when I was
working with Basic Four in the '70's, is what they used. Essentially,
the high order bit was set for all characters to make 8A a line feed,
etc. Still the same 0..127 characters, but not really an extended ascii
which is where wikipedia forwards you to.

I remember having to strap the eighth bit high when I reused the older
line printers to get them to work.
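
For the curious, the bit manipulation involved can be sketched in Python 3 as below. The function names are invented for this example, and this only illustrates the "eighth bit set" convention described above, not any particular system's format:

```python
def set_high_bit(data: bytes) -> bytes:
    # Turn 7-bit ASCII into the "high order" form: 0x0A (line feed) -> 0x8A.
    return bytes(b | 0x80 for b in data)


def clear_high_bit(data: bytes) -> bytes:
    # Mask off the eighth bit to get back plain 0..127 ASCII.
    return bytes(b & 0x7F for b in data)
```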

Emile
 
