S
Steven D'Aprano
The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisory
to go for python 3.x. The biggest problem that I am facing is, that I am
often dealing with data, that is basically text, but it can contain
8-bit bytes.
All bytes are 8-bit, at least on modern hardware. I think you have to go
back to the 1950s to find 10-bit or 12-bit machines.
In this case, I can not safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.
Well you can't do that, because *by definition* you are changing a
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is,
*how* do you want to change them?
You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:
bytes = string.encode('ascii', 'replace')
bytes = string.encode('ascii', 'ignore')
When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which essentially cannot fail:
string = bytes.decode('latin1')
although the non-ASCII chars that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you don't
care too much, this may be a simple approach.
But in a nutshell, it is physically impossible to map the millions of
Unicode characters to just 256 possible bytes without either throwing
some characters away, or performing an encoding.
As it seems, this would be far easier with python 2.x.
It only seems that way until you try.