"convert" string to bytes without changing data (encoding)

Ross Ridge

Chris Angelico said:
Actually, he is justified. It's one thing to work in C or assembly and
write code that depends on certain bit-pattern representations of data
(although even that causes trouble - assuming that
sizeof(int)==sizeof(int*) isn't good for portability), but in a high
level language, you cannot assume any correlation between objects and
bytes. Any code that depends on implementation details is risky.

How does that in any way justify Evan Driscoll maliciously lying about
code he's never seen?

Ross Ridge
 
Mark Lawrence

How does that in any way justify Evan Driscoll maliciously lying about
code he's never seen?

Ross Ridge

We appear to have a case of "would you stand up please, your voice is
rather muffled". I can hear all the *plonks* from miles away.
 
Steven D'Aprano

How does that in any way justify Evan Driscoll maliciously lying about
code he's never seen?

You are perfectly justified to complain about Evan making sweeping
generalisations about your code when he has not seen it; you are NOT
justified in making your own sweeping generalisations that he is not just
lying but *maliciously* lying. He might be just confused by the strength
of his emotions and so making an honest mistake. Or he might have guessed
perfectly accurately about your code, and you are the one being
dishonest. Who knows?

Evan's impassioned rant is based on his estimate of your mindset, namely
that you are the sort of developer who writes code making assumptions
about implementation details even when explicitly told not to by the
library authors. I have no idea whether Evan's estimate is right or not,
but I don't think it is justified based on the little amount we've seen
of you.

Your reaction is to make an equally unjustified estimate of Evan's
mindset, namely that he is not just wrong about you, but *deliberately
and maliciously* lying about you in the full knowledge that he is wrong.
If anything, I would say that you have less justification for calling
Evan a malicious liar than he has for calling you the sort of person who
would write to an implementation instead of an interface.
 
Peter Daum

2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
chars. When done, encode back to 'latin-1' and the non-ascii chars will
be as they originally were.

.... actually, at the beginning of my quest, I ran into a decoding
exception trying to read data as "latin1" (which was more or less what
I had expected anyway because byte values between 128 and 160 are not
defined there).

Obviously, I must have misinterpreted something there;
I just ran a little test:

l=[i for i in range(256)]; b=bytes(l)
s=b.decode('latin1'); b=s.encode('latin1'); s=b.decode('latin1')
for c in s:
    print(hex(ord(c)), end=' ')
    if (ord(c)+1) % 16 ==0: print("")
print()

.... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems I ran
into then were pretty real, though ;-)

3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
(using the surrogate-pair second-half code units). This is probably the
safest in that invalid operations on the non-chars should raise an
exception. Re-encoding with the same setting will reproduce the original
hi-bit chars. The main danger is passing the illegal strings out of your
local sandbox.

Unfortunately, this is a very well-kept secret unless you know that
something with that name exists. The options currently mentioned in the
documentation are not really helpful, because the non-decodable bytes
will be lost. With some trying, I got it to work, too (the option is named
"surrogateescape" without the "_", and in Python 3.1 it exists, but
not as a keyword argument: "s=b.decode('utf-8','surrogateescape')" ...)
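
For readers following along, the round trip described here can be sketched roughly as below. This is only an illustration, assuming Python 3.1 or later; the sample bytes and the variable names are made up for the example and are not from the original post.

```python
# Hypothetical sample: ASCII text containing two bytes that are not valid ASCII.
raw = b"abc\xff\xfedef"

# With "surrogateescape", undecodable bytes become lone surrogates
# (U+DC80..U+DCFF) instead of raising an exception.
text = raw.decode("ascii", "surrogateescape")

# Ordinary string operations still work on the decodable part.
upper = text.upper()

# Encoding with the same error handler restores the original bytes.
restored = text.encode("ascii", "surrogateescape")

# A strict encode refuses the escaped characters, which is the safety net
# against leaking them out of the local sandbox.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The error handler is passed positionally here, matching the form that worked in the post.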

Thank you very much for your constructive advice!

Regards,
Peter
 
Ross Ridge

Steven D'Aprano said:
Your reaction is to make an equally unjustified estimate of Evan's
mindset, namely that he is not just wrong about you, but *deliberately
and maliciously* lying about you in the full knowledge that he is wrong.

No, Evan in his own words admitted that his post was meant to be harsh,
"a bit harsher than it deserves", showing his malicious intent. He made
accusations that were neither supported by anything I've said in this
thread nor by the code I actually write. His accusations about me were
completely made up, he was not telling the truth and had no reasonable
basis to believe he was telling the truth. He was maliciously lying and
I'm completely justified in saying so.

Just to make it clear to all you zealots. I've not once advocated writing
any sort of "risky code" in this thread. I have not once advocated writing
any style of code in this thread. Just because I refuse to drink the "it's
impossible to represent strings as a series of bytes" kool-aid doesn't mean
that I'm a heretic that must oppose everything you believe in.

Ross Ridge
 
Evan Driscoll

I don't see how you could feel the least bit justified. Well meaning,
if unhelpful, lies about the nature of Python strings in order to try to
convince someone to follow what you think are good programming practices
is one thing. Maliciously lying about someone else's code that you've
never seen is another thing entirely.

I'm not even talking about code that you or the OP has written. I'm
talking about your suggestion that

I can in fact say what the internal byte string representation
of strings is in any given build of Python 3.

Aside from the questionable truth of this assertion (there's no
guarantee that an implementation uses one encoding or data structure
consistently), that's of no consequence because you can't depend on
what the representation is. So why even bring it up?

Also irrelevant is:

In practice the number of ways that CPython (the only Python 3
implementation) represents strings is much more limited.
Pretending otherwise really isn't helpful.

If you can't depend on CPython's implementation (and, I would argue,
your code is broken if you do), then it *is* helpful. Saying that "you
can just look at what CPython does" is what is unhelpful.


That said, looking again I did misread your post that I sent that harsh
reply to; I was looking at it perhaps a bit too much through the lens of
the CPython comment I quoted above, and interpreting it as "I can say what
the internal representation is of CPython, so just give me that" and
launched into my spiel. If that's not what was intended, I retract my
statement. As long as everyone is clear on the fact that Python 3
implementations can use whatever encoding and data structures they want,
perhaps even different encodings or data structures for equal strings,
and that as a consequence saying "what's the internal representation of
this string" is a meaningless question as far as Python itself is
concerned, I'm happy.

Evan
 
Terry Reedy

No, Evan in his own words admitted that his post was meant to be harsh,

I agree that he should have restrained and censored his writing.

Just because I refuse to drink the
"it's impossible to represent strings as a series of bytes" kool-aid

I do not believe *anyone* has made that claim. Is this meant to be a
wild exaggeration? As wild as Evan's?

In my first post on this thread, I made three truthful claims.

1. A 3.x text string is logically a sequence of unicode 'characters'
(codepoints).

2. The Python language definition does not require that a string be
bytes or become bytes unless and until it is explicitly encoded.

3. The intentionally hidden byte implementation of strings on byte
machines is version and system dependent. The bytes used for a
particular character are (in 3.3) context dependent.
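
To illustrate claims 1 and 2, here is a small sketch showing that one string has no single byte form until an encoding is chosen. The sample string and variable names are only illustrative.

```python
# Seven code points: 'n', 'a', U+00EF, 'v', 'e', space, U+20AC.
s = "na\u00efve \u20ac"

# The same sequence of code points, explicitly encoded three ways.
as_utf8 = s.encode("utf-8")
as_utf16 = s.encode("utf-16-le")
as_utf32 = s.encode("utf-32-le")

# Length in code points, then in bytes for each encoding.
lengths = (len(s), len(as_utf8), len(as_utf16), len(as_utf32))

# Each byte form decodes back to an equal string.
all_equal = (
    as_utf8.decode("utf-8") == s
    and as_utf16.decode("utf-16-le") == s
    and as_utf32.decode("utf-32-le") == s
)
```

None of these byte sequences is "the" string; each is just one encoding of it.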

As it turns out, the OP had mistakenly assumed that the hidden byte
implementation of 3.3 strings was both well-defined and something
(utf-8) that it is not and (almost certainly) never will be. Guido and
most other devs strongly want string indexing (and hence slice endpoint
finding) to be O(1).

So all of the above is moot as far as the OP's problem is concerned. I
already gave him the three standard solutions.
 
Prasad, Ramit

Technically, ASCII goes up to 256 but they are not A-z letters.
Technically, ASCII is 7-bit, so it goes up to 127.

No, ASCII only defines 0-127. Values >=128 are not ASCII.


ASCII includes definitions for 128 characters: 33 are non-printing
control characters (now mostly obsolete) that affect how text and
space is processed and 95 printable characters, including the space
(which is considered an invisible graphic).


Doh! I was mistaking extended ASCII for ASCII. Thanks for the
correction.

Ramit


-----Original Message-----
On Behalf Of MRAB
Sent: Wednesday, March 28, 2012 2:50 PM
Subject: Re: "convert" string to bytes without changing data (encoding)

It might be technically possible to recreate internal implementation,
or get the byte data. That does not mean it will make any sense or
be understood in a meaningful manner. I think Ian summarized it
very well:
 
Ross Ridge

Ross said:
Just because I refuse to drink the
"it's impossible to represent strings as a series of bytes" kool-aid

Terry Reedy said:
I do not believe *anyone* has made that claim. Is this meant to be a
wild exaggeration? As wild as Evan's?

Sorry, it would've been more accurate to label the flavour of kool-aid
Chris Angelico was trying to push as "it's impossible ... without
encoding":

What is a string? It's not a series of bytes. You can't convert
it without encoding those characters into bytes in some way.

In my first post on this thread, I made three truthful claims.

I'm not objecting to every post made in this thread. If your post had
been made before the original poster had figured it out on his own,
I would've hoped he would have found it much more convincing than what
I quoted above.

Ross Ridge
 
Chris Angelico

Sorry, it would've been more accurate to label the flavour of kool-aid
Chris Angelico was trying to push as "it's impossible ... without
encoding":

       What is a string? It's not a series of bytes. You can't convert
       it without encoding those characters into bytes in some way.

I still stand by that statement. Do you try to convert a "dictionary
of filename to open file object" into a "series of bytes" inside
Python? It doesn't matter that, on some level, it's *stored as* a
series of bytes; the actual object *is not* a series of bytes. There
is no logical equivalency, ergo it is illogical and nonsensical to
expect to turn one into the other without some form of encoding.
Python does include an encoding that can handle lists and
dictionaries. It's called Pickle, and it returns (in Python 3) a bytes
object - which IS a series of bytes. It doesn't simply return some
internal representation.
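
A minimal sketch of the point about Pickle, assuming nothing beyond the standard library; the sample dictionary is made up for the example.

```python
import pickle

# A hypothetical structure of lists and dictionaries.
data = {"names": ["a", "b"], "count": 2}

# pickle.dumps encodes the object into a bytes object...
blob = pickle.dumps(data)

# ...and pickle.loads decodes those bytes back into an equal object.
restored = pickle.loads(blob)

blob_type = type(blob).__name__
```

The bytes in `blob` follow the pickle protocol; they are an encoding of the dictionary, not a dump of how the interpreter happens to hold it in memory.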

ChrisA
 
Steven D'Aprano

Doh! I was mistaking extended ASCII for ASCII. Thanks for the
correction.

There actually is no such thing as "extended ASCII" -- there is a whole
series of many different "extended ASCIIs". If you look at the encodings
available in (for example) Thunderbird, many of the ISO-8859-* and
Windows-* encodings are "extended ASCII" in the sense that they extend
ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a
different way (hence they are different encodings).
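
As a quick illustration of "they all extend ASCII in a different way", the sketch below decodes the same high byte with three codecs that ship with Python. The choice of byte and codecs is only an example.

```python
# One byte above the ASCII range.
high = b"\xa4"

codecs_to_try = ("latin-1", "iso-8859-15", "cp437")

# The same byte maps to a different character in each "extended ASCII".
decoded = {name: high.decode(name) for name in codecs_to_try}

# Bytes in the ASCII range (0-127) decode identically in all of them.
ascii_agrees = all(b"A~".decode(name) == "A~" for name in codecs_to_try)
```

Here ISO-8859-1 gives the currency sign U+00A4, while ISO-8859-15 gives the euro sign U+20AC for the very same byte.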
 
Steven D'Aprano

No, Evan in his own words admitted that his post was meant to be harsh,
"a bit harsher than it deserves", showing his malicious intent.

Being harsher than it deserves is not synonymous with malicious. You are
making assumptions about Evan's mental state that are not supported by
the evidence. Evan may believe that by "punishing" (for some feeble sense
of punishment) you harshly, he is teaching you better behaviour that will
be to your own benefit; or that it will act as a warning to others.
Either way he may believe that he is actually doing good.

And then he entirely undermined his own actions by admitting that he was
over-reacting. This suggests that, in fact, he wasn't really motivated by
either malice or beneficence but mere frustration.

It is quite clear that Evan let his passions about writing maintainable
code get the best of him. His rant was more about "people like you" than
you personally.

Evan, if you're reading this, I think you owe Ross an apology for flying
off the handle. Ross, I think you owe Evan an apology for unjustified
accusations of malice.

He made
accusations that were neither supported by anything I've said

Now that is not actually true. Your posts have defended the idea that
copying the raw internal byte representation of strings is a reasonable
thing to do. You even claimed to know how to do so, for any version of
Python (but so far have ignored my request for you to demonstrate).

in this
thread nor by the code I actually write. His accusations about me were
completely made up, he was not telling the truth and had no reasonable
basis to believe he was telling the truth. He was maliciously lying and
I'm completely justified in saying so.

No, they were not completely made up. Your posts give many signs of being
somebody who might very well write code to the implementation rather than
the interface. Whether you are or not is a separate question, but your
posts in this thread indicate that you very likely could be.

If this is not the impression you want to give, then you should
reconsider your posting style.

Ross, to be frank, your posting style in this thread has been cowardly
and pedantic, an obnoxious combination. Please take this as constructive
criticism and not an attack -- you have alienated people in this thread,
leading at least one person to publicly kill-file your future posts. I
choose to assume you aren't aware of why that is, rather than that you
are doing so deliberately.

Without actually coming out and making a clear, explicit statement that
you approve or disapprove of the OP's attempt to use implementation
details, you *imply* support without explicitly giving it; you criticise
others for saying it can't be done without demonstrating that it can be
done. If this is a deliberate rhetorical trick, then shame on you for
being a coward without the conviction to stand behind concrete
expressions of your opinion. If not, then you should be aware that you
are using a rhetorical style that will make many people predisposed to
think you are a twat.

You *might* have said

Guys, you're technically wrong about this. This is how you can
retrieve the internal representation of a string as a sequence
of bytes: ...code... but you shouldn't use this in production
code because it is fragile and depends on implementation details
that may break in PyPy and Jython and IronPython.

But you didn't.

You *might* have said

Wrong, you can convert a string into a sequence of bytes without
encoding or decoding: ...code... but don't do this.

But you didn't.

Instead you puffed yourself up as a big shot who was more technically
correct than everyone else, but without *actually* demonstrating that you
can do what you said you can do. You labelled as "bullshit" our attempts
to discourage the OP from his misguided approach.

If your intention was to put people off-side, you succeeded very well. If
not, you should be aware that you have, and consider how you might avoid
this in the future.
 
Michael Ströder

Steven said:
There actually is no such thing as "extended ASCII" -- there is a whole
series of many different "extended ASCIIs". If you look at the encodings
available in (for example) Thunderbird, many of the ISO-8859-* and
Windows-* encodings are "extended ASCII" in the sense that they extend
ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a
different way (hence they are different encodings).

Yupp.

Looking at RFC 1345 some years ago (while having to deal with EBCDIC) made
this all pretty clear to me. I appreciate that someone did this heavy work of
collecting historical encodings.

Ciao, Michael.
 
Serhiy Storchaka

28.03.12 21:13, Heiko Wundram wrote:
Reading from stdin/a file gets you bytes, and
not a string, because Python cannot automagically guess what format the
input is in.

In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer.raw
for access to the byte stream. And reading from a file opened in text mode
gets you a string too.
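
The text layer and the byte layer can be sketched without touching the real stdin. The example below builds a stand-in stream the same way sys.stdin is normally layered (a text wrapper over a binary buffer); the names `text_layer` and `byte_layer` and the sample data are hypothetical.

```python
import io

# Sample input as it would arrive on the wire: UTF-8 encoded bytes.
raw_bytes = "gr\u00fc\u00df\n".encode("utf-8")

# Reading through the text wrapper yields str, decoded with the given encoding.
text_layer = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8")
line = text_layer.readline()

# Reading through the wrapper's .buffer attribute yields the bytes untouched.
byte_layer = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8")
chunk = byte_layer.buffer.read()
```

With the real stream, the corresponding calls would be `sys.stdin.readline()` for text and `sys.stdin.buffer.read()` for bytes.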
 
Chris Angelico

28.03.12 21:13, Heiko Wundram wrote:

In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer.raw for
access to the byte stream. And reading from a file opened in text mode gets you
a string too.

True. But that's only if it's been told the encoding of stdin (which I
believe is the normal case on Linux). It's still not "automagically
guess(ing)", it's explicitly told.

ChrisA
 
Piet van Oostrum

Ross Ridge said:
But it is in fact only stored in one particular way, as a series of bytes.

No, it can be stored in different ways. Certainly in Python 3.3 and
beyond. And in 3.2 also, depending on wide/narrow build.
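
One way to observe this from the outside, without relying on it, is to compare object sizes. The sketch below assumes CPython 3.3 or later; the numbers are an implementation detail and other implementations may behave differently.

```python
import sys

# Three strings with the same number of code points.
ascii_text = "a" * 1000
bmp_text = "\u20ac" * 1000          # euro sign, needs more than one byte
astral_text = "\U0001F600" * 1000   # outside the Basic Multilingual Plane

lengths = [len(t) for t in (ascii_text, bmp_text, astral_text)]

# Assumption: on CPython 3.3+ the storage per character grows with the
# widest code point in the string, so these sizes differ.
sizes = [sys.getsizeof(t) for t in (ascii_text, bmp_text, astral_text)]
```

All three strings report the same length, yet on CPython 3.3+ they occupy different amounts of memory, which is why "the" internal representation is not a single thing.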
 
Piet van Oostrum

Heiko Wundram said:
Reading from stdin/a file gets you bytes, and
not a string, because Python cannot automagically guess what format the
input is in.

Huh?

Python 3.3.0rc1 (v3.3.0rc1:8bb5c7bc46ba, Aug 25 2012, 10:09:29)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
<class 'str'>
 
Nobody


Oh, it can certainly guess (in the absence of any other information, it
uses the current locale). Whether or not that guess is correct is a
different matter.

Realistically, if you want sensible behaviour from Python 3.x, you need
to use an ISO-8859-1 locale. That ensures that conversion between str and
bytes will never fail, and a str-bytes-str or bytes-str-bytes round-trip
will pass data through unmangled.
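
The round-trip property can be checked directly with the codec, independent of any locale setting. The sketch below is only an illustration; it also shows one caveat, namely that going from str to bytes with ISO-8859-1 can still fail for characters above U+00FF.

```python
# Every possible byte value, once.
every_byte = bytes(range(256))

# bytes -> str -> bytes through latin-1 never fails and loses nothing.
as_text = every_byte.decode("latin-1")
round_trip = as_text.encode("latin-1")

# The same input is not valid UTF-8, so a strict UTF-8 decode raises.
try:
    every_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Caveat: a str containing a character above U+00FF cannot be encoded as latin-1.
try:
    "\u20ac".encode("latin-1")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
```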
 