Νικόλαος Κούρας
And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
though mathematically they would translate into U+0000 and U+D800
respectively. The UTF-16 *mechanism* is limited to no more than Unicode
has currently used, but I'm left wondering if that's actually the other
way around - that Unicode planes were deemed to stop at the point where
UTF-16 can't encode any more.
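A quick way to see this for yourself, as a minimal sketch assuming Python 3 (whose built-in "utf-8" codec is strict about both cases). The helper name try_decode is made up for this illustration:

```python
def try_decode(raw):
    """Return the decoded text, or None if strict UTF-8 decoding fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

overlong_nul = b"\xc0\x80"     # overlong two-byte form of U+0000
surrogate = b"\xed\xa0\x80"    # three-byte pattern that would map to U+D800

# Naive bit arithmetic: strip the flag bits and glue the payload bits together.
naive_nul = ((overlong_nul[0] & 0x1F) << 6) | (overlong_nul[1] & 0x3F)
naive_surrogate = (
    ((surrogate[0] & 0x0F) << 12)
    | ((surrogate[1] & 0x3F) << 6)
    | (surrogate[2] & 0x3F)
)

print(hex(naive_nul), hex(naive_surrogate))               # 0x0 0xd800
print(try_decode(overlong_nul), try_decode(surrogate))    # None None
```

So the arithmetic "works", but the decoder refuses both sequences anyway.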
Typing 16474 in an interactive session, in both Python 2 and 3, gives back
the number 16474,
while we want the binary representation of the number 16474.
The leading 0b is just syntax to tell you "this is base 2, not base 8
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
ints always display in decimal. The only way to display in another base
is to build a string showing what the int would look like in a different
base:
py> hex(16474)
'0x405a'
Notice that the return values of bin, oct and hex are all strings. If they
were ints, then they would display in decimal, defeating the purpose!
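To make that concrete, here is a small sketch (Python 3 assumed; the variable names are just for illustration). It shows that all three functions hand back strings, and that int() with an explicit base turns such a string back into the same number:

```python
n = 16474

as_bin = bin(n)   # '0b100000001011010'
as_oct = oct(n)   # '0o40132' (Python 3 spelling of the octal prefix)
as_hex = hex(n)   # '0x405a'

# All three are str objects, not ints.
print(type(as_bin), type(as_oct), type(as_hex))

# Parsing the strings back gives the original int every time.
print(int(as_bin, 2), int(as_oct, 8), int(as_hex, 16))   # 16474 16474 16474

# format() builds the digits without the prefix, and can pad with zeros.
print(format(n, "b"))      # 100000001011010
print(format(n, "016b"))   # 0100000001011010
```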
So, the first high-bits are a directive that UTF-8 uses to know how many
bytes each character is being represented as.
0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
storage and the rest 7 bits to actually store the character ?
while
128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
storage and the rest 14 bits to actually store the character ?
Isn't 14 bits way too many to store a character ?
py> [ i for i in b'abc\x1b\n' ]
[97, 98, 99, 27, 10]
Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation.
128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.
1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)
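The counting rule above can be sketched in a few lines of Python 3. The function name utf8_lead_info is invented for this example, and it only classifies a single byte value; it does not validate a whole sequence (modern UTF-8 only uses lead bytes for 2 to 4 byte sequences):

```python
def utf8_lead_info(byte):
    """Classify one byte value (0..255) by counting its leading 1-bits."""
    ones = 0
    mask = 0x80
    while mask and byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return "single byte (ASCII)"
    if ones == 1:
        return "continuation byte"
    return "lead byte of a %d-byte sequence" % ones

# "\u00e9" is e-acute, "\u20ac" is the euro sign.
for raw in ("A".encode("utf-8"), "\u00e9".encode("utf-8"), "\u20ac".encode("utf-8")):
    for b in raw:
        print(format(b, "08b"), utf8_lead_info(b))
```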
Original UTF-8 allowed for 31 bits to specify a character in the Unicode
set. It used up to 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation byte.
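The bit budget can be checked with a little arithmetic. This is only a sketch of the counting described above; payload_bits is a made-up helper name:

```python
def payload_bits(n_bytes):
    """Bits left for the code point in an n-byte sequence (original 1..6 byte scheme)."""
    if n_bytes == 1:
        return 7                      # 0xxxxxxx
    lead = 8 - (n_bytes + 1)          # n leading 1-bits plus one 0 bit are flag bits
    return lead + 6 * (n_bytes - 1)   # each continuation byte (10xxxxxx) carries 6 bits

for n in range(1, 7):
    print(n, "byte(s):", payload_bits(n), "payload bits out of", 8 * n)
```

For 6 bytes that gives 1 + 5 * 6 = 31 payload bits out of 48.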
Why doesn't it work like this?
leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag
Wouldn't it be more logical?
| A code-point and the code-point's ordinal value are associated into
| a Unicode charset. They have the so called 1:1 mapping.
|
| So, i was under the impression that by encoding the code-point into
| utf-8 was the same as encoding the code-point's ordinal value into
| utf-8.
|
| So, now i believe they are two different things.
| The code-point *is what actually* needs to be encoded and *not* its
| ordinal value.
Because there is a 1:1 mapping, these are the same thing: a code
point is directly _represented_ by the ordinal value, and the ordinal
value is encoded for storage as bytes.
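A short sketch of that 1:1 mapping in Python 3 (the character chosen is just an example):

```python
ch = "\u00e9"                  # e-acute, an example character
ordinal = ord(ch)              # the code point's ordinal value
print(ordinal, hex(ordinal))   # 233 0xe9

# chr() goes the other way, so the mapping is 1:1.
print(chr(ordinal) == ch)      # True

# Encoding the character encodes that ordinal value as bytes.
encoded = ch.encode("utf-8")
print(encoded)                 # b'\xc3\xa9'
print(encoded.decode("utf-8") == ch)   # True
```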
'0b100000001011010'
| > The leading 0b is just syntax to tell you "this is base 2, not base 8
| > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
|
| But byte objects are represented as '\x' instead of the
| aforementioned '0x'. Why is that?
You're confusing a "string representation of a single number in
some base (eg 2 or 16)" with the "string-ish representation of a
bytes object".
.... print(value, hex(value), bin(value))
| How can i view this byte's object representation as hex() or as bin()?
See above. A bytes is a _sequence_ of values. hex() and bin() print
individual values in hexadecimal or binary respectively.
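For example, a minimal sketch assuming Python 3, where iterating over a bytes object yields ints (the sample data is arbitrary):

```python
data = b"abc\x1b\n"

# Each element of a bytes object is an int in range(256).
for value in data:
    print(value, hex(value), bin(value))

# bytes.hex() (Python 3.5+) gives the whole sequence as one hex string.
whole_hex = data.hex()
print(whole_hex)   # 6162631b0a
```

The first line printed is `97 0x61 0b1100001`.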
Think about it. Let's say that, as per your scheme, a leading 0
indicates "1 byte" (as is indeed the case in UTF8). What things could
follow that leading 0? How does that impact your choice of a leading
00 or 01 for other numbers of bytes?
... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:
01010101
Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?
Now look at the way UTF8 does it:
<http://en.wikipedia.org/wiki/Utf-8#Description>
Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the
designers of UTF8 made, and why they made them.
Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:
0xxxxxxx
1xxxxxxx
00xxxxxx
01xxxxxx
10xxxxxx
11xxxxxx
If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.
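If you would rather check your answer mechanically, here is a rough sketch in Python. It treats each flag as a string of bits and looks for flags that are the start of another flag; ambiguous_pairs is a name made up for this example, and the UTF-8 list is the set of leading patterns from the table linked above (with 10 for continuation bytes):

```python
def ambiguous_pairs(prefixes):
    """Return (a, b) pairs where flag a is itself the start of a different flag b."""
    return [
        (a, b)
        for a in prefixes
        for b in prefixes
        if a != b and b.startswith(a)
    ]

proposed = ["0", "1", "00", "01", "10", "11"]
utf8 = ["0", "10", "110", "1110", "11110"]

print(ambiguous_pairs(proposed))   # [('0', '00'), ('0', '01'), ('1', '10'), ('1', '11')]
print(ambiguous_pairs(utf8))       # []

# The byte 01010101 matches more than one flag in the proposed scheme.
print([p for p in proposed if "01010101".startswith(p)])   # ['0', '01']
```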
Indeed python wrapped it in single quotes, '0b100000001011010', and
not as 0b100000001011010, which in fact makes it a string.
But since bin(16474) seems to create a string rather than the expected
number (at least in my mind), then how do we get the binary
representation of the number 16474 as a number?
On 13-06-13 10:08, Νικόλαος Κούρας wrote:
You don't. You should remember that python (or any programming language)
doesn't print numbers. It always prints string representations of
numbers. It is just that we are so used to the decimal representation
that we think of that representation as being the number.
Normally that is not a problem, but it can cause confusion when you are
working with multiple representations.
Hold on!
You are basically saying here that:
16474
is not a number as we think, but instead is a string representation of a
number?
Yes, or if you prefer, what python prints is the decimal notation of the number.
I don't think so; if it were a string representation of a number, that
would print the following:
'16474'
Python prints numbers:
No it doesn't, numbers are abstract concepts that can be represented in
various notations, and these notations are strings. Those notational strings
end up being printed. As I said before, we are so used to the
decimal notation that we often use the notation and the number interchangeably
without a problem. But when we are working with multiple notations, that
can become confusing and we should be careful to separate numbers from their
representations/notations.
but when we need a decimal integer
There are no decimal integers. There is only a decimal notation of the number.
Decimal, octal etc are not characteristics of the numbers themselves.
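A small sketch of the distinction in Python 3: the same int can be written in several notations, and what gets printed is always a string built from it.

```python
n = 16474

# Four notations in the source code, one and the same number.
print(n == 0x405a == 0o40132 == 0b100000001011010)   # True

# The object itself is an int; what ends up on screen is a str made from it.
print(type(n))          # <class 'int'>
print(type(str(n)))     # <class 'str'>

# Parsing different notations gives back the same int.
print(int("16474") == int("405a", 16) == int("100000001011010", 2))   # True

# A number and a string that merely looks like it are different objects.
print(n == "16474")     # False
```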
So everything we see like:
16474
nikos
abc123
everything is a string and nothing is a number? not even number 1?
On 14.06.2013 10:37, Nick the Gr33k wrote:
Come on now, this is _so_ obviously trolling, it's not even remotely
funny anymore. Why doesn't killfiling work with the mailing list version
of the python list? :-(
I'm not trolling, man, I just have a hard time understanding why numbers
act as strings.