A few questions about encoding

  • Thread starter Νικόλαος Κούρας

Joel Goldstick

On 15/6/2013 5:59 PM, Roy Smith wrote:

And, yes, especially in networking, everybody talks about octets when

1 byte = 8 bits

In networking, though, since we do not use variable-length encoding
schemes like UTF-8, how do we tell where a byte value starts and where
it stops?

Do we need a start bit and a stop bit for that?
 

Dennis Lee Bieber

It depends on the context.

Maybe the OP should give up on Python and switch to Regina/Rexx...

-=-=-=-=-=-
/* */

numint = 123 /* a "number" */
numstr = def /* unknown variable? */
strstr = "abc" /* a string containing alphabetics */
strint = "456" /* a string containing decimal digits */

signal on syntax name next1
say "Adding strstr and numint"
say strstr + numint
next1:
signal on syntax name next2
say "Adding strstr and strint"
say strstr + strint
next2:
signal on syntax name next3
say "Adding numint and strint"
say numint + strint
next3:
signal on syntax name next4
say "Adding numstr and numint"
say numstr + numint
next4:
say "Concatenate numint and strstr"
say numint || strstr
say "Concatenate strint and numint"
say strint || numint
say "Concatenate numstr and strint"
say numstr || strint
say "Concatenate numstr and numint"
say numstr || numint

-=-=-=-=-=-

E:\UserData\Wulfraed\MYDOCU~1>rexx t.rx
Adding strstr and numint
Adding strstr and strint
Adding numint and strint
579
Adding numstr and numint
Concatenate numint and strstr
123abc
Concatenate strint and numint
456123
Concatenate numstr and strint
DEF456
Concatenate numstr and numint
DEF123

E:\UserData\Wulfraed\MYDOCU~1>

{Pity a SYNTAX error can't be trapped with a CALL; I'd have been able to
cleanly report results and return to the next statement}
 

Nick the Gr33k

The only thing that I didn't understand is this line.
First, please tell me what a byte value is.


\x1b is a character (ESC) represented in hex format

b'\x1b' is a bytes object that represents what?


'\x1b'

After decoding it gives the char ESC in hex format
Shouldn't it result in value 27 which is the ordinal of ESC ?


Why doesn't the Unicode charset just contain characters, instead of
a mapping of (characters <--> ordinals)?

I mean, what we do is encode a character, like chr(65).encode('utf-8')

What's the reason for the existence of its corresponding ordinal value,
since it doesn't get involved in the encoding process?
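
One way to look at this question: the ordinal (code point) is the encoding-independent identity of a character, and each encoding is a different rule for turning that number into bytes. A minimal sketch, assuming Python 3; the variable names are only illustrative:

```python
# The code point (ordinal) identifies the character independently of any encoding.
ch = chr(65)                 # 'A'
code_point = ord(ch)         # 65

# Each encoding maps the same code point to its own byte sequence.
utf8_bytes = ch.encode('utf-8')         # one byte
utf16_bytes = ch.encode('utf-16-be')    # two bytes

# A non-ASCII example: GREEK SMALL LETTER ALPHA, U+03B1.
alpha = '\u03b1'
alpha_utf8 = alpha.encode('utf-8')          # two bytes in UTF-8
alpha_greek = alpha.encode('iso-8859-7')    # one byte in the legacy Greek charset

print(code_point, utf8_bytes, utf16_bytes)
print(hex(ord(alpha)), alpha_utf8, alpha_greek)
```

The code point stays the same in every case; only the bytes differ, which is why the mapping is kept separate from any particular encoding.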

Thank you very much for taking the time to explain.

Can someone please explain these questions too?
 

Benjamin Schollnick

Nick,

I'm sorry are you not listening?

1b is a HEXADECIMAL Number. As a so-called programmer, did you seriously not consider that?

Try this:

1) Open a Web browser
2) Go to Google.com
3) Type in "Hex 1B"
4) Click on the first link
5) In the Hexadecimal column find 1B.

Or open your favorite calculator, and convert Hexadecimal 1B to Decimal (Base 10).
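
The same conversion can also be checked from a Python 3 prompt. This is only a sketch of the relationship between hex 1B, the byte value, and the ESC character; the variable names are made up for illustration:

```python
value = int('1B', 16)        # parse the hex digits as a number
raw = b'\x1b'                # a bytes object holding a single byte
byte_value = raw[0]          # indexing bytes yields an int
text = raw.decode('ascii')   # a one-character str: ESC
ordinal = ord(text)          # the ordinal of that character

# repr() shows the str with a hex escape because ESC is not printable;
# the escape is just notation for the character whose ordinal is 27.
print(value, byte_value, ordinal, hex(ordinal), repr(text))
```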

- Benjamin
 

Antoon Pardon

On 15-06-13 02:28, Cameron Simpson wrote:
| So, a numeral = a string representation of a number. Is this correct?

No, a numeral is an individual digit from the string representation of a number.
So: 65 requires two numerals: '6' and '5'.

Wrong context. A numeral as an individual digit is when you are talking about
individual characters in a font. In such a context the set of glyphs that
represent a digit are the numerals.

However in a context of programming, numerals in general refer to the set of
strings that represent a number.
 

Guy Scree

I recommend that all participants in this thread, especially Alex and
Anton, research the term "Pathological Altruism"
 

Chris Angelico

I recommend that all participants in this thread, especially Alex and
Anton, research the term "Pathological Altruism"

I don't intend to buy a book about it, but based on flipping through a
few Google results and snippets, I'm thinking that this is the
"Paladin fault" that I know from Dungeons & Dragons. :)

ChrisA
 

Rick Johnson

Gah! That's twice I've screwed that up.
Sorry about that!

Yeah, and your difficulty explaining the Unicode implementation reminds me of a passage from the Python zen:

"If the implementation is hard to explain, it's a bad idea."
 

Steven D'Aprano

Yeah, and your difficulty explaining the Unicode implementation reminds
me of a passage from the Python zen:

"If the implementation is hard to explain, it's a bad idea."

The *implementation* is easy to explain. It's the names of the encodings
which I get tangled up in.


ASCII: Supports exactly 127 code points, each of which takes up exactly 7
bits. Each code point represents a character.

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
about a gazillion other legacy charsets, all of which are mutually
incompatible: supports anything from 127 to 65535 different code points,
usually under 256.

UCS-2: Supports exactly 65535 code points, each of which takes up exactly
two bytes. That's fewer than required, so it is obsoleted by:

UTF-16: Supports all 1114111 code points in the Unicode charset, using a
variable-width system where the most popular characters use exactly two
bytes and the remaining ones use a pair of characters.

UCS-4: Supports exactly 4294967295 code points, each of which takes up
exactly four bytes. That is more than needed for the Unicode charset, so
this is obsoleted by:

UTF-32: Supports all 1114111 code points, using exactly four bytes each.
Code points outside of the range 0 through 1114111 inclusive are an error.

UTF-8: Supports all 1114111 code points, using a variable-width system
where popular ASCII characters require 1 byte, and others use 2, 3 or 4
bytes as needed.


Ignoring the legacy charsets, only UTF-16 is a terribly complicated
implementation, due to the surrogate pairs. But even that is not too bad.
The real complication comes from the interactions between systems which
use different encodings, and that's nothing to do with Unicode.
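
As a rough illustration of the fixed-width versus variable-width behaviour described above, the sketch below encodes a few sample characters with Python 3's built-in codecs and reports the byte counts. The sample characters and the helper name `encoded_sizes` are arbitrary choices for this example:

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding."""
    encodings = ('utf-8', 'utf-16-le', 'utf-32-le')  # -le variants avoid a BOM
    return {enc: len(ch.encode(enc)) for enc in encodings}

samples = [
    'A',            # U+0041, ASCII
    '\u03b1',       # U+03B1, Greek alpha
    '\u20ac',       # U+20AC, euro sign
    '\U0001F600',   # U+1F600, outside the Basic Multilingual Plane
]

for ch in samples:
    print(f'U+{ord(ch):04X}', encoded_sizes(ch))
```

UTF-32 always uses four bytes, UTF-16 uses two or four, and UTF-8 uses one to four depending on the code point.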
 

MRAB

The *implementation* is easy to explain. It's the names of the encodings
which I get tangled up in.

You're off by one below!

ASCII: Supports exactly 127 code points, each of which takes up exactly 7
bits. Each code point represents a character.

128 codepoints.

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
about a gazillion other legacy charsets, all of which are mutually
incompatible: supports anything from 127 to 65535 different code points,
usually under 256.

128 to 65536 codepoints.

UCS-2: Supports exactly 65535 code points, each of which takes up exactly
two bytes. That's fewer than required, so it is obsoleted by:

65536 codepoints.

etc.
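
The off-by-one comes from counting from zero: an n-bit field holds 2**n values, and the highest value is one less than the count. A small sketch, assuming Python 3.3 or later (where `sys.maxunicode` is always 0x10FFFF); the variable names are illustrative only:

```python
import sys

ascii_count = 2 ** 7                       # values 0..127
ucs2_count = 2 ** 16                       # values 0..65535
highest_code_point = sys.maxunicode        # 0x10FFFF
unicode_count = highest_code_point + 1     # code points 0..0x10FFFF

print(ascii_count, ucs2_count, highest_code_point, unicode_count)
```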
 

Rick Johnson

The *implementation* is easy to explain. It's the names of
the encodings which I get tangled up in.

Well, ignoring the fact that your last explanation is
still buggy, you have not actually described an
"implementation", no, you've merely generalized (and quite
vaguely, I might add) the technical specification of a few
encodings. Let's ask Wikipedia to enlighten us on the
subject of "implementation":

############################################################
# Define: Implementation #
############################################################
# In computer science, an implementation is a realization #
# of a technical specification or algorithm as a program, #
# software component, or other computer system through #
# computer programming and deployment. Many #
# implementations may exist for a given specification or #
# standard. For example, web browsers contain #
# implementations of World Wide Web Consortium-recommended #
# specifications, and software development tools contain #
# implementations of programming languages. #
############################################################

Do you think someone could reliably implement the alphabet of a new
language in Unicode by using the general outline you
provided? -- again, ignoring your continual fumbling when
explaining that simple generalization :)

Your generalization is analogous to explaining web browsers
as: "software that allows a user to view web pages in the
range www.*" Do you think someone could implement a web
browser from such limited specification? (if that was all
they knew?).

============================================================
Since we're on the subject of Unicode:
============================================================
One of the most humorous aspects of Unicode is that it has
encodings for Braille characters. Hmm, this presents a
conundrum of sorts. RIDDLE ME THIS?!

Since Braille is a type of "reading" for the blind by
utilizing the sense of touch (therefore DEMANDING 3
dimensions) and glyphs derived from Unicode are
restrictively two dimensional, because let's face it people,
Unicode exists in your computer, and computer screens are
two dimensional... but you already knew that -- i think?,
then what is the purpose of a Unicode Braille character set?

That should haunt your nightmares for some time.
 

Andrew Berg

One of the most humorous aspects of Unicode is that it has
encodings for Braille characters. Hmm, this presents a
conundrum of sorts. RIDDLE ME THIS?!

Since Braille is a type of "reading" for the blind by
utilizing the sense of touch (therefore DEMANDING 3
dimensions) and glyphs derived from Unicode are
restrictively two dimensional, because let's face it people,
Unicode exists in your computer, and computer screens are
two dimensional... but you already knew that -- i think?,
then what is the purpose of a Unicode Braille character set?

Two dimensional characters can be made into 3 dimensional shapes.
Building numbers are a good example of this.
We already have one Unicode troll; do we really need you too?
 

Rick Johnson

On 2013.06.20 08:40, Rick Johnson wrote:
Two dimensional characters can be made into 3 dimensional shapes.

Yes in the real world. But what about on your computer
screen? How do you plan on creating tactile representations of
braille glyphs on my monitor? Hey, if you can already do this,
please share, as it sure would make internet porn more
interesting!
Building numbers are a good example of this.

Either the matrix is reality or you must live inside your
computer as a virtual being. Is your name Tron? Are you a pawn
of Master Control? He's such a tyrant!
 

Chris Angelico

Yes in the real world. But what about on your computer
screen? How do you plan on creating tactile representations of
braille glyphs on my monitor? Hey, if you can already do this,
please share, as it sure would make internet porn more
interesting!

I had a device for creating embossed text. It predated Unicode by a
couple of years at least (not sure how many, because I was fairly
young at the time). It was made by a company called Epson, it plugged
into the computer via a 25-pin plug, and when it was properly
functioning, it had a ribbon of ink that it would bash through to
darken the underside of the embossed text. But sometimes that ribbon
slipped out of position, and we had beautifully-hammered ASCII text,
unsullied by ink. And since the device did graphics too, it could be
used for the entire Unicode character set if you wanted.

Not sure that it would improve your porn any, but I've no doubt you
could try if you wanted.

ChrisA
 

Chris Angelico

Your generalization is analogous to explaining web browsers
as: "software that allows a user to view web pages in the
range www.*" Do you think someone could implement a web
browser from such limited specification? (if that was all
they knew?).

Wow. That spec isn't limited, it's downright faulty. Or do you really
think that (a) there is such a thing as the "range www.*", and that
(b) that "range" has anything to do with web browsers?

ChrisA
 

wxjmfauth

On Thursday, 20 June 2013 13:43:28 UTC+2, MRAB wrote:
You're off by one below!

128 codepoints.

128 to 65536 codepoints.

65536 codepoints.

etc.

And all these coding schemes have something in common:
they all work with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (bytes)).

Just what the flexible string representation is not
doing: it artificially divides Unicode into subsets and tries
to handle each subset differently.

On the other hand, it is because it is impossible to
work properly with multiple sets of encoded code points
that all these coding schemes exist today. There is simply
no other way.

Even "exotic" schemes like "CID-fonts" used in pdf
are based on that scheme.

jmf
 

Chris Angelico

And all these coding schemes have something in common:
they all work with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (bytes)).

Just what the flexible string representation is not
doing: it artificially divides Unicode into subsets and tries
to handle each subset differently.


UTF-16 divides Unicode into two subsets: BMP characters (encoded using
one 16-bit unit) and astral characters (encoded using two 16-bit units
in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
builds are guilty of exactly the same crime as the hated 3.3.
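
To make the two UTF-16 subsets concrete, here is a small sketch that splits the UTF-16 encoding of a character into its 16-bit units. The helper `utf16_units` is a made-up name for this example; it relies only on `str.encode` and `int.from_bytes` from Python 3:

```python
def utf16_units(ch):
    """Return the 16-bit code units of ch's UTF-16 encoding as integers."""
    data = ch.encode('utf-16-be')   # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]

bmp_units = utf16_units('\u20ac')         # a BMP character: one unit
astral_units = utf16_units('\U0001F600')  # an astral character: two units

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])

# Both units of the pair fall in D800..DFFF, i.e. they share the same top
# five bits (0b11011), which is what the "D800::/5" joke refers to.
print(all(u >> 11 == 0b11011 for u in astral_units))
```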

ChrisA
 

MRAB

UTF-16 divides Unicode into two subsets: BMP characters (encoded using
one 16-bit unit) and astral characters (encoded using two 16-bit units
in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
builds are guilty of exactly the same crime as the hated 3.3.

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!
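
The boundaries between those UTF-8 subsets can be probed directly. A minimal sketch, assuming Python 3; `utf8_len` is a hypothetical helper name, and the code points were picked to sit on either side of each boundary:

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode('utf-8'))

boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_len(cp) for cp in boundaries]

for cp, n in zip(boundaries, lengths):
    print(f'U+{cp:04X} -> {n} byte(s)')
```

Note that the surrogate range U+D800 to U+DFFF is deliberately not sampled here, since Python refuses to encode lone surrogates as UTF-8 by default.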
 

Chris Angelico

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!

Yes, but there's never (AFAIK) been a Python implementation that
represents strings in UTF-8; UTF-16 was one of two options for Python
2.2 through 3.2, and is the one that jmf always seems to be measuring
against.
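
For anyone who wants to see the difference on their own machine, the sketch below assumes CPython 3.3 or later, where the flexible string representation replaced the narrow/wide build choice. The exact sizes reported by `sys.getsizeof` are an implementation detail and vary between versions, so only their ordering is worth looking at:

```python
import sys

astral = '\U0001F600'

# On 3.3+ there is no narrow build: the full range is always available and
# an astral character counts as one character, not a surrogate pair.
full_range = sys.maxunicode == 0x10FFFF
astral_length = len(astral)

# Strings of equal length use 1, 2 or 4 bytes per character depending on
# the widest code point they contain.
sizes = [sys.getsizeof(ch * 100) for ch in ('a', '\u20ac', astral)]

print(full_range, astral_length, sizes)
```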

ChrisA
 

Jussi Piitulainen

Rick said:
Yes in the real world. But what about on your computer screen? How
do you plan on creating tactile representations of braille glyphs on
my monitor? Hey, if you can already do this, please share, as it
sure would make internet porn more interesting!

Search for braille display on the web. A wikipedia article also led me
to braille e-book. (Or search for braille porn, since you are so
inclined - the concept turns out to be already out there on the web.)
 
