hex dump w/ or w/out utf-8 chars


blatt

Hi all,
but a particular hello to Chris Angelico, whose criticism and
suggestions pushed me to make a full revision of my application for
hex dumping in the presence of UTF-8 characters.
If you are not using Python 3, the utf-8 codec can add further programming
problems, especially if you are not a guru....
The script seems very long, but I commented too much ... sorry.
It is very useful (at least IMHO...).
It works under Linux, but there is still a little problem which I didn't
solve (at least programmatically...).


# -*- coding: utf-8 -*-
# px.py vers. 11 (pxb.py) # python 2.6.6
# hex-dump w/ or w/out utf-8 chars
# Using spaces as separators, this script shows
# (better than tabnanny) uncorrect indentations.

# to save output > python pxb.py hex.txt > px9_out_hex.txt

nLenN=3 # n. of digits for line numbers

# version almost thoroughly rewritten on the basis of
# the criticism and modifications suggested by Chris Angelico

# in the first version the utf-8 conversion to hex was shown horizontally:

# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a

# ... but I had to insert additional chars to keep the
# synchronization between the literal and the hex part

# 005 # qwerty: non è. unicode bensì. ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a

# in the second version I followed Chris's suggestion
# "to show the hex utf-8 vertically"

# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 c 7666666 6667c 676660
#     3 175249a efe 3 5e93f45 25e33 13399a
#                   a             a
#                   8             c

# between the two solutions, I selected the first one + synchronization,
# which seems more compact and easier to program (... I'm lazy...)

# various run options:
# std      : python px.py file
# bash cat : cat file | python px.py   (alias hex)
# bash echo: echo line | python px.py " "

# works on any n. of bytes for utf-8

# For the user: it is helpful to have in a separate file
# all special characters of interest, together with their names.

# error (the same in every run option):

# echo '345"789"'|hex gives       instead of the correct
#     345"789"                    345"789"
#     33323332                    333233320
#     3452789 a                   34527892a

# ... correction: avoid a " just before the \n at the end of the test line
# echo "345'789'"|hex gives
#     345'789'
#     333233320
#     34577897a

# If someone can solve this bug...

###################


import fileinput
import sys, commands # not actually used in this listing

lF=[] # input file as list
for line in fileinput.input(): # handles all the details of args-or-stdin
    lF.append(line)
sSpacesXLN = ' ' * (nLenN+1)


for n in xrange(len(lF)):
    sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
    sLineHex  =lF[n].encode('hex').replace('20','  ') # two spaces: one per nibble column
    sLineHexH =sLineHex[::2]
    sLineHexL =sLineHex[1::2]

    sSynchro=''
    for k in xrange(0,len(sLineHexND),2):
        if sLineHexND[k]<'8':
            sSynchro+= sLineHexND[k]+sLineHexND[k+1]
            k+=1
        elif sLineHexND[k]=='c':
            sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
            k+=3
        elif sLineHexND[k]=='e':
            sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
                      sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
            k+=5

    # text output (synchronized)
    print str(n+1).zfill(nLenN)+' '+sSynchro.decode('hex'),
    print sSpacesXLN + sLineHexH
    print sSpacesXLN + sLineHexL+ '\n'


If there are problems understanding the output, probably due to proportional
fonts, the best thing is to import it into an editor with a monospaced font...

As I already told Chris... criticism is welcome!

Bye, Blatt.
 

Chris Angelico

Hi all,
but a particular hello to Chris Angelico, whose criticism and
suggestions pushed me to make a full revision of my application for
hex dumping in the presence of UTF-8 characters.

Hiya! Glad to have been of assistance :)
As I already told Chris... criticism is welcome!

No problem.
# -*- coding: utf-8 -*-
# px.py vers. 11 (pxb.py) # python 2.6.6
# hex-dump w/ or w/out utf-8 chars
# Using spaces as separators, this script shows
# (better than tabnanny) uncorrect indentations.

# to save output > python pxb.py hex.txt > px9_out_hex.txt

nLenN=3 # n. of digits for line numbers

# chomp heaps and heaps of comments

Little nitpick, since you did invite criticism :) When I went to copy
and paste your code, I skipped all the comments and started at the
line of hashes... and then didn't have the nLenN definition. Posting
code to a forum like this is a huge invitation to try the code (it's
the very easiest way to know what it does), so I would recommend
having all your comments at the top, and all the code in a block
underneath. It'd be that bit easier for us to help you. Not a big
deal, though, I did figure out what was going on :)
sLineHex =lF[n].encode('hex').replace('20','  ')

Here's the problem. Your hex string ends with "220a", and the
replace() method doesn't concern itself with the divisions between
bytes. It finds the second 2 of 22 and the leading 0 of 0a and
replaces them.

I think the best solution may be to avoid the .encode('hex') part,
since it's not available in Python 3 anyway. Alternatively (if Py3
migration isn't a concern), you could do something like this:

sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
sLineHex =sLineHexND # No reason to redo the encoding
twentypos=0
while True:
    twentypos=sLineHex.find("20",twentypos)
    if twentypos==-1: break # We've reached the end of the string
    if not twentypos%2: # It's at an even-numbered position, replace it
        sLineHex=sLineHex[:twentypos]+'  '+sLineHex[twentypos+2:]
    twentypos+=1
# then continue on as before
sLineHexH =sLineHex[::2]
sLineHexL =sLineHex[1::2]
[ code continues ]
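(A footnote on the first suggestion: binascii.hexlify is a portable
spelling that exists on both Python 2 and 3. A minimal sketch, with a
made-up input line:)

# -*- coding: utf-8 -*-
import binascii

data = u'non è ascii\n'.encode('utf-8')  # work on bytes explicitly
print(binascii.hexlify(data))            # 6e6f6e20c3a82061736369690a (b'...' prefix on Python 3)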

Hope that helps!

ChrisA
 

Steven D'Aprano

Hi all,
but a particular hello to Chris Angelico, whose criticism and
suggestions pushed me to make a full revision of my application for hex
dumping in the presence of UTF-8 characters.

I don't understand what you are trying to say. All characters are UTF-8
characters. "a" is a UTF-8 character. So is "ă".

If you are not using Python 3, the utf-8 codec can add further
programming problems,

On the contrary, I find that so long as you understand what you are doing,
it solves problems rather than adding them. However, if you are confused about the
difference between characters (text strings) and bytes, or if you are
dealing with arbitrary binary data and trying to treat it as if it were
UTF-8 encoded text, then you can have errors. Those errors are a good
thing.

especially if you are not a guru.... The script
seems very long, but I commented too much ... sorry. It is very useful
(at least IMHO...)
It works under Linux, but there is still a little problem which I didn't
solve (at least programmatically...).


# -*- coding: utf-8 -*-
# px.py vers. 11 (pxb.py) # python 2.6.6
# hex-dump w/ or w/out utf-8 chars
# Using spaces as separators, this script shows
# (better than tabnanny) uncorrect indentations.

The word you are looking for is "incorrect".

# to save output > python pxb.py hex.txt > px9_out_hex.txt

nLenN=3 # n. of digits for line numbers

# version almost thoroughly rewritten on the basis of
# the criticism and modifications suggested by Chris Angelico

# in the first version the utf-8 conversion to hex was shown horizontally:

# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a

Oh! We're supposed to read the output *downwards*! That's not very
intuitive. It took me a while to work that out. You should at least say
so.

# ... but I had to insert additional chars to keep the
# synchronization between the literal and the hex part

# 005 # qwerty: non è. unicode bensì. ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a

Well that sucks, because now sometimes you have to read downwards
(character 'q' -> hex 71, reading downwards) and sometimes you read both
downwards and across (character 'è' -> hex c3a8). Sometimes a dot means a
dot and sometimes it means filler. How is the user supposed to know when
to read down and when across?

# in the second version I followed Chris's suggestion
# "to show the hex utf-8 vertically"

You're already showing UTF-8 characters vertically, if they happen to be
one-byte characters. Better to be consistent and always show characters
vertically, regardless of whether they are one, two or four bytes.

# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 c 7666666 6667c 676660
#     3 175249a efe 3 5e93f45 25e33 13399a
#                   a             a
#                   8             c

Much better! Now at least you can trivially read down the column to see
the bytes used for each character. As an alternative, you can space each
character to show the bytes horizontally, displaying spaces and other
invisible characters either as dots, backslash escapes, or Unicode
control pictures, whichever you prefer. The example below uses dots for
spaces and backslash escape for newline:

q  w  e  r  t  y  :  .  n  o  n  .  è     .  u  n  i
71 77 65 72 74 79 3a 20 6e 6f 6e 20 c3 a8 20 75 6e 69

c  o  d  e  .  b  e  n  s  ì     .  a  s  c  i  i  \n
63 6f 64 65 20 62 65 6e 73 c3 ac 20 61 73 63 69 69 0a


There will always be some ambiguity between (e.g.) dot representing a
dot, and it representing an invisible control character or space, but the
reader can always tell them apart by reading the hex value, which you
*always* read horizontally whether it is one byte, two or four. There's
never any confusion whether you should read down or across.

Unfortunately, most fonts don't support the Unicode control pictures. But
if you choose to use them, here they are, together with their Unicode
name. You can use the form

'\N{...}' # Python 3
u'\N{...}' # Python 2

to get the characters, replacing ... with the name shown below:


␀ SYMBOL FOR NULL
␁ SYMBOL FOR START OF HEADING
␂ SYMBOL FOR START OF TEXT
␃ SYMBOL FOR END OF TEXT
␄ SYMBOL FOR END OF TRANSMISSION
␅ SYMBOL FOR ENQUIRY
␆ SYMBOL FOR ACKNOWLEDGE
␇ SYMBOL FOR BELL
␈ SYMBOL FOR BACKSPACE
␉ SYMBOL FOR HORIZONTAL TABULATION
␊ SYMBOL FOR LINE FEED
␋ SYMBOL FOR VERTICAL TABULATION
␌ SYMBOL FOR FORM FEED
␍ SYMBOL FOR CARRIAGE RETURN
␎ SYMBOL FOR SHIFT OUT
␏ SYMBOL FOR SHIFT IN
␐ SYMBOL FOR DATA LINK ESCAPE
␑ SYMBOL FOR DEVICE CONTROL ONE
␒ SYMBOL FOR DEVICE CONTROL TWO
␓ SYMBOL FOR DEVICE CONTROL THREE
␔ SYMBOL FOR DEVICE CONTROL FOUR
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE
␖ SYMBOL FOR SYNCHRONOUS IDLE
␗ SYMBOL FOR END OF TRANSMISSION BLOCK
␘ SYMBOL FOR CANCEL
␙ SYMBOL FOR END OF MEDIUM
␚ SYMBOL FOR SUBSTITUTE
␛ SYMBOL FOR ESCAPE
␜ SYMBOL FOR FILE SEPARATOR
␝ SYMBOL FOR GROUP SEPARATOR
␞ SYMBOL FOR RECORD SEPARATOR
␟ SYMBOL FOR UNIT SEPARATOR
␠ SYMBOL FOR SPACE
␡ SYMBOL FOR DELETE
␢ BLANK SYMBOL
␣ OPEN BOX
␤ SYMBOL FOR NEWLINE
␥ SYMBOL FOR DELETE FORM TWO
␦ SYMBOL FOR SUBSTITUTE FORM TWO
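For instance (a Python 2 spelling, to match the script above; whether the
glyph actually displays depends on your font):

print u'\N{SYMBOL FOR LINE FEED}'   # ␊ (U+240A)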


(I wish more fonts would support these characters, they are very useful.)


[...]
# works on any n. of bytes for utf-8

# For the user: it is helpful to have in a separate file
# all special characters of interest, together with their names.

In Python, you can use the unicodedata module to look up characters by
name, or, given the character, find out what its name is.
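For example (a minimal sketch, again in the Python 2 spelling used by the
script under discussion):

import unicodedata

print unicodedata.name(u'\xe8')   # LATIN SMALL LETTER E WITH GRAVE
print repr(unicodedata.lookup('LATIN SMALL LETTER E WITH GRAVE'))   # u'\xe8'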


[...]
import fileinput
import sys, commands

lF=[] # input file as list
for line in fileinput.input(): # handles all the details of args-or-stdin
    lF.append(line)


That is more easily written as:

lF = list(fileinput.input())

and better written with a meaningful file name. Whenever you have a
variable, and find the need to give a comment explaining what the
variable name means, you should consider a more descriptive name.

When that name is a cryptic two letter name, that goes double.

sSpacesXLN = ' ' * (nLenN+1)


for n in xrange(len(lF)):
    sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)

You're programming like a Pascal or C programmer. There is nearly never
any need to write code like that in Python. Rather than iterate over the
indexes, then extract the part you want, it is better to iterate directly
over the parts you want:

for line in lF:
    sLineHexND = line.encode('hex')


sLineHex  =lF[n].encode('hex').replace('20','  ')
sLineHexH =sLineHex[::2]
sLineHexL =sLineHex[1::2]

Trying to keep code lined up in this way is a bad habit to get into. It
just sets you up for many hours of unproductive adding and deleting
spaces trying to keep things aligned.

Also, what on earth are all these "s" prefixes?
sSynchro=''
for k in xrange(0,len(sLineHexND),2):

Probably the best way to walk through a string, grabbing the characters
in pairs, comes from the itertools module: see the recipe for "grouper".

http://docs.python.org/2/library/itertools.html

Here is a simplified version:

assert len(line) % 2 == 0
for pair in zip(*([iter(line)]*2)):
    ...

although understanding how it works requires a little advanced knowledge.
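For instance, applied to the hex digits of "è " (a toy input, not taken
from the thread):

hexstr = "c3a820"
print [a+b for a, b in zip(*([iter(hexstr)]*2))]   # ['c3', 'a8', '20']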

if sLineHexND[k]<'8':
    sSynchro+= sLineHexND[k]+sLineHexND[k+1]
    k+=1
elif sLineHexND[k]=='c':
    sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
    k+=3
elif sLineHexND[k]=='e':
    sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
              sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
    k+=5

Apart from being hideously ugly to read, I do not believe this code works
the way you think it works. Adding to the loop variable doesn't advance
the loop. Try this and see for yourself:


for i in range(10):
    print(i)   # always prints 0 through 9; the i += 5 below is overwritten
    i += 5


The loop variable just gets reset once it reaches the top of the loop
again.
 

ferdy.blatsco

Hi Chris,
glad to have received your contribution, but I was expecting much more
criticism...
Starting from the "little nitpick" about the placement of comments in my
script... you are correct... It is a bad habit on my part to place
variables subject to change at the beginning of the script... and then
forget about them...
About the mistake due to replace, you gave me a perfect explanation.
Unfortunately (as I probably told you before) I will never move to
Python 3... Guido should not always listen only to gurus like him...
I don't like Python as much as before... starting from OOP and ending with
codecs like utf-8. Regarding OOP, much appreciated especially by experts, he
could use Python 2 for hiding the complexities of OOP (improving, as an
effect, the hiding of an object's code), moving classes and objects to
imported methods, leaving in this way the programming style to the
well-known old style: sequential programming and functions.
About utf-8... the same solution: keep utf-8, but for the non-experts add
methods to convert to solutions which use the range 128-255 of only one
byte (I do not give a damn about Chinese and "similia"!...)
I know that it is a lost battle (in Italian, "una battaglia persa")!

Bye, Blatt
 

Chris Angelico

Unfortunately (as I probably told you before) I will never move to
Python 3... Guido should not always listen only to gurus like him...
I don't like Python as much as before... starting from OOP and ending with
codecs like utf-8. Regarding OOP, much appreciated especially by experts, he
could use Python 2 for hiding the complexities of OOP (improving, as an
effect, the hiding of an object's code), moving classes and objects to
imported methods, leaving in this way the programming style to the
well-known old style: sequential programming and functions.
About utf-8... the same solution: keep utf-8, but for the non-experts add
methods to convert to solutions which use the range 128-255 of only one
byte (I do not give a damn about Chinese and "similia"!...)
I know that it is a lost battle (in Italian, "una battaglia persa")!

Well, there won't be a Python 2.8, so you really should consider
moving at some point. Python 3.3 is already way better than 2.7 in
many ways, 3.4 will improve on 3.3, and the future is pretty clear.
But nobody's forcing you, and 2.7.x will continue to get
bugfix/security releases for a while. (Personally, I'd be happy if
everyone moved off the 2.3/2.4 releases. It's not too hard supporting
2.6+ or 2.7+.)

The thing is, you're thinking about UTF-8, but you should be thinking
about Unicode. I recommend you read these articles:

http://www.joelonsoftware.com/articles/Unicode.html
http://unspecified.wordpress.com/20...e-of-language-level-abstract-unicode-strings/

So long as you are thinking about different groups of characters as
different, and wanting a solution that maps characters down into the
<256 range, you will never be able to cleanly internationalize. With
Python 3.3+, you can ignore the differences between ASCII, BMP, and
SMP characters; they're all just "characters". Everything works
perfectly with Unicode.

ChrisA
 

ferdy.blatsco

Hi Steven,

thank you for your reply... I really needed another Python guru who
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle", as in "unpleasant... > ... uncorrect").

Apart from these trifles, you said: [...]
Not using Python 3, for me (a programmer who was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from C to Pascal and so on) it was a hard job to arrange the
abrupt transition from characters simply equal to bytes to some special
characters defined with 2, 3, or even more bytes.
I should have preferred another solution... but I'm not Guido....!

I said:
in the first version the utf-8 conversion to hex was shown horizontally
And you replied: [...]
You are correct, but I was only referring to "special characters"...
My main concern was compactness of output, and besides that, every group of
bytes used for defining "special characters" is well represented, with the
high nibble in the range outside ascii 0-127.

Your following observations are connected more or less to the above point,
and sorry if the interpretation of the output... sucks!
I think that, for the interested user, the whole question is of minor
importance.

Only one other point is relevant for me: apart from your kind observation
(... "hideously ugly to read") referring to my code snippet incrementing
the loop variable... you are correct.
I will never make the same mistake!

Bye, Blatt.
 

Chris Angelico

Not using Python 3, for me (a programmer who was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from C to Pascal and so on) it was a hard job to arrange the
abrupt transition from characters simply equal to bytes to some special
characters defined with 2, 3, or even more bytes.

Even back then, bytes and characters were different. 'A' is a
character, 0x41 is a byte. And they correspond 1:1 if and only if you
know that your characters are represented in ASCII. Other encodings
(eg EBCDIC) mapped things differently. The only difference now is that
more people are becoming aware that there are more than 256 characters
in the world.
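(To illustrate the point with a small sketch, Python 2; cp500 is one of
the EBCDIC codecs shipped with Python:)

# -*- coding: utf-8 -*-
print u'A'.encode('ascii').encode('hex')    # 41
print u'A'.encode('cp500').encode('hex')    # c1  (EBCDIC)
print u'è'.encode('utf-8').encode('hex')    # c3a8
print u'è'.encode('latin-1').encode('hex')  # e8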

Like Magic 2014 and its treatment of Slivers, at some point you're
going to have to master the difference between bytes and characters,
or else be eternally hacking around stuff in your code, so now is as
good a time as any.

ChrisA
 

Dave Angel

Hi Steven,

thank you for your reply... I really needed another Python guru who
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle", as in "unpleasant... > ... uncorrect").

Apart from these trifles, you said: [...]
Not using Python 3, for me (a programmer who was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from C to Pascal and so on) it was a hard job to arrange the
abrupt transition from characters simply equal to bytes to some special
characters defined with 2, 3, or even more bytes.

Characters do not have a width. They are Unicode code points, an
abstraction. It's only when you encode them in byte strings that a code
point takes on any specific width. And some encodings go to one-byte
strings (and get errors for characters that don't match), some go to
two bytes each, some variable, etc.
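A small sketch of that distinction (Python 2; the example characters are
arbitrary):

u = u'\xe8'                       # è: a single code point, no width yet
print len(u.encode('utf-8'))      # 2  (variable-width encoding)
print len(u.encode('latin-1'))    # 1  (one-byte encoding)
print len(u.encode('utf-16-be'))  # 2  (two bytes each)
try:
    u'\u0153'.encode('latin-1')   # oe ligature: no latin-1 byte exists
except UnicodeEncodeError:
    print 'not representable in this character set'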
I should have preferred another solution... but I'm not Guido....!

But Unicode has nothing to do with Guido, and it has existed for about
25 years (if I recall correctly). It's only that Python 3 is finally
embracing it, and making it the default type for characters, as it
should be. As far as I'm concerned, the only reason it shouldn't have
been done long ago was that programs were trying to fit on 640k DOS
machines. Even before Unicode, there were multi-byte encodings around
(eg. Microsoft's MBCS), and each was thoroughly incompatible with all
the others. And the problem with one-byte encodings is that if you need
to use a Greek currency symbol in a document that's mostly Norwegian (or
some such combination of characters), there might not be ANY valid way
to do it within a single "character set."

Python 2 supports all the same Unicode features as 3; it's just that it
defaults to byte strings. So it's HARDER to get it right.

Except for special purpose programs like a file dumper, it's usually
unnecessary for a Python 3 programmer to deal with individual bytes from
a byte string. Text files are a bunch of bytes, and somebody has to
interpret them as characters. If you let open() handle it, and if you
give it the correct encoding, it just works. Internally, all strings
are Unicode, and you don't care where they came from, or what human
language they may have characters from. You can combine strings from
multiple places, without much worry that they might interfere.
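For example (Python 3 here, to match the paragraph above; the file name
is made up):

with open('hex.txt', encoding='utf-8') as f:  # decoding happens at the file boundary
    for line in f:
        print(len(line), repr(line))          # lengths count characters, not bytes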


Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS)
from the beginning (approx 1992), and has had Unicode versions of each
of its API's for nearly as long.

I appreciate you've been around a long time, and worked in a lot of
languages. I've programmed professionally in at least 35 languages
since 1967. But we've come a long way from the 6-bit characters I used
in 1968. At that time, we packed them 10 characters to each word.
 

Chris Angelico

But Unicode has nothing to do with Guido, and it has existed for about 25
years (if I recall correctly).

Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.

[1] http://en.wikipedia.org/wiki/Unicode#History

ChrisA
 

Dave Angel

But Unicode has nothing to do with Guido, and it has existed for about 25
years (if I recall correctly).

Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.

Well, then I'm glad I stuck the qualifier on it. I remember where I was
working, and that company folded in 1992. I was working on NT long
before its official release in 1993, and it used Unicode, even if the
spec was sliding along. I'm sure I got unofficial versions of things
through Microsoft, at the time.
 

Chris Angelico

But Unicode has nothing to do with Guido, and it has existed for about 25
years (if I recall correctly).


Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.

Well, then I'm glad I stuck the qualifier on it. I remember where I was
working, and that company folded in 1992. I was working on NT long before
its official release in 1993, and it used Unicode, even if the spec was
sliding along. I'm sure I got unofficial versions of things through
Microsoft, at the time.

No doubt! Of course, this list is good at dealing with the hard facts
and making sure the archives are accurate, but that doesn't change
your memory.

Anyway, your fundamental point isn't materially affected by whether
Unicode is 17 or 25 years old. It's been around plenty long enough by
now, we should use it. Same with IPv6, too...

ChrisA
 

MRAB

Characters do not have a width.

[snip]

It depends what you mean by "width"! :)

Try this (Python 3):
print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
AA

Serious question: How would one find the width of a character by that
definition?
'F'

The possible widths are:

N = Neutral
A = Ambiguous
H = Halfwidth
W = Wide
F = Fullwidth
Na = Narrow

All you then need to do is find out what those actually mean...
 

Steven D'Aprano

On 08/07/2013 21:56, Dave Angel wrote:
Characters do not have a width.

[snip]

It depends what you mean by "width"! :)

Try this (Python 3):

print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
AA

Serious question: How would one find the width of a character by that
definition?
import unicodedata
unicodedata.east_asian_width("A") 'Na'
unicodedata.east_asian_width("\N{FULLWIDTH LATIN CAPITAL LETTER
A}")
'F'

The possible widths are:

N = Neutral
A = Ambiguous
H = Halfwidth
W = Wide
F = Fullwidth
Na = Narrow

All you then need to do is find out what those actually mean...

In some East-Asian encodings, there are code-points for Latin characters
in two forms: "half-width" and "full-width". The half-width form took up
a single fixed-width column; the full-width forms took up two fixed-width
columns, so they would line up nicely in columns with Asian characters.

See also:

http://www.unicode.org/reports/tr11/

and search Wikipedia for "full-width" and "half-width".
 

Steven D'Aprano

But Unicode has nothing to do with Guido, and it has existed for about
25 years (if I recall correctly).

Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the spec
was published. Also, the full Unicode range with multiple planes came
about in 1996, with Unicode 2.0, so that could also be considered the
beginning of Unicode. But that still means it's nearly old enough to
drink, so programmers ought to be aware of it.

Yes, yes, a thousand times yes. It's really not that hard to get the
basics of Unicode.

"When I discovered that the popular web development tool PHP has almost
complete ignorance of character encoding issues, blithely using 8 bits
for characters, making it darn near impossible to develop good
international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in
2003 and you don't know the basics of characters, character sets,
encodings, and Unicode, and I catch you, I'm going to punish you by
making you peel onions for 6 months in a submarine. I swear I will."

http://www.joelonsoftware.com/articles/Unicode.html

Also: http://nedbatchelder.com/text/unipain.html




To start with, if you're writing code for Python 2.x, and not using u''
for strings, then you're making a rod for your own back. Do yourself a
favour and get into the habit of always using u'' strings in Python 2.
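A two-line sketch of the difference (Python 2, with a utf-8 source
encoding declared):

# -*- coding: utf-8 -*-
b = 'è'    # byte string: the two UTF-8 bytes c3 a8
u = u'è'   # unicode string: one character
print len(b), len(u)   # 2 1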


I'll-start-taking-my-own-advice-next-week-I-promise-ly yrs,
 

Steven D'Aprano

Not using Python 3, for me (a programmer who was present at the
beginning of computer science, badly interacting with many languages
from assembler to Fortran and from C to Pascal and so on) it was a hard
job to arrange the abrupt transition from characters simply equal to bytes

Characters have *never* been equal to bytes. Not even Perl treats the
character 'A' as equal to the byte 0x0A:

if (0x0A eq 'A') {print "Equal\n";}
else {print "Unequal\n";}

will print Unequal, even if you replace "eq" with "==". Nor does Perl
consider the character 'A' equal to 65.

If you have learned to think of characters being equal to bytes, you have
learned wrong.

to some special characters defined with 2, 3, or even more bytes. I
should have preferred another solution... but I'm not Guido....!

What's a special character?

To an Italian, the characters J, K, W, X and Y are "special characters"
which do not exist in the ordinary alphabet. To a German, they are not
special, but S is special because you write SS as ß, but only in
lowercase.

To a mathematician, σ is just as ordinary as it would be to a Greek; but
the mathematician probably won't recognise ς unless she actually is
Greek, even though they are the same letter.

To an American electrician, Ω is an ordinary character, but ω isn't.

To anyone working with angles, or temperatures, the degree symbol ° is an
ordinary character, but the radian symbol is not. (I can't even find it.)

The English have forgotten that W used to be a ligature for VV, and
consider it a single ordinary character. But the ligature Æ is considered
an old-fashioned way of writing AE.

But to Danes and Norwegians, Æ is an ordinary letter, as distinct from AE
as TH is from Þ. (Which English used to have.) And so on...

I don't know what a special character is, unless it is the ASCII NUL
character, since that terminates C strings.
 

wxjmfauth

On Tuesday, 9 July 2013 at 09:00:02 UTC+2, Steven D'Aprano wrote:

Characters have *never* been equal to bytes. Not even Perl treats the
character 'A' as equal to the byte 0x0A:

if (0x0A eq 'A') {print "Equal\n";}
else {print "Unequal\n";}

will print Unequal, even if you replace "eq" with "==". Nor does Perl
consider the character 'A' equal to 65.

If you have learned to think of characters being equal to bytes, you have
learned wrong.

What's a special character?

To an Italian, the characters J, K, W, X and Y are "special characters"
which do not exist in the ordinary alphabet. To a German, they are not
special, but S is special because you write SS as ß, but only in
lowercase.

To a mathematician, σ is just as ordinary as it would be to a Greek; but
the mathematician probably won't recognise ς unless she actually is
Greek, even though they are the same letter.

To an American electrician, Ω is an ordinary character, but ω isn't.

To anyone working with angles, or temperatures, the degree symbol ° is an
ordinary character, but the radian symbol is not. (I can't even find it.)

The English have forgotten that W used to be a ligature for VV, and
consider it a single ordinary character. But the ligature Æ is considered
an old-fashioned way of writing AE.

But to Danes and Norwegians, Æ is an ordinary letter, as distinct from AE
as TH is from Þ. (Which English used to have.) And so on...

I don't know what a special character is, unless it is the ASCII NUL
character, since that terminates C strings.

--------

The concept of "special characters" does not exist.
However, the definition of a "character" is a problem
per se (character, glyph, grapheme, ...).

You are confusing Unicode, typography and linguistics.

There is no symbol for radian because mathematically
radian is a pure number, a unitless number. You can
however specify a = ... in radians (rad).

Note the difference between SS and ẞ:
'FRANZ-JOSEF-STRAUSS-STRAẞE'

jmf
 

Chris “Kwpolska” Warrick

Note the difference between SS and ẞ:
'FRANZ-JOSEF-STRAUSS-STRAẞE'

This is a capital Eszett. Which just happens not to exist in German.
Germans do not use this character, it is not available on German
keyboards, and the German spelling rules have you replace ß with SS.
And, surprise surprise, STRASSE is the example the Council for German
Orthography used ([0] page 29, §25 E3).

[0]: http://www.neue-rechtschreibung.de/regelwerk.pdf
 

Neil Cerutti

I appreciate you've been around a long time, and worked in a
lot of languages. I've programmed professionally in at least
35 languages since 1967. But we've come a long way from the
6-bit characters I used in 1968. At that time, we packed them
10 characters to each word.

One of the first Python projects I undertook was a program to dump
the ZSCII strings from Infocom game files. They are mostly packed
one character per 5 bits, with escapes to (I had to recheck the
Z-machine spec) latin-1. Oh, those clever implementors: thwarting
hex-dumping cheaters and cramming their games onto microcomputers
at one blow.
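A rough sketch of that layout (an illustration of the scheme the
Z-machine spec describes: three 5-bit codes per 16-bit word, with the
high bit marking the last word of a string):

def unpack_zchars(word):
    # split one 16-bit word into its three 5-bit Z-character codes
    return [(word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F]

print unpack_zchars(0x1234)    # [4, 17, 20]
print bool(0x9234 & 0x8000)    # True: high bit set, last word of the string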
 
