a little parsing challenge ☺


Xah Lee

2011-07-16

folks, this one will be an interesting one.

the problem is to write a script that can check a dir of text files
(and all subdirs) and report if a file has any mismatched brackets.
…

Ok, here's my solution (pasted at bottom). I haven't tried to make it
elegant or terse yet, seeing that many are already more elegant than i
could possibly make mine.

my solution basically uses a stack. (i think all of us are doing
something similar) Here are the steps (a rough Python sketch follows
the list):

• Go thru the file char by char, looking for bracket chars.
• When one is found, check whether the char on top of the stack is its
matching opening char. If so, remove it. Else, push the current char
onto the stack.
• Repeat the above till end of file.
• If the stack is not empty, then the file has mismatched brackets.
Report it.
• Do the above on all files.
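
here's roughly what that looks like for one file, in Python. (just a
sketch to show the idea; the pairs dict is minimal and check_file is a
made-up name. the script i actually ran is the elisp at the bottom.)

# -*- coding: utf-8 -*-
# sketch of the stack approach for one file (Python 2)
pairs = {u')': u'(', u']': u'[', u'}': u'{', u'\u201d': u'\u201c'}  # closer -> opener; add more pairs
valid = set(pairs) | set(pairs.values())

def check_file(path, encoding='utf-8'):
    """Return [] if all brackets match, else the leftover stack of (char, position)."""
    stack = []  # each element is (bracket char, 0-based position in the file)
    text = open(path, 'rb').read().decode(encoding)
    for pos, ch in enumerate(text):
        if ch not in valid:
            continue                        # not a bracket char, skip
        if ch in pairs and stack and stack[-1][0] == pairs[ch]:
            stack.pop()                     # closer matches the opener on top: remove it
        else:
            stack.append((ch, pos))         # otherwise push the current bracket
    return stack                            # non-empty means mismatched brackets

# e.g. print check_file('some_file.html')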

Many elegant solutions. Raymond Hettinger was very quick, posting a
solution only an hour or so after i posted the problem. Many others are
very short, very nice. Thank you all for writing them. I haven't
studied them yet. I'll run them all and post a summary in 2 days. (i
have a few thousand files to run this test thru, many of them with
mismatched brackets. So i have good data to test with.)

PS we still lack Perl, Scheme Lisp, Tcl, Lua versions. These
wouldn't be hard and would be interesting to read. If you are picking
up one of these languages, this would be a good exercise. Haskell too. I
particularly would like to see a javascript version run from the
command line. Maybe somebody can put this exercise to the Google
folks ... they are like the js gods.

also, now that we have these home-brewed solutions, how would a parser
expert do it? Is it possible to make it even simpler by using some
parser tools? (i have no idea what lex and yacc do, or their modern
incarnations) I've also been thinking about whether this can be done
with Parsing Expression Grammar. That would make the code semantics
really elegant (as opposed to home-cooked stack logic).
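
the grammar itself would be roughly: Content <- (Delim / any non-bracket
char)* and Delim <- "(" Content ")" / "[" Content "]" / … . A hand-rolled
recursive-descent sketch of that idea in Python (hypothetical, just to
see the shape of the semantics; not something i've run on my data):

# -*- coding: utf-8 -*-
# PEG-style check, written out as recursive descent (Python 2):
#   Content <- (Delim / NonBracket)*
#   Delim   <- '(' Content ')' / '[' Content ']' / '{' Content '}'
# a file passes if Content consumes the whole text.
# note: deeply nested files could hit Python's recursion limit.
PAIRS = {u'(': u')', u'[': u']', u'{': u'}'}          # opener -> closer; add more pairs
BRACKETS = set(PAIRS) | set(PAIRS.values())

def parse_content(text, i):
    """Consume (Delim / NonBracket)* starting at index i; return the index where it stops."""
    while i < len(text):
        ch = text[i]
        if ch in PAIRS:                               # try Delim: opener, Content, matching closer
            j = parse_content(text, i + 1)
            if j < len(text) and text[j] == PAIRS[ch]:
                i = j + 1
            else:
                return i                              # unclosed opener: stop here
        elif ch in BRACKETS:
            return i                                  # a closer we never opened: stop here
        else:
            i += 1                                    # ordinary character
    return i

def balanced(text):
    return parse_content(text, 0) == len(text)

# balanced(u'a(b[c]{d})e') -> True ; balanced(u'a(b]c') -> False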

Xah

;; -*- coding: utf-8 -*-
;; 2011-07-15, Xah Lee
;; go thru a file, check if all brackets are properly matched.
;; e.g. good: (…{…}… “…”…)
;; bad: ( [)]
;; bad: ( ( )

(setq inputDir "~/web/xahlee_org/p/") ; must end in slash

(defvar matchPairs '()
  "An alist. For each pair, the car is the opening char, the cdr is the closing char.")

(setq matchPairs
      '(("(" . ")")
        ("{" . "}")
        ("[" . "]")
        ("“" . "”")
        ("‹" . "›")
        ("«" . "»")
        ("【" . "】")
        ("〈" . "〉")
        ("《" . "》")
        ("「" . "」")
        ("『" . "』")))

(defvar searchRegex "" "Regex string of all pairs to search.")
(setq searchRegex "")
(mapc
 (lambda (mypair)
   (setq searchRegex
         (concat searchRegex
                 (regexp-quote (car mypair)) "|"
                 (regexp-quote (cdr mypair)) "|")))
 matchPairs)

(setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t t)) ; remove the ending “|”

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t t)) ; change | to \\| for regex “or” operation

(defun my-process-file (fpath)
  "Process the file at fullpath FPATH ..."
  (let (myBuffer (ii 0) myStack ξchar ξpos)

    (setq myStack '()) ; each element is a vector [char position]
    (setq ξchar "")

    (setq myBuffer (get-buffer-create " myTemp"))
    (set-buffer myBuffer)
    (insert-file-contents fpath nil nil nil t)

    (goto-char 1)
    (while (search-forward-regexp searchRegex nil t)
      (setq ξpos (point))
      (setq ξchar (buffer-substring-no-properties ξpos (- ξpos 1)))

      ;; (princ (format "-----------------------------\nfound char: %s\n" ξchar))

      (let ((isClosingCharQ nil) (matchedOpeningChar nil))
        (setq isClosingCharQ (rassoc ξchar matchPairs))
        (when isClosingCharQ
          (setq matchedOpeningChar (car isClosingCharQ)))

        ;; (princ (format "isClosingCharQ is: %s\n" isClosingCharQ))
        ;; (princ (format "matchedOpeningChar is: %s\n" matchedOpeningChar))

        (if (and
             (car myStack) ; not empty
             (equal (elt (car myStack) 0) matchedOpeningChar))
            (progn
              ;; (princ (format "matched this bottom item on stack: %s\n" (car myStack)))
              (setq myStack (cdr myStack)))
          (progn
            ;; (princ (format "did not match this bottom item on stack: %s\n" (car myStack)))
            (setq myStack (cons (vector ξchar ξpos) myStack)))))

      ;; (princ "current stack: ")
      ;; (princ myStack)
      ;; (terpri)
      )

    (when (not (equal myStack nil))
      (princ "Error file: ")
      (princ fpath)
      (print (car myStack)))
    (kill-buffer myBuffer)))


;; (require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah match pair output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))
 

Thomas 'PointedEars' Lahn

Rouslan said:
I don't know why, but I just had to try it (even though I don't usually
use Perl and had to look up a lot of stuff). I came up with this:

I don't know why … you replied to my posting/e-mail (but quoted nothing from
it, much less referred to its content), and posted a lot of Perl code in a
Python newsgroup/on a Python mailing list.
 

Billy Mays

2011-07-16

I gave it a shot. It doesn't do any of the Unicode delims, because
let's face it, Unicode is for goobers.


import sys, os

pairs = {'}':'{', ')':'(', ']':'[', '"':'"', "'":"'", '>':'<'}
valid = set( v for pair in pairs.items() for v in pair )

for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        stack = [' ']
        with open(os.path.join(dirpath, name), 'rb') as f:
            chars = (c for line in f for c in line if c in valid)
            for c in chars:
                if c in pairs and stack[-1] == pairs[c]:
                    stack.pop()
                else:
                    stack.append(c)
        print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name
 

Ian Kelly

I gave it a shot.  It doesn't do any of the Unicode delims, because let's
face it, Unicode is for goobers.

Uh, okay...

Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.
 

Rouslan Korneychuk

I don't know why … you replied to my posting/e-mail (but quoted nothing from
it, much less referred to its content), and posted a lot of Perl code in a
Python newsgroup/on a Python mailing list.

Well, when I said I had to try *it*, I was referring to using a Perl
compatible regular expression, which you brought up. I guess I should
have quoted that part. As for what I posted, the crux of it was a single
regular expression. The Perl code at the bottom was just to point out
that I didn't type that monstrosity out manually. I was going to put
that part in brackets but there were already so many.
 

sln

2011-07-16

folks, this one will be interesting one.

the problem is to write a script that can check a dir of text files
(and all subdirs) and reports if a file has any mismatched matching
brackets.
[snip]
i hope you'll participate. Just post solution here. Thanks.

I have to hunt for a job so I'm not writing a solution for you.
Here is a thin regex framework that may get you started.

-sln

---------------------

use strict;
use warnings;

my @samples = qw(
A98(y[(np)r]x)tp[kk]a.exeb
A98(y[(np)r]x)tp[kk]a}.exeb
A98(‹ynprx)tpk›ka.mpeg
‹A98(ynprx)tpk›ka
“A9«8(yn«pr{{[g[x].}*()+}»)tpkka».”
“A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».”
“A9«8(yn«pr»)tpkka».”
“A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
“A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
);

my $regex = qr/

^ (?&FileName) $

(?(DEFINE)

(?<Delim>
\( (?&Content) \)
| \{ (?&Content) \}
| \[ (?&Content) \]
| \“ (?&Content) \”
| \‹ (?&Content) \›
| \« (?&Content) \»
# add more here ..
)

(?<Content>
(?: (?> [^(){}\[\]“”‹›«»]+ ) # add more here ..
| (?&Delim)
)*
)

(?<FileName>
(?&Content)
)
)
/x;


for (@samples) {
    print "$_ - ";
    if ( /$regex/ ) {
        print "passed \n";
    }
    else {
        print "failed \n";
    }
}

__END__

Output:

A98(y[(np)r]x)tp[kk]a.exeb - passed
A98(y[(np)r]x)tp[kk]a}.exeb - failed
A98(‹ynprx)tpk›ka.mpeg - failed
‹A98(ynprx)tpk›ka - passed
“A9«8(yn«pr{{[g[x].}*()+}»)tpkka».” - failed
“A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».” - passed
“A9«8(yn«pr»)tpkka».” - passed
“A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - passed
“A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - failed
 

Thomas 'PointedEars' Lahn

Ian said:
Uh, okay...

Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.

Thanks to Python's expressiveness, this can be easily remedied (see below).

I also do not follow Billy's comment about Unicode. Unicode and the fact
that Python supports it *natively* cannot be appreciated enough in a
globalized world.

However, I have learned a lot about being pythonic from his posting (take
those generator expressions, for example!), and the idea of looking at the
top of a stack for reference is a really good one. Thank you, Billy!

Here is my improvement of his code, which should fill the mentioned gaps.
I have also reversed the order in the report line as I think it is more
natural this way. I have tested the code superficially with a directory
containing a single text file. Watch for word-wrap:

# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <[email protected]>, based on an idea of
Billy Mays <[email protected]>
in <'''
import sys, os

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            stack = [' ']

            # you can use chardet etc. instead
            encoding = 'utf-8'

            with open(os.path.join(dirpath, name), 'r') as f:
                reported = False
                chars = ((c, line_no, col) for line_no, line in enumerate(f)
                         for col, c in enumerate(line.decode(encoding)) if c in valid)
                for c, line_no, col in chars:
                    if c in pairs:
                        if stack[-1] == pairs[c]:
                            stack.pop()
                        else:
                            if not reported:
                                first_bad = (c, line_no + 1, col + 1)
                                reported = True
                    else:
                        stack.append(c)

            print '%s: %s' % (name, ("good" if len(stack) == 1 else
                                     "bad '%s' at %s:%s" % first_bad))
 

Steven D'Aprano

Billy said:
I gave it a shot. It doesn't do any of the Unicode delims, because
let's face it, Unicode is for goobers.

Goobers... that would be one of those new-fangled slang terms that the young
kids today use to mean its opposite, like "bad", "wicked" and "sick",
correct?

I mention it only because some people might mistakenly interpret your words
as a childish and feeble insult against the 98% of the world who want or
need more than the 127 characters of ASCII, rather than understand you
meant it as a sign of the utmost respect for the richness and diversity of
human beings and their languages, cultures, maths and sciences.
 

Billy Mays

Goobers... that would be one of those new-fangled slang terms that the young
kids today use to mean its opposite, like "bad", "wicked" and "sick",
correct?

I mention it only because some people might mistakenly interpret your words
as a childish and feeble insult against the 98% of the world who want or
need more than the 127 characters of ASCII, rather than understand you
meant it as a sign of the utmost respect for the richness and diversity of
human beings and their languages, cultures, maths and sciences.

TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.

As long as I have used python (which I admit has only been 3 years)
Unicode has never appeared to be implemented correctly. I'm probably
repeating old arguments here, but whatever.

Unicode is a mess. When someone says ASCII, you know that they can only
mean characters 0-127. When someone says Unicode, do they mean real
Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
When using the 'u' datatype with the array module, the docs don't even
tell you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that
all of these can be figured out, but the problem is now I have to
ask every one of these questions whenever I want to use strings.

Secondly, Python doesn't do Unicode exception handling correctly (but I
suspect that it's a broader problem with languages). A good example of
this is with UTF-8 where there are invalid code points ( such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).

When embedding Python in a long running application where user input is
received, it is very easy to make a mistake which brings down the whole
program. If any user string isn't properly try/excepted, a user could
craft a malformed string which a UTF-8 decoder would choke on. Using
ASCII (or whatever 8 bit encoding) doesn't have these problems since all
codepoints are valid.

Another (this must have been a good laugh amongst the UniDevs) 'feature'
of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Any string can masquerade as any other string by placing a few of these
in a string. Any word filters you might have are now defeated by some
cheesy Unicode nonsense character. Can you just check for these
characters and strip them out? Yes. Should you have to? I would say no.

Does it get better? Of course! international character sets used for
domain name encoding use yet a different scheme (Punycode). Are the
following two domain names the same: tést.com , xn--tst-bma.com ? Who
knows!

I suppose I can gloss over the pains of using Unicode in C with every
string needing to be an LPS since 0x00 is now a valid code point in
UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
strlen or concatenation operations.

Can it get even better? Yep. We also now need to have a Byte order
Mark (BOM) to determine the endianness of our characters. Are they
little endian or big endian? (or perhaps one of the two possible middle
endian encodings?) Who knows? String processing with unicode is
unpleasant to say the least. I suppose that's what we get when
things are designed by committee.

But Hey! The great thing about standards is that there are so many to
choose from.
 

rusi

TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.
[snip]
But Hey!  The great thing about standards is that there are so many to
choose from.

Thanks for writing that.
Every time I try to understand unicode and remain stuck, I come to the
conclusion that I must be an imbecile.
Seeing that others (probably more intelligent than yours truly) are
stuck too gives me some solace!

[And I am writing this from India where there are dozens of languages,
almost as many scripts and everyone speaks and writes at least a
couple of non-european ones]
 

MRAB

TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem).

As long as I have used python (which I admit has only been 3 years)
Unicode has never appeared to be implemented correctly. I'm probably
repeating old arguments here, but whatever.

Unicode is a mess. When someone says ASCII, you know that they can only
mean characters 0-127. When someone says Unicode, do the mean real
Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When
using the 'u' datatype with the array module, the docs don't even tell
you if its 2 bytes wide or 4 bytes. Which is it? I'm sure that all the
of these can be figured out, but the problem is now I have to ask every
one of these questions whenever I want to use strings.

That's down to whether it's a narrow or wide Python build. There's a
PEP suggesting a fix for that (PEP 393).

Secondly, Python doesn't do Unicode exception handling correctly. (but I
suspect that its a broader problem with languages) A good example of
this is with UTF-8 where there are invalid code points ( such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).

Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.

When embedding Python in a long running application where user input is
received, it is very easy to make mistake which bring down the whole
program. If any user string isn't properly try/excepted, a user could
craft a malformed string which a UTF-8 decoder would choke on. Using
ASCII (or whatever 8 bit encoding) doesn't have these problems since all
codepoints are valid.

What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?

Another (this must have been a good laugh amongst the UniDevs) 'feature'
of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Any string can masquerade as any other string by placing few of these in
a string. Any word filters you might have are now defeated by some
cheesy Unicode nonsense character. Can you just just check for these
characters and strip them out? Yes. Should you have to? I would say no.

Does it get better? Of course! international character sets used for
domain name encoding use yet a different scheme (Punycode). Are the
following two domain names the same: tést.com , xn--tst-bma.com ? Who
knows!

I suppose I can gloss over the pains of using Unicode in C with every
string needing to be an LPS since 0x00 is now a valid code point in
UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
strlen or concatenation operations.

0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes,
so that zero-byte can be used as a terminator. You can't do that in
ASCII! :)

Can it get even better? Yep. We also now need to have a Byte order Mark
(BOM) to determine the endianness of our characters. Are they little
endian or big endian? (or perhaps one of the two possible middle endian
encodings?) Who knows? String processing with unicode is unpleasant to
say the least. I suppose that's what we get when we things are designed
by committee.

Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on
output. You don't have to worry about endianness when processing
Unicode strings internally; they're just a series of codepoints.
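
A minimal sketch of that rule in Python 2 (the filenames here are just
placeholders):

# -*- coding: utf-8 -*-
# decode on input, work with unicode internally, encode on output (Python 2)
import codecs

f = codecs.open('in.txt', 'r', encoding='utf-8')    # bytes from disk -> unicode objects
text = f.read()
f.close()

# internal processing sees only code points, never bytes or endianness
text = text.replace(u'\u201c', u'"').replace(u'\u201d', u'"')

f = codecs.open('out.txt', 'w', encoding='utf-8')   # unicode objects -> bytes on disk
f.write(text)
f.close()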
 

Benjamin Kaplan

TL;DR version: international character sets are a problem, and Unicode is not the answer to that problem).

As long as I have used python (which I admit has only been 3 years) Unicode has never appeared to be implemented correctly.  I'm probably repeating old arguments here, but whatever.

Unicode is a mess.  When someone says ASCII, you know that they can only mean characters 0-127.  When someone says Unicode, do the mean real Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When using the 'u' datatype with the array module, the docs don't even tell you if its 2 bytes wide or 4 bytes.  Which is it?  I'm sure that all the of these can be figured out, but the problem is now I have to ask every one of these questions whenever I want to use strings.

It doesn't matter. When you use the unicode data type in Python, you
get to treat it as a sequence of characters, not a sequence of bytes.
The fact that it's stored internally as UCS-2 or UCS-4 is irrelevant.

Secondly, Python doesn't do Unicode exception handling correctly. (but I suspect that its a broader problem with languages) A good example of this is with UTF-8 where there are invalid code points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as well as everyone else who wants to use strings for some reason).

A Unicode code point is of the form U+XXXX. 0xC0 is not a Unicode code
point, it is a byte. It happens to be an invalid byte using the UTF-8
byte encoding (which is not Unicode, it's a byte string). The Unicode
code point U+00C0 is perfectly valid- it's a LATIN CAPITAL LETTER A
WITH GRAVE.
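
A quick Python 2 session makes the distinction visible (just a sketch):

# -*- coding: utf-8 -*-
# code point vs byte (Python 2)
a_grave = u'\u00c0'                       # the code point U+00C0, a perfectly valid character
print repr(a_grave.encode('utf-8'))       # its UTF-8 encoding is two bytes: '\xc3\x80'
print repr('\xc3\x80'.decode('utf-8'))    # and it decodes back to the single code point: u'\xc0'
try:
    '\xc0'.decode('utf-8')                # a lone 0xC0 byte is not valid UTF-8 ...
except UnicodeDecodeError as e:
    print e                               # ... so decoding it reports an error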

When embedding Python in a long running application where user input is received, it is very easy to make mistake which bring down the whole program.  If any user string isn't properly try/excepted, a user could craft a malformed string which a UTF-8 decoder would choke on.  Using ASCII (or whatever 8 bit encoding) doesn't have these problems since all codepoints are valid.

UTF-8 != Unicode. UTF-8 is one of several byte encodings capable of
representing every character in the Unicode spec, but it is not
Unicode. If you have a Unicode string, it is not a sequence of bytes,
it is a sequence of characters. If you want a sequence of bytes, use a
byte string. If you are attempting to interpret a sequence of bytes as
a sequence of text, you're doing it wrong. There's a reason we have
both text and binary modes for opening files- yes, there is a
difference between them.

Another (this must have been a good laugh amongst the UniDevs) 'feature' of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B). Any string can masquerade as any other string by placing few of these in a string.  Any word filters you might have are now defeated by some cheesy Unicode nonsense character.  Can you just just check for these characters and strip them out?  Yes.  Should you have to?  I would say no.

Does it get better?  Of course! international character sets used for domain name encoding use yet a different scheme (Punycode).  Are the following two domain names the same: tést.com , xn--tst-bma.com ?  Who knows!

I suppose I can gloss over the pains of using Unicode in C with every string needing to be an LPS since 0x00 is now a valid code point in UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do strlen or concatenation operations.

That is using UTF-8 in C. Which, again, is not the same thing as Unicode.

Can it get even better?  Yep.  We also now need to have a Byte order Mark (BOM) to determine the endianness of our characters.  Are they little endian or big endian?  (or perhaps one of the two possible middle endian encodings?)  Who knows?  String processing with unicode is unpleasant to say the least.  I suppose that's what we get when we things are designed by committee.

And that is UTF-16 and UTF-32. Again, those are byte encodings. They
are not Unicode. When you use a library capable of handling Unicode,
you never see those- you just have a string with characters in it.
 

Steven D'Aprano

Billy said:
TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem).

Shorter version: FUD.

Yes, having a rich and varied character set requires work. Yes, the Unicode
standard itself, and any interface to it (including Python's) are imperfect
(like anything created by fallible humans). But your post is a long and
tedious list of FUD with not one bit of useful advice.

I'm not going to go through the whole post -- life is too short. But here
are two especially egregious examples showing that you have some fundamental
misapprehensions about what Unicode actually is:
Python doesn't do Unicode exception handling correctly. (but I
suspect that its a broader problem with languages) A good example of
this is with UTF-8 where there are invalid code points ( such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).

and then later:
Another (this must have been a good laugh amongst the UniDevs) 'feature'
of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).


This is confused. Unicode text has code points, text which has been encoded
is nothing but bytes and not code points. "UTF-8 code point" does not even
mean anything.

The zero width space has code point U+200B. The bytes you get depend on
which encoding you want; encoded as UTF-16 (with a BOM), for example, it is:
'\xff\xfe\x0b '

But regardless of which bytes it is encoded into, ZWS always has just a
single code point: U+200B.
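
For instance, in Python 2 (the byte reprs assume the usual little-endian
BOM for UTF-16/UTF-32):

# -*- coding: utf-8 -*-
# one code point, several different byte encodings (Python 2)
zws = u'\u200b'                        # ZERO WIDTH SPACE
print len(zws)                         # 1 -- a single code point
print repr(zws.encode('utf-8'))        # '\xe2\x80\x8b'
print repr(zws.encode('utf-16'))       # '\xff\xfe\x0b ' (BOM + 0x200B, little-endian)
print repr(zws.encode('utf-32'))       # '\xff\xfe\x00\x00\x0b \x00\x00'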

You say "A good example of this is with UTF-8 where there are invalid code
points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF" but I don't
even understand why you think this is a problem with Unicode.

0xC0 is not a code point, it is a byte. Not all combinations of bytes are
legal in all files. If you have byte 0xC0 in a file, it cannot be an ASCII
file: there is no ASCII character represented by byte 0xC0, because hex
0xC0 = 192, which is larger than 127.

Likewise, if you have a 0xC0 byte in a file, it cannot be UTF-8. It is as
simple as that. Trying to treat it as UTF-8 will give an error, just as
trying to view a mp3 file as if it were a jpeg will give an error. Why you
imagine this is a problem for Unicode is beyond me.
 

rusi


Yes, I've read that and understood a little bit more thanks to it.
But for the points raised in this thread, this one from Joel is more
relevant:

http://www.joelonsoftware.com/articles/LeakyAbstractions.html

Some evidences of leakiness:
code point vs character vs byte
encoding and decoding
UTF-x and UCS-y

Very important and necessary distinctions? Maybe... But I did not need
them when my world was built of the 127 bricks of ASCII.

My latest brush with unicode was when I tried to port construct to
python3. http://construct.wikispaces.com/

If unicode 'just works' you should be able to do it in a jiffy?
[And if you did I would be glad to be proved wrong :) ]
 

Chris Angelico

Some evidences of leakiness:
code point vs character vs byte
encoding and decoding
UTF-x and UCS-y

Very important and necessary distinctions? Maybe... But I did not need
them when my world was built of the 127 bricks of ASCII.

Codepoint vs byte is NOT an abstraction. Unicode consists of
characters, where each character is represented by a number called its
codepoint. Since computers work with bytes, we need a way of encoding
those characters into bytes. It's no different from encoding a piece
of music in bytes, and having it come out as 0x90 0x64 0x40. Are those
bytes an abstraction of the note? No. They're an encoding of a MIDI
message that requests that the note be struck. The note itself is an
abstraction, if you like; but the bytes to create that note could be
delivered in a variety of other ways.

A Python Unicode string, whether it's Python 2's 'unicode' or Python
3's 'str', is a sequence of characters. Since those characters are
stored in memory, they must be encoded somehow, but that's not our
problem. We need only care about encoding when we save those
characters to disk, transmit them across the network, or in some other
way need to store them as bytes. Otherwise, there is no abstraction,
and no leak.

Chris Angelico
 

Thomas 'PointedEars' Lahn

Thomas said:
with open(os.path.join(dirpath, name), 'r') as f:

SHOULD be

with open(os.path.join(dirpath, name), 'rb') as f:

(as in the original), else some code units might not be read properly.
 

Xah Lee

TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.
[snip]
But Hey!  The great thing about standards is that there are so many to
choose from.

might check out my take

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

especially good for emacs users.

if you grew up with english, unicode might seem complex or difficult
due to unfamiliarity.

but for asian people, who don't have alphabets, it's kinda strange
to think that a byte is a char. The notion simply doesn't exist and is
impossible to establish. There were many encodings for chinese before
unicode. Even today, unicode isn't used in taiwan or china. Taiwan
uses Big5, china uses GB18030, which contains all the chars of unicode.

~8 years ago i thought that it'd be great if china adopted unicode
sometime in the future... so that we all just have one charset to
deal with. But that's never gonna happen. On the contrary, i'm thinking
now there's the possibility that the world adopts GB18030 someday. lol.
if you go to alexa.com for traffic ranking, a good percentage of the
top few sites are chinese these days. more and more, as i have observed
since the mid 2000s.

by the way, here's what these matching pairs are used for.

‹french quote›
«french quote»

the 〈〉 《》 are chinese brackets used for book titles etc. (CD, TV
program, show title, etc.)
the 「」 『』 are traditional chinese quotes, like english's ‘single
curly’, “double curly”
the 【】 〖〗 〔〕 and a few others are variant brackets, similar to
english's () {} [].

Xah
 
