Rouslan Korneychuk
Stefan said: That's solid Perl. Both the code generator and the generated
code are unreadable. Well done!
Why, thank you.
2011-07-16
Folks, this one will be an interesting one.
The problem is to write a script that can check a dir of text files
(and all subdirs) and report if a file has any mismatched brackets.
…
Rouslan said: I don't know why, but I just had to try it (even though I
don't usually use Perl and had to look up a lot of stuff). I came up
with this:
2011-07-16
I gave it a shot. It doesn't do any of the Unicode delims, because let's
face it, Unicode is for goobers.
I don't know why … you replied to my posting/e-mail (but quoted nothing from
it, much less referred to its content), and posted a lot of Perl code in a
Python newsgroup/on a Python mailing list.
2011-07-16
Folks, this one will be an interesting one.
The problem is to write a script that can check a dir of text files
(and all subdirs) and report if a file has any mismatched brackets.
[snip]
I hope you'll participate. Just post your solution here. Thanks.
Ian said: Uh, okay...
Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.
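For reference, here is a minimal sketch of the kind of checker being
discussed, tracking rows and columns so the first mismatch can be
reported. It handles only the ASCII brackets; the .txt filter and the
tolerant UTF-8 reading are assumptions, not part of the original spec.

import os
import sys

PAIRS = {')': '(', ']': '[', '}': '{'}
OPENERS = set(PAIRS.values())

def check_file(path):
    # Return (row, col) of the first mismatched bracket, or None.
    stack = []
    with open(path, encoding='utf-8', errors='replace') as f:
        for row, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ch in OPENERS:
                    stack.append((ch, row, col))
                elif ch in PAIRS:
                    if not stack or stack.pop()[0] != PAIRS[ch]:
                        return row, col
    # Anything still open at end-of-file is also a mismatch.
    return stack[0][1:] if stack else None

if __name__ == '__main__':
    for dirpath, _, names in os.walk(sys.argv[1]):
        for name in names:
            if name.endswith('.txt'):
                path = os.path.join(dirpath, name)
                bad = check_file(path)
                if bad:
                    print('%s: mismatch at row %d, col %d' % ((path,) + bad))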
Billy said: I gave it a shot. It doesn't do any of the Unicode delims,
because let's face it, Unicode is for goobers.
Goobers... that would be one of those new-fangled slang terms that the young
kids today use to mean its opposite, like "bad", "wicked" and "sick",
correct?
I mention it only because some people might mistakenly interpret your words
as a childish and feeble insult against the 98% of the world who want or
need more than the 127 characters of ASCII, rather than understand you
meant it as a sign of the utmost respect for the richness and diversity of
human beings and their languages, cultures, maths and sciences.
TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.
As long as I have used Python (which I admit has only been 3 years),
Unicode has never appeared to be implemented correctly. I'm probably
repeating old arguments here, but whatever.
Unicode is a mess. When someone says ASCII, you know that they can only
mean characters 0-127. When someone says Unicode, do they mean real
Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
When using the 'u' datatype with the array module, the docs don't even
tell you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that
all of these can be figured out, but the problem is that now I have to
ask every one of these questions whenever I want to use strings.
Secondly, Python doesn't do Unicode exception handling correctly (but I
suspect that it's a broader problem with languages). A good example of
this is with UTF-8, where there are invalid code points (such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).
When embedding Python in a long-running application where user input is
received, it is very easy to make a mistake which brings down the whole
program. If any user string isn't properly try/excepted, a user could
craft a malformed string which a UTF-8 decoder would choke on. Using
ASCII (or whatever 8-bit encoding) doesn't have these problems, since
all codepoints are valid.
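For what it's worth, the usual defence is to decode untrusted bytes
exactly once, at the boundary. A minimal sketch; the helper name and
the 'replace' policy are illustrative choices, not anything Python
prescribes:

def decode_untrusted(raw, encoding='utf-8'):
    # Decode user-supplied bytes without letting bad input raise
    # somewhere deep inside the application.
    try:
        return raw.decode(encoding)  # strict: rejects malformed UTF-8
    except UnicodeDecodeError:
        # Log it here if you care, then degrade gracefully: malformed
        # bytes become U+FFFD replacement characters.
        return raw.decode(encoding, errors='replace')

print(decode_untrusted(b'ok'))           # ok
print(decode_untrusted(b'\xc0\xafbad'))  # two U+FFFD, then 'bad'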
Another (this must have been a good laugh amongst the UniDevs) 'feature'
of Unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Any string can masquerade as any other string by placing a few of these
in a string. Any word filters you might have are now defeated by some
cheesy Unicode nonsense character. Can you just check for these
characters and strip them out? Yes. Should you have to? I would say no.
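If you do end up having to strip them, it is only a few lines. A small
sketch; which zero-width characters to treat as suspect is a judgment
call, not a standard list:

ZERO_WIDTH = {'\u200b', '\u200c', '\u200d', '\ufeff'}  # ZWSP, ZWNJ, ZWJ, ZWNBSP

def strip_zero_width(s):
    # Drop zero-width code points before running word filters.
    return ''.join(ch for ch in s if ch not in ZERO_WIDTH)

print(strip_zero_width('spa\u200bm') == 'spam')  # True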
Does it get better? Of course! International character sets used for
domain name encoding use yet another scheme (Punycode). Are the
following two domain names the same: tést.com , xn--tst-bma.com ? Who
knows!
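They are the same name, as it happens: Punycode is just the
ASCII-compatible encoding of the Unicode label, and Python ships an
'idna' codec that converts between the two. A quick illustration (real
IDNA handling has more corner cases than this):

print('tést.com'.encode('idna'))          # b'xn--tst-bma.com'
print(b'xn--tst-bma.com'.decode('idna'))  # tést.com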
I suppose I can gloss over the pains of using Unicode in C, with every
string needing to be an LPS since 0x00 is now a valid code point in
UTF-8 (0x0000 for 2-byte Unicode), or suffer the O(n) lookup time to do
strlen or concatenation operations.
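Python itself sidesteps that particular pain: its strings carry an
explicit length, so U+0000 is an ordinary code point rather than a
terminator. A quick illustration:

s = 'a\x00b'              # an embedded NUL is a legal code point
print(len(s))             # 3; the length is stored, not scanned for
print(s.encode('utf-8'))  # b'a\x00b'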
Can it get even better? Yep. We also now need to have a Byte Order
Mark (BOM) to determine the endianness of our characters. Are they
little endian or big endian? (Or perhaps one of the two possible middle
endian encodings?) Who knows? String processing with Unicode is
unpleasant, to say the least. I suppose that's what we get when things
are designed by committee.
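In practice the codecs sort the byte order out for you. A small sketch
of the usual idioms, assuming CPython's standard codecs:

import codecs

# The plain 'utf-16' codec reads the BOM and picks the endianness itself.
data = codecs.BOM_UTF16_LE + 'hi'.encode('utf-16-le')
print(data.decode('utf-16'))  # hi

# 'utf-8-sig' strips the UTF-8 signature some Windows tools prepend.
print((codecs.BOM_UTF8 + b'hi').decode('utf-8-sig'))  # hi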
But Hey! The great thing about standards is that there are so many to
choose from.
Billy said: [snip] When using the 'u' datatype with the array module,
the docs don't even tell you if it's 2 bytes wide or 4 bytes. Which is
it?
That's down to whether it's a narrow or wide Python build. There's a …
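Both of those questions are answerable at runtime. A quick check,
relevant to builds old enough to distinguish narrow from wide (CPython
3.3 and later is effectively always wide):

import sys
from array import array

print(hex(sys.maxunicode))  # 0xffff on a narrow build, 0x10ffff on a wide one
print(array('u').itemsize)  # bytes per item: 2 or 4, matching the build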
Billy said: [snip] A good example of this is with UTF-8, where there are
invalid code points (such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ...,
0xFF).
Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.
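Easy to demonstrate: none of those byte values can occur anywhere in
well-formed UTF-8, so a strict decode rejects each one. For example:

for b in (0xC0, 0xC1, 0xF5, 0xFF):
    try:
        bytes([b]).decode('utf-8')
    except UnicodeDecodeError as e:
        print(hex(b), e.reason)  # e.g. 0xc0 invalid start byte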
Billy said: [snip] If any user string isn't properly try/excepted, a
user could craft a malformed string which a UTF-8 decoder would choke
on.
What if you give an application an invalid JPEG, PNG or other image?
Billy said: [snip] I suppose I can gloss over the pains of using Unicode
in C, with every string needing to be an LPS since 0x00 is now a valid
code point in UTF-8 (0x0000 for 2-byte Unicode), or suffer the O(n)
lookup time to do strlen or concatenation operations.
0x00 is also a valid ASCII code, but C doesn't let you use it!
Billy said: [snip] We also now need to have a Byte Order Mark (BOM) to
determine the endianness of our characters. Are they little endian or
big endian? (Or perhaps one of the two possible middle endian
encodings?) Who knows?
Proper UTF-8 doesn't have a BOM.
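True: the byte order of UTF-8 is fixed, so the standard codec never
writes a BOM. The separate 'utf-8-sig' codec exists only for the
signature some Windows tools insist on prepending:

print('hi'.encode('utf-8'))      # b'hi', no BOM
print('hi'.encode('utf-8-sig'))  # b'\xef\xbb\xbfhi'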
rusi said: Every time I try to understand Unicode and remain stuck, I
come to the conclusion that I must be an imbecile.
Billy said: TL;DR version: international character sets are a problem,
and Unicode is not the answer to that problem.
Python doesn't do Unicode exception handling correctly (but I suspect
that it's a broader problem with languages). A good example of this is
with UTF-8, where there are invalid code points (such as 0xC0, 0xC1,
0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as well
as everyone else who wants to use strings for some reason).
Another (this must have been a good laugh amongst the UniDevs) 'feature'
of Unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Some evidence of leakiness:
- code point vs character vs byte
- encoding and decoding
- UTF-x and UCS-y
Very important and necessary distinctions? Maybe... But I did not need
them when my world was built of the 127 bricks of ASCII.
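The first of those distinctions can be made concrete in a couple of
lines (a small illustration):

s = 'é'                                  # one code point, U+00E9
print(len(s), len(s.encode('utf-8')))    # 1 2: one code point, two UTF-8 bytes

decomposed = 'e\u0301'                   # 'e' plus COMBINING ACUTE ACCENT
print(decomposed == s, len(decomposed))  # False 2: one "character", two code points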