Trying to strip out non-ASCII... or rather convert non-ASCII


bruce

hi..

Getting some files via curl, and I want to convert them from what I'm
guessing to be Unicode.

I'd like to convert a string like this::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
Iliana</a></div>

to::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
Iliana</a></div>

where I convert the "á" to "a", which appears to be a shift of 128, but
I'm not sure how to accomplish this.

I've tried the various decode/encode functions with utf-8 and ascii,
with no luck.

I've reviewed Stack Overflow, as well as a few other sites, but
haven't hit the aha moment yet.

pointers/comments would be welcome.

thanks
 

Steven D'Aprano

hi..

Getting some files via curl, and I want to convert them from what I'm
guessing to be Unicode.

I'd like to convert a string like this:: <div class="profName"><a
href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana</a></div>

to::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
Iliana</a></div>

where I convert the "á" to "a"

Why on earth would you want to throw away perfectly good information?
It's 2013, not 1953, and if you're still unable to cope with languages
other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the
characters used in American English, let alone British English. ASCII was
broken from the day it was invented.)

Start by getting some understanding:

http://www.joelonsoftware.com/articles/Unicode.html


Then read this post from just over a week ago:

https://mail.python.org/pipermail/python-list/2013-October/657827.html
 

Dennis Lee Bieber

Why on earth would you want to throw away perfectly good information?
It's 2013, not 1953, and if you're still unable to cope with languages
other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the
characters used in American English, let alone British English. ASCII was
broken from the day it was invented.)

Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.
 

Roy Smith

Dennis Lee Bieber said:
Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.

Wondrous, indeed. Why would anybody ever need more than one case of
the alphabet? It's almost as absurd as somebody wanting to put funny
little marks on top of their vowels.
 

Tim Chase

Why on earth would you want to throw away perfectly good
information?

The main reason I've needed to do it in the past is for normalization
of search queries. When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

if term.upper() in datum.upper():
    it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

unicode_haystack1 = u"pingüino"
unicode_haystack2 = u"¡Miré un pingüino!"
needle = u"pinguino"
if unicode_haystack1.sloppy_equals(needle):
    it_matches()
if unicode_haystack2.sloppy_contains(needle):
    it_contains()

As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :)
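
Something close to that can be cobbled together from unicodedata today;
a minimal sketch, assuming that case-folding plus stripping combining
marks is "sloppy" enough (the sloppy_* names below just mirror the
wished-for API, they aren't real string methods):

import unicodedata

def _sloppy_key(s):
    # Decompose accented characters (NFKD) and drop the combining marks,
    # then lower-case so "Polish" and "polish" compare equal.
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = u"".join(c for c in decomposed
                        if not unicodedata.combining(c))
    return stripped.lower()   # or .casefold() on Python 3.3+

def sloppy_equals(a, b):
    return _sloppy_key(a) == _sloppy_key(b)

def sloppy_contains(haystack, needle):
    return _sloppy_key(needle) in _sloppy_key(haystack)

print(sloppy_equals(u"pingüino", u"pinguino"))              # True
print(sloppy_contains(u"¡Miré un pingüino!", u"pinguino"))  # True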

-tkc
 

Roy Smith

Tim Chase said:
I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

The problem with putting fuzzy matching in the core language is that
there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching. One
popular one is jellyfish (https://pypi.python.org/pypi/jellyfish/0.1.2).
Don't expect you can just download and use it right out of the box,
however. You'll need to do a little thinking about which of the several
algorithms it includes makes sense for your application.
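
A minimal sketch of what calling it might look like (assuming the
levenshtein_distance and soundex functions it ships; exact names can
differ between releases):

import jellyfish

# Edit distance: how many single-character insertions, deletions or
# substitutions separate the two strings.
print(jellyfish.levenshtein_distance(u"Alcantar", u"Alcantara"))  # 1

# Phonetic matching: strings that sound alike map to the same code, so
# "Alcantar" and "Alkantar" get the same Soundex value.
print(jellyfish.soundex(u"Alcantar") == jellyfish.soundex(u"Alkantar"))  # True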

So, for example, you probably expect U+004E (Latin Capital Letter N) to
match U+006E (Latin Small Letter N). But what about these (all cribbed
from Wikipedia):

U+00D1  Ñ  Latin Capital Letter N with tilde
U+00F1  ñ  Latin Small Letter N with tilde
U+0143  Ń  Latin Capital Letter N with acute
U+0144  ń  Latin Small Letter N with acute
U+0145  Ņ  Latin Capital Letter N with cedilla
U+0146  ņ  Latin Small Letter N with cedilla
U+0147  Ň  Latin Capital Letter N with caron
U+0148  ň  Latin Small Letter N with caron
U+0149  ʼn  Latin Small Letter N preceded by apostrophe
U+014A  Ŋ  Latin Capital Letter Eng
U+014B  ŋ  Latin Small Letter Eng
U+019D  Ɲ  Latin Capital Letter N with left hook
U+019E  ƞ  Latin Small Letter N with long right leg
U+01CA  Ǌ  Latin Capital Letter NJ
U+01CB  ǋ  Latin Capital Letter N with Small Letter J
U+01CC  ǌ  Latin Small Letter NJ
U+0235  ȵ  Latin Small Letter N with curl
I can't even begin to guess if they should match for your application.
 

Steven D'Aprano

Wondrous, indeed. Why would anybody ever need more than one case of
the alphabet? It's almost as absurd as somebody wanting to put funny
little marks on top of their vowels.

Vwls? Wh wst tm wrtng dwn th vwls?
 

Tim Chase

The problem with putting fuzzy matching in the core language is
that there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching.
One popular one is jellyfish
(https://pypi.python.org/pypi/jellyfish/0.1.2).

Bookmarking and archiving your email for future reference.

Don't expect you can just download and use it right out of the box,
however. You'll need to do a little thinking about which of the
several algorithms it includes makes sense for your application.

I'd be content with a baseline that denormalizes and then strips out
combining diacritical marks, something akin to MRAB's

from unicodedata import normalize
"".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

and tweaking it if that was insufficient.
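
As a quick illustration of what that baseline does, and of one place it
falls short (characters such as "ø" and "ß" have no decomposition, so
the ord() filter silently drops them):

from unicodedata import normalize

def to_ascii(s):
    # NFKD splits "ü" into "u" plus a combining diaeresis; the filter
    # then keeps only the ASCII code points.
    return u"".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

print(to_ascii(u"pingüino"))            # pinguino
print(to_ascii(u"¡Miré un pingüino!"))  # Mire un pinguino!  (the ¡ goes too)
print(to_ascii(u"Søren aß"))            # Sren a  -- ø and ß simply vanish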

Thanks for the link to Jellyfish.

-tkc
 

Nobody

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

Simply ignoring diacritics won't get you very far.

Most languages which use diacritics have standard conversions, e.g.
ö -> oe, which are likely to be used by anyone familiar with the
language, e.g. when using software (or a keyboard) which can't handle
diacritics.

OTOH, others (particularly native English speakers) may simply discard the
diacritic. So to be of much use, a fuzzy match needs to handle either
possibility.
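
One way to cover both conventions is to generate candidate spellings:
run an explicit transliteration table first, with a plain
strip-the-marks pass as the fallback (a sketch; the table is only a
tiny illustrative sample):

import unicodedata

# A tiny, illustrative sample of conventional transliterations.
TRANSLIT = {u"ä": u"ae", u"ö": u"oe", u"ü": u"ue", u"ß": u"ss",
            u"Ä": u"Ae", u"Ö": u"Oe", u"Ü": u"Ue"}

def strip_marks(s):
    decomposed = unicodedata.normalize("NFKD", s)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))

def ascii_forms(s):
    """Both plausible ASCII spellings: transliterated and simply stripped."""
    transliterated = u"".join(TRANSLIT.get(c, c) for c in s)
    return {strip_marks(transliterated), strip_marks(s)}

# A query can then be matched against either form.
print(ascii_forms(u"Schrödinger"))  # contains 'Schroedinger' and 'Schrodinger'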
 

wxjmfauth

On Sunday, 27 October 2013 at 04:21:46 UTC+1, Nobody wrote:
Simply ignoring diacritics won't get you very far.

Right. As an example, these four French words:
cote, côte, coté, côté.

Most languages which use diacritics have standard conversions, e.g.
ö -> oe, which are likely to be used by anyone familiar with the
language, e.g. when using software (or a keyboard) which can't handle
diacritics.

I'm quite comfortable with Unicode, esp. with the Latin blocks.
Apart from this German case (I remember very old typewriters),
what other languages allow this kind of substitution?

Just as a reminder: there are 1272 characters considered to be Latin
characters (counting them is not a simple task), and if my knowledge
is correct, they are there to cover, to be exact, the 17 European
languages based on a Latin alphabet which cannot be covered with
iso-8859-1.

And of course, logically, they are very, very badly handled by the
Flexible String Representation.

jmf
 

Mark Lawrence

Just as a reminder: there are 1272 characters considered to be Latin
characters (counting them is not a simple task), and if my knowledge
is correct, they are there to cover, to be exact, the 17 European
languages based on a Latin alphabet which cannot be covered with
iso-8859-1.

And of course, logically, they are very, very badly handled by the
Flexible String Representation.

jmf

Please provide us with evidence to back up your statement.
 

Tim Chase

Right. As an example, these four French words:
cote, côte, coté, côté.

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability
to enter characters with diacritics searches for "cote", I want to
return possible matches containing any of your 4 examples. It's
slightly fuzzier if they search for "coté", in which case they may
mean "coté" or they might mean be unable to figure out how to
add a hat and want to type "côté". Though I'd rather get more
results, even if it has some that only match fuzzily.

Circumflexually-circumspectly-yers,

-tkc
 

Steven D'Aprano

And of course, logically, they are very, very badly handled with the
Flexible String Representation.

I'm reminded of Cato the Elder, the Roman senator who would end every
speech, no matter the topic, with "Ceterum censeo Carthaginem esse
delendam" ("Furthermore, I consider that Carthage must be destroyed").

But at least he had the good grace to present that as an opinion, instead
of repeating a falsehood as if it were a fact.
 

Steven D'Aprano

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability to
enter characters with diacritics searches for "cote", I want to return
possible matches containing any of your 4 examples. It's slightly
fuzzier if they search for "coté", in which case they may mean "coté" or
they might simply be unable to figure out how to add a hat and want to
type "côté". Though I'd rather get more results, even if they include
some that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors and
alternative spellings for any letter, not just those with diacritics.
Ideally, a good search engine would successfully match all three of
"naïve", "naive" and "niave", and it shouldn't rely on special handling
of diacritics.
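
A cheap approximation of that, using only the standard library, is to
compare accent-stripped, lower-cased forms fuzzily with difflib (a
sketch, not a real search engine):

import difflib
import unicodedata

def search_key(s):
    # Accent-stripped, lower-cased form used only for comparison.
    decomposed = unicodedata.normalize("NFKD", s)
    return u"".join(c for c in decomposed
                    if not unicodedata.combining(c)).lower()

vocabulary = [u"naïve", u"naive", u"niave", u"nave", u"knave"]
query = u"naïve"

for word in vocabulary:
    ratio = difflib.SequenceMatcher(None, search_key(query),
                                    search_key(word)).ratio()
    print(u"%-6s %.2f" % (word, ratio))

# Both spellings of naïve score 1.00 and the misspelling "niave" still
# scores 0.80, so one similarity cutoff covers accents and typos alike.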
 

wxjmfauth

On Tuesday, 29 October 2013 at 06:22:27 UTC+1, Steven D'Aprano wrote:
I'm reminded of Cato the Elder, the Roman senator who would end every
speech, no matter the topic, with "Ceterum censeo Carthaginem esse
delendam" ("Furthermore, I consider that Carthage must be destroyed").

But at least he had the good grace to present that as an opinion, instead
of repeating a falsehood as if it were a fact.

0.26411553466961735

If you understand the coding of characters, Unicode and what this FSR
does, it is child's play to produce gazillions of examples like this.

(Notice the use of a Dutch character instead of a boring €.)

jmf
 

Tim Chase

0.26411553466961735

That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."

Is this not what you intended to demonstrate? 'cuz that sounds
like a pretty awesome optimization to me.

-tkc
 

wxjmfauth

On Tuesday, 29 October 2013 at 16:52:49 UTC+1, Tim Chase wrote:
That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."

Is this not what you intended to demonstrate? 'cuz that sounds
like a pretty awesome optimization to me.

-tkc

--------

That's very naive. In fact, what happens is just the opposite. The
"best case" with the FSR is worse than the "worst case" without the FSR.

And this is without even counting the effect of this poor Python
spending its time switching from one internal representation to
another, not to mention the fact that this has to be tested every time.
The more Unicode manipulations one applies, the more time it demands.

Two tasks that come to mind: re and normalization. It's very
interesting to observe what happens when one normalizes Latin text
and polytonic Greek text, both with plenty of diacritics.

----

Something different, based on my previous example.

What is a European user supposed to think when she/he sees she/he can
be "penalized" by such an amount, simply for using non-ASCII characters
in a product which is supposed to be "Unicode compliant"?

jmf
 

Mark Lawrence

On Tuesday, 29 October 2013 at 16:52:49 UTC+1, Tim Chase wrote:

--------

That's very naive. In fact, what happens is just the opposite. The
"best case" with the FSR is worse than the "worst case" without the FSR.

And this is without even counting the effect of this poor Python
spending its time switching from one internal representation to
another, not to mention the fact that this has to be tested every time.
The more Unicode manipulations one applies, the more time it demands.

Two tasks that come to mind: re and normalization. It's very
interesting to observe what happens when one normalizes Latin text
and polytonic Greek text, both with plenty of diacritics.

----

Something different, based on my previous example.

What is a European user supposed to think when she/he sees she/he can
be "penalized" by such an amount, simply for using non-ASCII characters
in a product which is supposed to be "Unicode compliant"?

jmf

Please provide hard evidence to support your claims or stop posting this
ridiculous nonsense. Give us real world problems that can be reported
on the bug tracker, investigated and resolved.
 

Piet van Oostrum

Mark Lawrence said:
Please provide hard evidence to support your claims or stop posting this
ridiculous nonsense. Give us real world problems that can be reported
on the bug tracker, investigated and resolved.

I think it is much better just to ignore this nonsense instead of asking for evidence you know you will never get.
 

Chris Angelico

You've stated above that logically Unicode is badly handled by the FSR. You
then provide a trivial timing example. WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas, rather than having even timings all along. But the FSR actually
has some distinct benefits even in the areas he's citing - watch this:
0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for a non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:
2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!
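
The sort of comparison being made can be sketched with timeit along
these lines (illustrative only: the strings and loop counts here are
made up, and absolute timings vary by machine and Python version):

import timeit

ascii_setup = "text = 'x' * 100000"        # stored 1 byte per character
wide_setup = "text = 'x' * 100000 + '€'"   # the '€' forces 2 bytes per character

# Searching an all-ASCII string for a character outside its range: the FSR
# records each string's widest character, so this can answer without a scan.
print(timeit.timeit("'€' in text", setup=ascii_setup, number=1000))

# The same search in a string that really contains a 2-byte character has
# to walk the whole string, so it takes noticeably longer.
print(timeit.timeit("'€' in text", setup=wide_setup, number=1000))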

ChrisA
 
