String comparison problem

Henri

Hi,

How would one go about comparing two strings, one of which may contain
HTML entities (e.g. "cassé" and "cass&#233;")?
I tried to take the second string and do a replace whenever such
entities are encountered, but this implies creating some sort of lookup
table containing, if not all, then a good number of entity codes. Unless
I am mistaken, JavaScript does not have any function to replace an
entity-infested string with a decoded string, the way PHP's
html_entity_decode does. Another way, probably better (but I don't know),
would be to encode the first string.

Any ideas?

Thanks
 
VK

Henri said:
How would one go about comparing 2 strings one of which may contain
special entities (eg "cassé" and "cass&#233;")?

Unless there is some Google Groups server "optimization" here, I see
in the first case a word containing the character é (e with acute
accent) and in the second case a word containing the numeric HTML
entity "#233". If so, these are two completely different issues.
JavaScript operates in Unicode, so it internally sees any string
literal as a Unicode sequence, no matter what the actual page encoding
is. If you need to sort and transform strings according to the current
locale, use the locale-specific string methods:
string1.localeCompare(string2)
and
toLocaleLowerCase()
toLocaleUpperCase()
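
A minimal sketch of the locale-aware methods mentioned above. The "fr" locale tag and the variable names are illustrative assumptions, and the options argument to localeCompare relies on an engine with Intl support:

```javascript
var a = "cassé";
var b = "casse";

// Strict equality compares the strings code unit by code unit.
var strictlyEqual = (a === b); // false

// localeCompare returns a negative number, zero, or a positive number.
var order = a.localeCompare(b, "fr");

// With sensitivity "base", differences in accents and case are ignored.
var sameBase = a.localeCompare(b, "fr", { sensitivity: "base" }) === 0;

// Locale-aware upper-casing keeps the accent.
var upper = a.toLocaleUpperCase("fr"); // "CASSÉ"
```

Note that none of these methods decode entities; they only help once both strings are already plain Unicode text.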

In the second case (with the HTML entity) it all depends on where you
are retrieving the string from. If you are getting it from the content
of a loaded page, then by the time you retrieve it the entities are
already parsed, so for JavaScript it is the same Unicode string as in
the first case and no extra transformation is needed.
If it is a string literal "cass&#233;" then for JavaScript it is just
the character sequence "c-a-s-s-&-#-2-3-3-;" and it has nothing to do
with "cassé". In this case either use a RegExp to replace entities from
a custom table, or insert the string into a (hidden) HTML element and
read back the parsed value.
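
The two options just described could look roughly like this. This is only a sketch: ENTITY_TABLE, decodeWithTable and decodeWithDom are hypothetical names, the table is deliberately tiny, and the second function assumes it runs in a browser where document exists.

```javascript
// Hypothetical lookup table: extend it with whatever entities your data uses.
var ENTITY_TABLE = {
  "&#233;": "é",
  "&eacute;": "é",
  "&#232;": "è",
  "&egrave;": "è",
  "&#224;": "à",
  "&agrave;": "à"
};

// Option 1: RegExp plus custom table. Entities missing from the table
// are left unchanged.
function decodeWithTable(text) {
  return text.replace(/&#?\w+;/g, function (entity) {
    return Object.prototype.hasOwnProperty.call(ENTITY_TABLE, entity)
      ? ENTITY_TABLE[entity]
      : entity;
  });
}

// Option 2: let the browser's HTML parser do the decoding.
// Browser only. A detached <textarea> is used because its content is
// parsed as text (entities are decoded, tags are not turned into elements).
function decodeWithDom(text) {
  var holder = document.createElement("textarea");
  holder.innerHTML = text;
  return holder.value;
}
```

For example, decodeWithTable("cass&#233;") gives "cassé". The table approach works anywhere but only knows the entities you list; the DOM approach knows every entity the browser does but needs a page to run in.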
 
Henri

VK said:
If it is a string literal "cass&#233;" then for JavaScript it is just
the character sequence "c-a-s-s-&-#-2-3-3-;" and it has nothing to do
with "cassé". In this case either use a RegExp to replace entities from
a custom table, or insert the string into a (hidden) HTML element and
read back the parsed value.

That's the case, and I've started experimenting with the replace
function. Calling, for instance, str.replace(/&#233;/,"é") does produce
a "normalized" string. I have to generalize this in order to take most
accented characters into account.
Thank you for your response.
 
Henri

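To replace an entity-encoded string by its decoded equivalent:

// Named decodeEntities rather than normalize: current engines ship their own
// String.prototype.normalize (Unicode normalization), which should not be overwritten.
String.prototype.decodeEntities = function () {
  // The g flag replaces every decimal numeric entity, not just the first one.
  return this.replace(/&#([0-9]{1,7});/g, function (match, code) {
    var n = parseInt(code, 10);
    // Leave out-of-range values untouched; String.fromCodePoint would throw on them.
    return n <= 0x10FFFF ? String.fromCodePoint(n) : match;
  });
};

If s = "cass&#233;", then s.decodeEntities() returns "cassé".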
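
To come back to the original question of comparing the two strings, here is a sketch that decodes both sides before comparing. The function names decodeNumericEntities and sameText are hypothetical, and it assumes only numeric references (decimal or hexadecimal) need handling, not named ones like &eacute;:

```javascript
// Decode decimal (&#233;) and hexadecimal (&#xE9;) numeric references.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));/g,
    function (match, hex, dec) {
      var codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    });
}

// Hypothetical helper: decode both strings, then compare them exactly.
function sameText(a, b) {
  return decodeNumericEntities(a) === decodeNumericEntities(b);
}
```

With this, sameText("cassé", "cass&#233;") is true. A standalone function is used here rather than a String.prototype extension so that it cannot collide with built-in methods.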

Henri
 
