String comparison problem

Henri · Jun 1, 2007

Hi,

How would one go about comparing 2 strings one of which may contain
special entities (e.g. "cassé" and "cass&#233;")?
I tried to find a way to take the second string and do a replace
whenever such entities are encountered but this implies creating some
sort of lookup table containing not all but a good number of entity
codes. Unless I am mistaken, JavaScript does not have any function to replace
an entity-infested string with a decoded string, much like PHP's
html_entity_decode. Another way, probably better (but I don't know),
would be to encode the first string.

Any ideas?

Thanks

VK · Jun 2, 2007

Henri said:
How would one go about comparing 2 strings one of which may contain
special entities (e.g. "cassé" and "cass&#233;")?

Unless there is some Google Groups server "optimization" here, I see
in the first case a word containing the character é (e with an acute
accent) and in the second case a word containing the numeric HTML entity
"&#233;". If so, these are two completely different issues.
JavaScript operates in Unicode, so it internally sees any string
literal as a Unicode sequence, no matter what the actual page encoding
is. If you need to sort and transform strings according to the current
locale, use the locale-specific string manipulation methods:
string1.localeCompare(string2)
and
toLocaleLowerCase()
toLocaleUpperCase()
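
As a rough illustration of the methods listed above (the variable names and sample values are made up for this sketch, and the ordering of accented letters can depend on the locale):

```javascript
// Illustrative values only; how "é" sorts against "e" depends on the locale.
var a = "cassé";
var b = "casse";

// localeCompare returns a negative number, zero, or a positive number.
var order = a.localeCompare(b);

// Two identical Unicode strings compare as equal (result 0).
var same = a.localeCompare("cass\u00e9") === 0;

// Locale-aware case conversion keeps the accent.
var upper = a.toLocaleUpperCase();
```

Note that none of these methods decode HTML entities; they only help once both strings are already plain Unicode text.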

In the second case (with the HTML entity) it all depends on where you are
retrieving this string from. If you are getting it from the content of a
loaded page, then by the time you retrieve it the entities are
already parsed, so for JavaScript it is the same Unicode string as in
the first case and you don't need to bother with any extra transformation.
If it is a string literal "cass&#233;", then for JavaScript
it is just the character sequence "c-a-s-s-&-#-2-3-3-;" and it has
nothing to do with "cassé". In this case either use a RegExp to replace
entities from a custom lookup table, or insert the string into a (hidden) HTML
element and read back the parsed value.
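
The RegExp-plus-table option might look something like the sketch below. The names ENTITY_TABLE and decodeWithTable are hypothetical, and the table is deliberately tiny; a usable one would need many more entity names.

```javascript
// Hypothetical, deliberately small lookup table of named entities.
var ENTITY_TABLE = {
  eacute: "é",
  egrave: "è",
  agrave: "à",
  ccedil: "ç",
  amp: "&"
};

// Replaces decimal references (&#233;) and the named entities listed above.
// Anything not recognised is left exactly as it was.
function decodeWithTable(text) {
  return text.replace(/&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g, function (match, body) {
    if (body.charAt(0) === "#") {
      // Decimal reference: turn the digits into a character
      // (String.fromCharCode only covers code points up to 0xFFFF).
      return String.fromCharCode(parseInt(body.slice(1), 10));
    }
    // Named entity: look it up, falling back to the original text.
    return Object.prototype.hasOwnProperty.call(ENTITY_TABLE, body)
      ? ENTITY_TABLE[body]
      : match;
  });
}
```

The hidden-element approach avoids maintaining a table, but it needs a browser DOM, so it is not shown here.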

Henri · Jun 2, 2007

VK said:
If it is a string literal "cass&#233;", then for JavaScript
it is just the character sequence "c-a-s-s-&-#-2-3-3-;" and it has
nothing to do with "cassé". In this case either use a RegExp to replace
entities from a custom lookup table, or insert the string into a (hidden) HTML
element and read back the parsed value.

That's the case and I've started experimenting with the replace
function. Calling, for instance, str.replace(/&#233;/,"é") does produce
a "normalized" string. I have to generalize this in order to be able to
take into account most accented characters.
Thank you for your response.

Henri · Jun 2, 2007

VK said:
If it is a string literal "cass&#233;", then for JavaScript
it is just the character sequence "c-a-s-s-&-#-2-3-3-;" and it has
nothing to do with "cassé". In this case either use a RegExp to replace
entities from a custom lookup table, or insert the string into a (hidden) HTML
element and read back the parsed value.

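To replace an entity-encoded string with its decoded equivalent:

// Named decodeNumericEntities rather than normalize, because current
// engines define a built-in String.prototype.normalize (Unicode normalization).
String.prototype.decodeNumericEntities = function () {
  // The g flag replaces every decimal entity, not just the first one.
  return this.replace(/&#([0-9]{1,7});/g,
    function (match, digits) {
      // digits is the captured decimal code, e.g. "233" for é.
      // String.fromCharCode only covers code points up to 0xFFFF.
      return String.fromCharCode(parseInt(digits, 10));
    }
  );
};

If s = "cass&#233;", then s.decodeNumericEntities() returns "cassé".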
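
For the comparison asked about at the top of the thread, one possible sketch is to decode both sides and then compare them. The function names decodeNumericRefs and sameAfterDecoding are made up here, and handling hexadecimal references (&#xE9;) is an addition that was not discussed above.

```javascript
// Hypothetical helper: decodes decimal (&#233;) and hexadecimal (&#xE9;)
// numeric references. Named entities such as &eacute; are not handled.
function decodeNumericRefs(text) {
  return text.replace(/&#([xX][0-9a-fA-F]+|[0-9]+);/g, function (match, code) {
    var isHex = code.charAt(0).toLowerCase() === "x";
    var value = isHex ? parseInt(code.slice(1), 16) : parseInt(code, 10);
    // String.fromCharCode only covers code points up to 0xFFFF.
    return String.fromCharCode(value);
  });
}

// Decodes both strings first, so it does not matter which side holds the entities.
function sameAfterDecoding(left, right) {
  return decodeNumericRefs(left) === decodeNumericRefs(right);
}
```

With these definitions, sameAfterDecoding("cassé", "cass&#233;") is true, while sameAfterDecoding("cassé", "casse") is false.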

Henri
