Fixing high end characters - esthetically

Guest

Is there a nicer, more esthetic way to scan a string and replace all
characters with a decimal value higher than 159 with &#---; entities, as
required by HTML?

This seems so terribly clumsy (although quite functional).

function highremove(asd) {
work="" ;
for (j=0; j<asd.length; j++) {
temp = asd.charCodeAt(j) ;
work+= (temp > 159) ? "&#"+temp+";" : asd.charAt(j) ; }
asd=work ;
return asd ; }

TIA

K
 
Stevo

Is there an nicer, more esthetic, way to scan a string and replace all
characters with higher than 159 decimal value with &#---; entities as
required by html.

This seems so terribly clumsy (although quite functional).

More clumsy than you realize.
function highremove(asd) {
work="" ;

Well, that just overwrote any global variable called work if there is one.
for (j=0; j<asd.length; j++) {

and that overwrote any global variable called j if there is one.
temp = asd.charCodeAt(j) ;
work+= (temp > 159) ? "&#"+temp+";" : asd.charAt(j) ; }
asd=work ;
return asd ; }

That's awesome. asd=work; return asd; Whoever wrote that is a genius at
code efficiency. Before you consider using it, try this version:

function removeHigh(str)
{
var retstr="";
for(var c,i=0;i<str.length;i++)
{
c=str.charCodeAt(i);
retstr+= c>159 ? ("&#"+c+";"):str.charAt(i);
}
return retstr;
}
 
Eric Bednarz

Is there an nicer, more esthetic, way to scan a string and replace all
characters with higher than 159 decimal value with &#---; entities as
required by html.

1) ‘&#’ starts a ‘*character* reference’, not an ‘entity *reference*’.
2) HTML requires no such thing, since its document character set is the
Universal Character Set (ISO 10646), which is equivalent to
Unicode. The only problem you might face is encoding.
3) If you are paranoid about broken proxies or something like that you’d
have to do that server-side.
 
Evertjan.

Stevo wrote on 19 okt 2009 in comp.lang.javascript:
That's awesome. asd=work; return asd; Whoever wrote that is a genius at
code efficiency. Before you consider using it, try this version:

function removeHigh(str)
{
var retstr="";
for(var c,i=0;i<str.length;i++)
{
c=str.charCodeAt(i);
retstr+= c>159 ? ("&#"+c+";"):str.charAt(i);
}
return retstr;
}

For speed consider using an array:

<script type='text/javascript'>

function removeHigh(str) {
var arr = str.split('');
var len = arr.length;
var c;

for(var i = 0; i<len; i++) {
if ((c = arr[i].charCodeAt(0))>159)
arr[i] = "&#"+c+";";
};
return arr.join('');
};

alert(removeHigh('aa€'));

</script>
 
wilq

Stevo wrote on 19 okt 2009 in comp.lang.javascript:
That's awesome. asd=work; return asd; Whoever wrote that is a genius at
code efficiency. Before you consider using it, try this version:
function removeHigh(str)
{
   var retstr="";
   for(var c,i=0;i<str.length;i++)
   {
     c=str.charCodeAt(i);
     retstr+= c>159 ? ("&#"+c+";"):str.charAt(i);
   }
   return retstr;
}

For speed consider using an array:

<script type='text/javascript'>

function removeHigh(str) {
  var arr = str.split('');
  var len = arr.length;
  var c;

  for(var i = 0; i<len; i++) {
    if ((c = arr[i].charCodeAt(0))>159)
      arr[i] = "&#"+c+";";
    };
  return arr.join('');

};

alert(removeHigh('aa€'));

</script>


Depending on string length, the first version of the code might be faster
than the second. I would advise testing both solutions on a representative
collection of strings.
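
A rough way to run such a comparison is sketched below. The function names, the sample strings and the number of rounds are my own assumptions, not anything posted earlier in the thread, and the timings will vary a lot between browsers.

```javascript
// Hypothetical timing harness; all names here are illustrative only.
function concatVersion(str) {
  var out = "";
  for (var c, i = 0; i < str.length; i++) {
    c = str.charCodeAt(i);
    out += c > 159 ? "&#" + c + ";" : str.charAt(i);
  }
  return out;
}

function arrayVersion(str) {
  var arr = str.split("");
  for (var c, i = 0; i < arr.length; i++) {
    if ((c = arr[i].charCodeAt(0)) > 159) {
      arr[i] = "&#" + c + ";";
    }
  }
  return arr.join("");
}

// Run fn on every sample `rounds` times and return the elapsed milliseconds.
function timeIt(fn, samples, rounds) {
  var start = new Date().getTime();
  for (var r = 0; r < rounds; r++) {
    for (var s = 0; s < samples.length; s++) {
      fn(samples[s]);
    }
  }
  return new Date().getTime() - start;
}

// Assumed sample set: empty, short, long plain text, long high-character text.
var samples = [
  "",
  "aa\u20ac",
  new Array(1001).join("plain ascii text "),
  new Array(1001).join("\u00a3\u20ac")
];

var concatMs = timeIt(concatVersion, samples, 20);
var arrayMs = timeIt(arrayVersion, samples, 20);
console.log("concatenation: " + concatMs + " ms, array: " + arrayMs + " ms");
```

Swapping in strings taken from real pages would give a better picture than these synthetic ones.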
 
Evertjan.

Evertjan. wrote on 19 okt 2009 in comp.lang.javascript:
Stevo wrote on 19 okt 2009 in comp.lang.javascript:
That's awesome. asd=work; return asd; Whoever wrote that is a genius
at code efficiency. Before you consider using it, try this version:

function removeHigh(str)
{
var retstr="";
for(var c,i=0;i<str.length;i++)
{
c=str.charCodeAt(i);
retstr+= c>159 ? ("&#"+c+";"):str.charAt(i);
}
return retstr;
}

For speed consider using an array:

<script type='text/javascript'>

function removeHigh(str) {
var arr = str.split('');
var len = arr.length;
var c;

for(var i = 0; i<len; i++) {
if ((c = arr[i].charCodeAt(0))>159)
arr[i] = "&#"+c+";";
};
return arr.join('');
};

alert(removeHigh('aa€'));

</script>


But why not use regex:

<script type='text/javascript'>

function removeHigh(str) {
// \u escapes are hexadecimal: 159 decimal is 9f hex
return str.replace(/([^\u0000-\u009f])/g,
function(a){return "&#"+a.charCodeAt(0)+";"})
};

alert(removeHigh('aa€'));

</script>

[Yes, I know it is not a HTML requirement]
 
Evertjan.

wilq wrote on 20 okt 2009 in comp.lang.javascript:
Stevo wrote on 19 okt 2009 in comp.lang.javascript:
That's awesome. asd=work; return asd; Whoever wrote that is a genius at
code efficiency. Before you consider using it, try this version:
function removeHigh(str)
{
   var retstr="";
   for(var c,i=0;i<str.length;i++)
   {
     c=str.charCodeAt(i);
     retstr+= c>159 ? ("&#"+c+";"):str.charAt(i);
   }
   return retstr;
}

For speed consider using an array:

<script type='text/javascript'>

function removeHigh(str) {
  var arr = str.split('');
  var len = arr.length;
  var c;

  for(var i = 0; i<len; i++) {
    if ((c = arr[i].charCodeAt(0))>159)
      arr[i] = "&#"+c+";";
    };
  return arr.join('');

};

alert(removeHigh('aa€'));

</script>


[please do not quote signatures on usenet]
Depending on string length, the first version of the code might be faster
than the second. I would advise testing both solutions on a representative
collection of strings.

I doubt that, especially if the number of high characters is low.

With a long string length the string concatenation will become terribly
slow!
 
Dr J R Stockton

In comp.lang.javascript message <o8epd5103tmic0ddaqlr7c9binqc017k9o@4ax.com>,
Mon, 19 Oct 2009 15:12:13, (e-mail address removed) posted:
Is there an nicer, more esthetic, way to scan a string and replace all
characters with higher than 159 decimal value with &#---; entities as
required by html.

This seems so terribly clumsy (although quite functional).

function highremove(asd) {
work="" ;
for (j=0; j<asd.length; j++) {
temp = asd.charCodeAt(j) ;
work+= (temp > 159) ? "&#"+temp+";" : asd.charAt(j) ; }
asd=work ;
return asd ; }

This might seem less clumsy :

function highremove(asd) {
return asd.replace(/[\u00a0-\uffff]/g,
function(a) { return "&#"+a.charCodeAt(0)+";"} ) }

And a simple test indicated, in FF3.0, that it is perhaps slightly
faster for a zero-length string, over 50 times faster for a string of 60
digits, 2.5 times faster for a string of 60 £ (GBP) characters, twice as
fast for 360 "£" or 60 "€" euros.

There seems to be a conflict between JavaScript and
Unicode : the former considers ffff to represent a
character while Unicode thinks that it is not a character.
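
A related wrinkle, offered as an aside rather than as part of the test above: charCodeAt() returns 16-bit code units, so a character beyond ffff is stored as two surrogates and would come out as two references, neither of which is a valid character. The sketch below assumes a much newer engine than FF3.0 (it needs the "u" RegExp flag and codePointAt from ES2015), and the function name is hypothetical.

```javascript
// Hypothetical variant: works per code point rather than per 16-bit code unit.
// Assumes an ES2015-capable engine (the "u" flag and String.prototype.codePointAt).
function highremoveByCodePoint(str) {
  return str.replace(/[^\u0000-\u009f]/gu, function (ch) {
    return "&#" + ch.codePointAt(0) + ";";
  });
}

// U+1D11E (musical symbol G clef) is stored as the surrogate pair D834 DD1E.
var clef = "\ud834\udd1e";
console.log(clef.length);                 // 2 code units
console.log(highremoveByCodePoint(clef)); // &#119070;
```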

It's a good idea to read the newsgroup c.l.j and its FAQ. See below.
 
Lasse Reichstein Nielsen

Evertjan. said:
With a long string length the string concatenation will become terribly
slow!

Most Javascript implementations have efficient string concatenation.
IE is, of course, the exception, having quadratic behavior.
I don't recommend running this in IE:

var x = "x";
for (var i = 0; i <= 28; i++) {
x += x + "y" + x;
}
alert(x.length);

(Took a while before anything responded again :)

/L
 
Dr J R Stockton

Lasse Reichstein Nielsen said:
Most Javascript implementations have efficient string concatenation.
IE is, of course, the exception, having quadratic behavior.
I don't recommend running this in IE:

Some ways of getting a long string are testable at
<URL:http://www.merlyn.demon.co.uk/js-misc0.HTM#MLS>.

Untested; but it seems possible that some RegExp method might be fast,
since the work is done internally rather than on overt script.
 
Evertjan.

Dr J R Stockton wrote on 20 okt 2009 in comp.lang.javascript:
function highremove(asd) {
return asd.replace(/[\u00a0-\uffff]/g,
function(a) { return "&#"+a.charCodeAt(0)+";"} ) }

And a simple test indicated, in FF3.0, that it is perhaps slightly
faster for a zero-length string, over 50 times faster for a string of 60
digits, 2.5 times faster for a string of 60 £ (GBP) characters, twice as
fast for 360 "£" or 60 "€" euros.

That is why [^\u0000-\u009f] is more "logical" than [\u00a0-\uffff].

Would it be slower?
 
Guest

Thanks for your replies.

I found the regexp with a function for the string replacement, described by
Evertjan and Dr J R Stockton, most esthetic and instructional. I had no
idea that the matched character string was available as an argument for a
function to define the replacement string. It certainly is a slick way of
doing the job. A minor point is that the string to match is in hex rather
than decimal.
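
For my own notes, here is a small sketch of what the replacement function is handed. As far as I can tell, when the pattern has no capture groups the arguments are the matched text, its offset, and the whole string. The helper name describeMatches is made up for illustration.

```javascript
// Sketch: the replacement callback receives the matched text first,
// then the offset of the match, then the whole input string.
function describeMatches(str) {
  var seen = [];
  str.replace(/[\u00a0-\uffff]/g, function (match, offset, whole) {
    seen.push(match + "@" + offset + " -> &#" + match.charCodeAt(0) + ";");
    return match; // leave the text unchanged; we only record what was passed in
  });
  return seen;
}

// Records one entry per high character: the pound sign at offset 1, the euro at offset 3.
console.log(describeMatches("a\u00a3b\u20ac"));
```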

Evertjan's idea of splitting, editing and then joining in order to
speed up the process strikes me as strange, and it does nothing for the
esthetics. Not only does the computer have to build the array, but it needs
an index of some sort to keep track of things. Then the reassembly is the
same work as the original character-by-character method. Of course, much
depends on exactly how things are implemented in detail. In the distant
past, I knew of cases where several very tight loops were faster than one
longer loop because of the reduced memory accesses. I speak here with
little knowledge of actual details.

Stevo's comments are not useful at all, and he misses the point by worrying
about my variable names matching those of global variables. There was no
need for me to discuss that here. He is correct that the extra copy is
redundant, but he does not consider that it may be necessary in a longer
context which did not need to be shown here.

Bednarz corrects my imprecise use of terminology. Thank you. I was also
sloppy about saying HTML rather than XHTML, for I have encountered some
requirement (perhaps incorrectly -- I need to check it out) that all high
end characters need to be character entities even when using utf-8. Also,
converting to &# form assures that the result is independent of my
computer's internal representation and will be correct on all computers.
There is also the situation, however, where one might not have full control
over the final page and needs to deal with an explicitly more limited
character set. E.g., I might supply the variable content of a page where
others provide the envelope with a limited character set.

Again thank you for taking the time and effort to respond.

K.
 
Evertjan.

K wrote on 20 okt 2009 in comp.lang.javascript:
Evertjan's idea of splitting, editing and then joining in order to
speed up the process strikes me as strange, and it does nothing for the
esthetics.

In all but the most modern javascript engines,
repeated string concatenation is very slow,
as the string needs to be copied in full every time.

Furthermore, the array solution only needs to write to the places where a
>159 character is found, which in a usual text string would be seldom.

The split and join are relatively very fast and need to be done only twice
together, while the concatenation is in the loop, also for normal
characters.

Esthetics? De gustibus non disputandum.
A minor point is that the string to match is in hex rather
than decimal.

True.

result = "&#" + (decNumb+65536).toString(16).substr(1) + ";";
described by Evertjan and Dr J R Stockton

Strange using John's and my name,
while we cannot describe your entry by even a nickname.
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>
Dr J R Stockton wrote on 20 okt 2009 in comp.lang.javascript:
function highremove(asd) {
return asd.replace(/[\u00a0-\uffff]/g,
function(a) { return "&#"+a.charCodeAt(0)+";"} ) }

And a simple test indicated, in FF3.0, that it is perhaps slightly
faster for a zero-length string, over 50 times faster for a string of 60
digits, 2.5 times faster for a string of 60 £ (GBP) characters, twice as
fast for 360 "£" or 60 "€" euros.

That is why [^\u0000-\u009f] is more "logical" than [\u00a0-\uffff].

Would it be slower?

Both describe the same set of JavaScript characters. If a literal
implementation of one would run faster than a literal implementation of
the other, then a respectable RegExp system should use only the better
way.

One test would be to compare for speed [a-bc-de-f ... y-z] with [a-z].
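
The claim that both classes describe the same set can at least be checked by brute force. This is a minimal sketch that tests every 16-bit code unit against both patterns; the variable names are arbitrary, and it says nothing about speed.

```javascript
// Check every 16-bit code unit against both character classes.
var negated = /[^\u0000-\u009f]/;
var ranged = /[\u00a0-\uffff]/;
var mismatches = 0;
for (var code = 0; code <= 0xffff; code++) {
  var ch = String.fromCharCode(code);
  if (negated.test(ch) !== ranged.test(ch)) {
    mismatches++;
  }
}
console.log("mismatches: " + mismatches); // 0 if the two classes are equivalent
```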
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>
The split and join are relatively very fast and need to be done only twice
together, while the concatenation is in the loop, also for normal
characters.

But split of an N-character string requires creation of N Objects, and
join will here result in their eventual disposal.

Let the timings be given.
Esthetics? De gustibus non disputandum.


True.

Is it? I used to write &#160; for non-breaking space, and the HTML source
A&#160;B&#00A0;C shows me A B?A0;C, where the ? in News
represents a brick with four dots on it.
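
For what it is worth, a hexadecimal character reference needs an "x" after the "&#"; without it the digits are read as decimal. A minimal sketch, with function names that are mine and purely illustrative:

```javascript
// Decimal form: &#160;   Hexadecimal form: &#xa0;  (note the "x")
function toDecimalRef(code) {
  return "&#" + code + ";";
}

function toHexRef(code) {
  return "&#x" + code.toString(16) + ";";
}

console.log(toDecimalRef(160)); // &#160;
console.log(toHexRef(160));     // &#xa0;
```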
 
Guest

I've done some timing tests of the three methods with both IE7 (7.0.6001)
and FF 3.5.3.

Preliminary results:
1: FF is far, far faster than IE in all cases. Ferrari vs. Model-T is
what comes to mind.

2: The RegExp method is the fastest but also the most sensitive to the number
of conversions required. This is probably due to the need to use a function to
get the matched character -- is there a way to avoid that?

3: IE, as previously mentioned, takes an excessive time to copy a
string character by character.

I hope to refine the results now that I see what is important and to try a
couple of other browsers.


A minor point is that the string to match is in hex rather than decimal.
Let me clarify this -- from testing I find that:
* In the RegExp, \u1234 is taken as hex.
* In strings (for HTML), &#1234; is taken as decimal.
* charCodeAt() gives a decimal result.
I would expect that one can force a radix change.
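
A small sketch of the radix change I have in mind, assuming only the standard toString(radix) and parseInt(string, radix). Strictly, charCodeAt() returns a plain number; decimal is just how it prints by default.

```javascript
var ch = "\u04d2";                  // hex escape, as in a string or RegExp literal
var code = ch.charCodeAt(0);        // a plain number
console.log(code);                  // 1234
console.log(code.toString(16));     // 4d2
console.log(parseInt("04d2", 16));  // 1234
console.log("&#" + code + ";");                // decimal character reference
console.log("&#x" + code.toString(16) + ";");  // hexadecimal character reference
```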

Strange using John's and my name, while we cannot describe
your entry by even a nickname.
I use the signature or 'reply to' to get a name. You could reply to "K",
"Kral" or "Dr. Kral".

K.
 
Evertjan.

wrote on 22 okt 2009 in comp.lang.javascript:
2: The RegExp method is the fastest but also the most sensitive to the
number of conversions required. This is probably due to the need to use a
function to get the matched character -- is there a way to avoid that?

Why would you want to spow down whn there are no vonversions to be made?
 
Evertjan.

Dr J R Stockton wrote on 22 okt 2009 in comp.lang.javascript:
Is it? I used to write &#160; for non-breaking space, and the HTML source
A&#160;B&#00A0;C shows me A B?A0;C, where the ? in News
represents a brick with four dots on it.

Of course, my mistake.
 
Evertjan.

Evertjan. wrote on 22 okt 2009 in comp.lang.javascript:
wrote on 22 okt 2009 in comp.lang.javascript:


Why would you want to spow down whn there are no vonversions to be made?

Please, what did I write here? ;-)

"Why would you want to slow down when there are no conversions to be made?"

[sounds like a priest at work]
 
Dr J R Stockton

In comp.lang.javascript message <pebvd5l518063onuc9el5db827ic4geido@4ax.com>,
Wed, 21 Oct 2009 21:01:24, (e-mail address removed) posted:
I've done some timing tests of the three methods with both IE7 (7.0.6001)
and FF 3.5.3.

Preliminary results:
1: FF is far, far faster than IE in all cases. Ferrari vs. Model-T is
what comes to mind.

Try Chrome. On the page I'm working on, it's so fast that by the time
the first stage is visibly finished, the other stages have been done
too.
2: The RegExp method is the fastest but also the most sensitive to the number
of conversions required. This is probably due to the need to use a function to
get the matched character -- is there a way to avoid that?

I don't expect that defining the function externally, outside the RegExp
part, will make it faster, or much slower. But it's probably worth
checking that.
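
For anyone who wants to check, the externally defined form might look like the sketch below. The names toCharRef, HIGH_CHARS and highremoveNamed are hypothetical, and I make no claim about its speed relative to the inline version.

```javascript
// Replacer defined once, outside the replace() call (names are illustrative).
function toCharRef(ch) {
  return "&#" + ch.charCodeAt(0) + ";";
}

var HIGH_CHARS = /[\u00a0-\uffff]/g;

function highremoveNamed(str) {
  // replace() also passes the offset and the whole string; toCharRef ignores them
  return str.replace(HIGH_CHARS, toCharRef);
}

console.log(highremoveNamed("aa\u20ac")); // aa&#8364;
```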
 
