Portable 'lowercase' function for stl string?

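For readers landing here from the title question: the conventional portable answer for ASCII data is a short `std::transform` over the string. This is a minimal sketch, assuming the string holds single-byte characters in the execution character set (the discussion below is about what happens beyond that assumption):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Portable ASCII-oriented lowercase for std::string. The cast through
// unsigned char matters: passing a negative char to std::tolower is
// undefined behavior.
std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```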
Alf P. Steinbach

* Pete Becker:
* Squeamizh:

Look, it's simple: I said that case conversions under Unicode can be
rather slow compared to straight ASCII, and Alf challenged me to prove
that they're always slower. I declined to try to prove something that I
didn't say.

Heh, I'm still following this thread... ;-)

And you're misrepresenting the earlier exchange.

You wrote, originally,

"Unicode would be a poor choice if, for example, your characters are
encoded in ASCII and you care about speed".

And you explained this by

"because of the size of the character set and the resulting complexity
of the data representation for character attributes"

Perhaps that's not what you /meant/ to write, but that's what you wrote,
and that, including the explanation that followed in the same para, was
what I asked for an example of,

"could you give an example where case conversion of an arbitrary ASCII
text is necessarily faster than the same case conversion of the same
text in fixed a size per character Unicode representation (e.g. USC2
limited to BMP, or USC4)?"
(transposition typos not intentional and not corrected here).

I can think of a case where uppercasing or lowercasing Unicode will
likely be slower than ASCII for the same text, namely for a really large
text that must be in-memory, where one encounters more paging. But that
has nothing to do with the size of the character set, nor the resulting
complexity of the data representation for character attributes. In
other cases Unicode might generally be faster than ASCII.
 
P.J. Plauger

For those of us who are interested in learning more, I'll engage.
Since you've put yourself on the spot and attested to Pete's accuracy,
would you please briefly explain how Pete's original response is
correct?

Pete has himself clarified the misreading that launched the flames,
but to cut to the chase...

The C toupper and tolower date from a simpler time when you had
at most 256 characters, each with a one-to-one mapping between
upper and lower case. Unicode has (depending on how you count)
tens of thousands to millions of characters. Even if you ignore
the possibility of one-to-many conversions (which Unicode mostly
does) you either have to maintain *huge* lookup tables or compress
them and spend time searching. Thus, one way or the other, "simply"
going to Unicode when you don't have to assuredly costs you more
code space, more execution time, or both.
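The compress-and-search trade-off described here can be sketched as a two-stage ("block") lookup table, which is one common compression scheme for Unicode property data. This is an illustration only: the first stage maps the high byte of a BMP code point to one of a few shared 256-entry delta blocks, so long runs of "maps to itself" cost one shared block instead of thousands of entries. Only the ASCII block is populated; a real table would carry many more blocks.

```cpp
#include <array>
#include <cstdint>

using DeltaBlock = std::array<int32_t, 256>;

// Shared "no mapping" block: every code point maps to itself.
inline const DeltaBlock zero_block{};

// Block 0 (U+0000..U+00FF), populated only for 'A'..'Z' -> 'a'..'z'.
inline const DeltaBlock ascii_block = [] {
    DeltaBlock b{};
    for (int c = 'A'; c <= 'Z'; ++c)
        b[c] = 32;
    return b;
}();

// Stage 1: high byte of the code point selects a block. Identical blocks
// are shared, which is where the space saving comes from.
inline const std::array<const DeltaBlock*, 256> stage1 = [] {
    std::array<const DeltaBlock*, 256> idx{};
    idx.fill(&zero_block);
    idx[0] = &ascii_block;
    return idx;
}();

// Two dependent loads per character: the extra cost compared with the
// single test-and-add that suffices for pure ASCII. BMP-only sketch.
char32_t to_lower_bmp(char32_t c)
{
    return c + static_cast<char32_t>((*stage1[(c >> 8) & 0xFF])[c & 0xFF]);
}
```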

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Pete Becker

Alf said:
You wrote, originally,

"Unicode would be a poor choice if, for example, your characters are
encoded in ASCII and you care about speed".

And you explained this by

"because of the size of the character set and the resulting complexity
of the data representation for character attributes"

Perhaps that's not what you /meant/ to write, but that's what you wrote,

Yes, that is what I wrote, and it's what I /meant/ to write and it's
what I still mean.
and that, including the explanation that followed in the same para, was
what I asked for an example of,

"could you give an example where case conversion of an arbitrary ASCII
text is necessarily faster than the same case conversion of the same
text in fixed a size per character Unicode representation (e.g. USC2
limited to BMP, or USC4)?"
(transposition typos not intentional and not corrected here).

Sigh. I did not say that case conversion in ASCII is "necessarily"
faster. It's a better choice because it won't be slower and could be
faster, depending on whether the Unicode translation is special-cased
for ASCII. And, of course, it avoids the extra code and data that full
Unicode entails.
 
roberts.noah

Pete said:
Alf P. Steinbach wrote:

Sigh. I did not say that case conversion in ASCII is "necessarily"
faster. It's a better choice because it won't be slower and could be
faster, depending on whether the Unicode translation is special-cased
for ASCII. And, of course, it avoids the extra code and data that full
Unicode entails.

Besides, your statement didn't qualify "fixed size per character".
Alf, unfairly in my opinion, altered the course of the argument in
favor of such. Encodings such as UTF8 are quite commonly used...why
did Alf specify a certain subset of encodings? I wouldn't have walked
into that trap either.
 
Alf P. Steinbach

* (e-mail address removed):
* Pete Becker:

Besides, your statement didn't qualify "fixed size per character".
Alf, unfairly in my opinion, altered the course of the argument in
favor of such. Encodings such as UTF8 are quite commonly used...why
did Alf specify a certain subset of encodings? I wouldn't have walked
into that trap either.

If you want speed you have to use an encoding that supports that.

One could argue that Pete was talking about some ASCII encoding of
Unicode, that it was the encoding method, not the character set, that
would be slow, but then the statement ("Unicode would be a poor choice"
.... [because of these Unicode attributes]) would be self-contradictory.

Generally, Unicode with a fixed-size encoding is as fast as, or nearly as fast as, text operations can get. With ASCII you have to handle individual bytes. The main question is then whether

char x = *p;
...
*p = x;

for some pointer p, is faster than, slower than, or the same as e.g.

int x = *q;
...
*q = x;

for processing each individual char, when *q is properly aligned.

On my PC they seem to be the same, because I get the same timing results
for ASCII and Unicode case conversion when I put in the assumption that
the text is in ASCII range. On some older RISC machines (perhaps some
new ones too?) byte access was reportedly slow compared to properly
aligned word access, for an individual item; the processor had to do
shifting and masking to do bytes. That must be weighed against an extra
comparison for the Unicode conversion when you don't have the
assumption of ASCII range, that extra comparison checking whether a
more general conversion must be invoked and executed when the character is
not an ASCII uppercase or lowercase letter, or whatever the subset is that is to
be converted. Also, there is the question of whether that comparison
can simply disappear, timing-wise, into the parallelism in the processor.

I don't have a RISC machine at hand to check this out...

But depending on how those factors work out, if byte access is slow,
then Pete's statement above that "it won't be slower" is simply incorrect.
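The trade-off Alf describes, an extra range check in front of a general conversion, can be sketched like this. The general path is a stub here; a real implementation would search the (possibly compressed) Unicode case-mapping tables:

```cpp
#include <cstdint>

// Stub for the slower, table-driven path. Placeholder only: a real
// implementation consults Unicode case-mapping tables.
char32_t general_to_lower(char32_t c)
{
    return c;
}

// The extra comparison under discussion: one cheap, highly predictable
// branch keeps ASCII-range text on the fast test-and-add path.
char32_t to_lower(char32_t c)
{
    if (c < 0x80)                                     // ASCII fast path
        return (c >= U'A' && c <= U'Z') ? c + 32 : c; // test and add
    return general_to_lower(c);                       // general, slower path
}
```

Whether that branch costs anything in practice is exactly the open question: on a processor that predicts it well, it can vanish into instruction-level parallelism.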
 
Ben Pope

Ivan said:
I don't think this is worth much of an argument. I assume that,
for non-ASCII, Pete was thinking of converting the case of the
many letters with diacritical marks, and those of non-latin
alphabets. This reasonably seems to require more work...

Somewhere it was proposed that the same ASCII text was encoded within
Unicode.

I would assume that there is a *possibility* of an overhead in space or
time that comes with the ability to convert Unicode, even if all of the
characters in the text are those that fit within the ASCII subset. I
also suspect that there is often a bias towards ASCII text for
operations on Unicode, and that any overhead for this case is somewhere
between negligible and non-existent.

It seems to me that Pete was arguing the possibility, and that Alf was
arguing the probability.

Ben Pope
 
JustBoo

Besides, your statement didn't qualify "fixed size per character".
Alf, unfairly in my opinion, altered the course of the argument in
favor of such.

Wow, what's that sucking sound... oh it's Noah's lips against
Pete's... well, you can figure out the rest. Apparently it's better to
be a sycophant than good in corporate America today. The "Alpha's"
love it.

politics, n: From the Latin 'poly', meaning many,
and 'tic', meaning little bloodsucking insects.
 
Ivan Vecerina

: Ivan Vecerina wrote:
: > I don't think this is worth much of an argument. I assume that,
: > for non-ASCII, Pete was thinking of converting the case of the
: > many letters with diacritical marks, and those of non-latin
: > alphabets. This reasonably seems to require more work...
....
: It seems to me that Pete was arguing the possibility, and that Alf was
: arguing the probability.


Yep. Nothing worth the stir IMO, although I do sympathize with Alf,
based on past experience ;)
 
roberts.noah

JustBoo said:
Wow, what's that sucking sound... oh it's Noah's lips against
Pete's... well, you can figure out the rest. Apparently it's better to
be a sycophant than good in corporate America today. The "Alpha's"
love it.

Have you EVER made a useful contribution to this group?
 
Michiel.Salters

Alf said:
* Pete Becker:
Alf said:
I think that [Unicode is slower-MS] is incorrect.

To convince me otherwise, could you give an example where case
conversion of an arbitrary ASCII text is necessarily faster than the
same case conversion of the same text in fixed a size per character
Unicode representation (e.g. USC2 limited to BMP, or USC4)?

Consider that ASCII is a subset of Unicode.
Case conversions in Unicode can't assume that the characters they're
dealing with are ASCII.

Well, that's not much of an example! ;-)

To quote yourself, again, "With ASCII, converting to lowercase is just a
test and an addition".

How would it be more if the same text is represented in UCS2 or UCS4?

Simple. Assume the following string L"i". That's an ASCII text encoded
as UCS2 or UCS4 (for the purposes of the discussion). If it were just
"i", the uppercase variant would be just "I". Not so in Unicode, where
the uppercase would depend on the locale, and could be a dotted
uppercase I (in Turkish).

As Pete said: you can't assume that the characters you're dealing with
are ASCII, /even if your input is ASCII/ !

HTH,
Michiel Salters
 
Alf P. Steinbach

* (e-mail address removed):
Alf said:
* Pete Becker:
Alf P. Steinbach wrote:
I think that [Unicode is slower-MS] is incorrect.

To convince me otherwise, could you give an example where case
conversion of an arbitrary ASCII text is necessarily faster than the
same case conversion of the same text in fixed a size per character
Unicode representation (e.g. USC2 limited to BMP, or USC4)?

Consider that ASCII is a subset of Unicode.

Case conversions in Unicode can't assume that the characters they're
dealing with are ASCII.
Well, that's not much of an example! ;-)

To quote yourself, again, "With ASCII, converting to lowercase is just a
test and an addition".

How would it be more if the same text is represented in UCS2 or UCS4?

Simple. Assume the following string L"i". That's an ASCII text encoded
as UCS2 or UCS4 (for the purposes of the discussion). If it were just
"i", the uppercase variant would be just "I". Not so in Unicode, where
the uppercase would depend on the locale, and could be a dotted
uppercase I (in Turkish).

Heh heh... The Turkish alphabet is broken beyond repair; there is no
general solution to the problem you have chosen as example. And
methinks you know that, and chose it for exactly that reason... :)

As Pete said: you can't assume that the characters you're dealing with
are ASCII, /even if your input is ASCII/ !

That's incorrect.

Cheers,

- Alf
 
