clarification on character handling

Chris Croughton

I think the only valid concern is that tolower(char_type) might
be invoked mistakenly, for some negative (char) value. This
won't happen for the basic character set,

Correct.

nor for the most
common codesets for *defined* character codes,

Incorrect. The most common character sets in western Europe are the
ISO-8859-x ones (ISO-8859-1 is commonly known as Latin-1; Microsoft's
Windows character sets for English-speaking versions are largely based
on that). Many of their characters have the top bit of the char set.

but could happen on some platforms if random garbage values are passed
to tolower().

Or perfectly valid national characters, in many cases typed with a single
keystroke on a national keyboard.

In practice this could occur when the character
codes come from a hostile user, for example.

They don't have to be hostile -- nor non-English-speaking. Shift-3 on a
UK keyboard (we speak English in the UK, mostly) is the British pound
sign (looks like a stylised L with a line through it), and in ISO-8859-1
that has the value 0xA3 (163 unsigned, -93 signed). It's very likely to be
typed by a user in a text field or document.

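
A minimal sketch of the pound-sign case, assuming an 8-bit char and the default "C" locale. The name lower_char is a hypothetical helper, not a standard function. Where plain char is signed, the byte 0xA3 is held as -93, which is outside the domain of tolower(); converting to unsigned char first gives 163, which is inside it.

```c
#include <ctype.h>

/* Hypothetical helper: lower-case a plain char without ever passing
   a negative value to tolower(). Returns int, like tolower() itself. */
int lower_char(char c)
{
    /* Where plain char is signed and 8 bits wide, the cast maps
       -93 to 163; where plain char is unsigned it changes nothing. */
    return tolower((unsigned char)c);
}
```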
The most likely
actual risk is denial of service due to crashing the process
with an illegal memory reference.

With potential loss of data and revenue as high as you can imagine.

The "more secure library" TR under current development by WG14

Where can I find that? It's mentioned on the JTC1/SC22/WG14-C page[1]
as the link "TR 24731: Programming language C - Specification for secure C
library functions", but the page that link leads to[2] doesn't mention it
(it does mention and provide links to the other TRs in progress).

[1] http://www.open-std.org/jtc1/sc22/wg14/
[2] http://www.open-std.org/jtc1/sc22/wg14/www/projects#24731

is meant to provide a "drop-in" (easy automated editing) way to
catch such abuses in existing, not-so-carefully-constructed
applications. The alternative is to do a better job in the
original design and coding.

A better alternative would be to (a) make plain char unsigned (some very
few non-conforming programs might have problems) or (b) extend the range
of the ctype.h functions and macros to include the range CHAR_MIN to -1
(which would waste all of 128 bytes on some systems and otherwise hurt
no one).
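
A rough sketch of what option (b) might look like, written as a wrapper rather than as a change to any real library. The names ext_tolower, ext_lower and EXT_SLOTS are hypothetical, and the int-sized entries are only there to keep the sketch short. It assumes the table can be filled once from the current locale, and it shows one wrinkle of the idea: a char whose value equals EOF cannot be told apart from EOF.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>   /* for EOF */

/* Hypothetical: one slot for every value from CHAR_MIN to UCHAR_MAX.
   With a signed 8-bit char that is 384 slots instead of 256. */
#define EXT_SLOTS (UCHAR_MAX - CHAR_MIN + 1)

static int ext_lower[EXT_SLOTS];
static int ext_ready = 0;

/* Fill the table once from the real tolower(), always passing it a
   value in the unsigned char range. Not thread-safe; sketch only. */
static void ext_init(void)
{
    int v;
    for (v = CHAR_MIN; v <= UCHAR_MAX; v++)
        ext_lower[v - CHAR_MIN] = tolower((unsigned char)v);
    ext_ready = 1;
}

/* Hypothetical extended tolower: accepts CHAR_MIN..UCHAR_MAX and EOF. */
int ext_tolower(int c)
{
    if (c == EOF)
        return EOF;   /* a char equal to EOF is indistinguishable here */
    if (c < CHAR_MIN || c > UCHAR_MAX)
        return c;     /* still out of range: returned unchanged */
    if (!ext_ready)
        ext_init();
    return ext_lower[c - CHAR_MIN];
}
```

A negative argument comes back as the lower-cased value in the unsigned char range, which is one possible choice among several.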

One could, of course, use unsigned char explicitly for all arrays -- and
then lose all of the functions in string.h (or have to cast for every
use) because they rightly cause diagnostics if called with a pointer to
unsigned char. Or use type punning or multiple pointers to the same
object, both of which are unsafe. All to get round a design flaw which
'saves' all of 128 bytes typically.

Chris C
 
Tim Rentsch

aegis said:
7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters. In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

Questions that ask "why" are often interesting questions.
Here there are several different answers, depending on what
kind of "why" is meant.

First answer: to give freedom to implementations. Saying
that calling 'tolower' on arguments outside its range results
in undefined behavior gives implementations complete latitude
to do whatever they choose to in such situations. To say
this another way: to impose a specification that is minimal.

Second answer: because it's implementationally convenient.
Other people have commented on this aspect (with array
access, etc), so I don't think I need to say any more about
that.
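
For readers who missed the array-access point, here is a minimal sketch of the kind of table lookup an implementation might use. It is not taken from any real library: sketch_table, SKETCH_TOLOWER and sketch_in_range are hypothetical names, and the layout assumes EOF == -1, which is common but not required.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>   /* for EOF */

/* Hypothetical layout: slot 0 is for EOF, slots 1..UCHAR_MAX+1 are for
   the values 0..UCHAR_MAX. Assumes EOF == -1 (common, not required). */
#define SKETCH_SLOTS (UCHAR_MAX + 2)

static int sketch_table[SKETCH_SLOTS];

/* Fill the table from the real tolower(), using in-range arguments only. */
void sketch_init(void)
{
    int i;
    sketch_table[0] = EOF;
    for (i = 0; i <= UCHAR_MAX; i++)
        sketch_table[i + 1] = tolower(i);
}

/* The unchecked lookup: one add and one load, no range test. */
#define SKETCH_TOLOWER(c) (sketch_table[(c) + 1])

/* Reports whether SKETCH_TOLOWER(c) would stay inside the table. */
int sketch_in_range(int c)
{
    return c >= -1 && c <= UCHAR_MAX;
}
```

With this layout SKETCH_TOLOWER(-10) would read sketch_table[-9], before the start of the array, so an implementation built this way has nothing sensible to promise for such an argument.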

Third answer: it's in keeping with "the spirit of C." As
the Rationale document says, C programmers expect things
to work when they do the right thing, but don't necessarily
expect any "safety net" when they do the wrong thing. The
definitions of tolower and the other <ctype.h> functions are
consistent with this philosophy.
 
