can a character be negative?

R

Rahul

Hi folks,

I have been wondering whether a char can be processed as a negative
character. I mean, if I do something like..

char c= '-2';
printf("%c",c);

The output should be -2 instead of 2.
If the answer is no, then what is the point of the existence of ...

signed char, ranging from -128 to 127?
What is that part meant for? Yes, I am talking about -128 to 0, because
there is no way we are going to get a char represented as a negative value.
Why do most of our library functions treat char values as
unsigned char, if by default all characters are unsigned?!
I hope I am able to make the situation clear ..

Please guide me.

Cheers!!
 
H

Hallvard B Furuseth

Rahul said:
I have been wondering whether a char can be processed as a negative
character. I mean, if I do something like..

char c= '-2';

That can be negative, though what you mean is c = -2; without quotes.
printf("%c",c);

That prints c as a character, not as a character code.

Anyway, negative char is quite normal outside 7-bit ASCII land.

#include <stdio.h>
int main() { printf("%d\n", 'å'); return 0; }

prints -27 on the host where I'm writing this.

That's why e.g. <ctype.h> says its functions take values in the range of
unsigned char, or EOF. Generally when you want the character code of a
char c, you should use (unsigned char) c.
 
R

Rahul

That can be negative, though what you mean is c = -2; without quotes.


That prints c as a character, not as a character code.

Anyway, negative char is quite normal outside 7-bit ASCII land.

  #include <stdio.h>
  int main() { printf("%d\n", 'å'); return 0; }

prints -27 on the host where I'm writing this.

That's why e.g. <ctype.h> says its functions take values in the range of
unsigned char, or EOF.  Generally when you want the character code of a
char c, you should use (unsigned char) c.
OK! Is that the only reason? How does this issue of making a
char signed or unsigned relate to portability?
K&R2, page 42, bottom few lines:

"For portability specify signed or unsigned if non-character data is
to be stored in char variables"

Cheers!!
 
K

Keith Thompson

Rahul said:
OK! Is that the only reason? How does this issue of making a
char signed or unsigned relate to portability?
K&R2, page 42, bottom few lines:

"For portability specify signed or unsigned if non-character data is
to be stored in char variables"

I really don't know what you're asking.

If you want to store character values, use char. If you want very
small signed numbers, use signed char. If you want very small
unsigned numbers or raw bytes, use unsigned char.

If that doesn't answer your question, you'll have to ask more clearly.
 
R

Rahul

I really don't know what you're asking.

If you want to store character values, use char.  If you want very
small signed numbers, use signed char.  If you want very small
unsigned numbers or raw bytes, use unsigned char.

If that doesn't answer your question, you'll have to ask more clearly.

--
Keith Thompson (The_Other_Keith) (e-mail address removed)  <http://www.ghoti.net/~kst>
Nokia
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

My question is simple:

1. Is there any point in defining a char variable as unsigned when I am
dealing with pure ASCII characters only?
2. In what scenarios should I define a character as unsigned?
3. Does this issue have anything to do with portability? I mean, defining
something as a signed character will fail on a machine which doesn't
support signed bytes... is that the reason we keep our char definitions
unsigned (a portability reason)?
Like if we do...

on x86 (which supports signed bytes):

char c = -2;           in memory 0xfe
unsigned char c = -2;  in memory 0xfe, but any expression which
references this will evaluate it as 254

on a machine which doesn't support signed bytes:

char c = -2;           in memory 0x02; no question of taking the two's
complement, so it stores the numeric value, ignoring the sign
unsigned char c = -2;  in memory 0xfe; it will wrap to the
positive value 254

I don't know any machine other than x86; this is my assumption
only... but K&R2 says that if char c = -2; (which is 0xfe and looks
negative): "arbitrary bit patterns stored in character variables may
appear to be negative on some machines, yet positive on others. For
portability, specify signed or unsigned if non-character data is to be
stored in char variables"

If someone could explain this at the architecture level, I would
really appreciate it.
I hope this time I have articulated what I am confused about.

Cheers!!
 
J

James Kuyper

Rahul said:
My question is simple:

1. Is there any point in defining a char variable as unsigned when I am
dealing with pure ASCII characters only?

No. As Keith said above, "If you want to store character values, use char."
2. In what scenarios should I define a character as unsigned?

You shouldn't. If you know that it is a character, use char. Only use
unsigned char if you're storing numbers rather than characters.
3. Does this issue have anything to do with portability? I mean, defining

Yes. The plain 'char' type is signed on some implementations, and an
unsigned type on others. You need to keep that possibility in mind when
you write your code. You should convert to 'unsigned char' before
passing a char value to one of the <ctype.h> macros; otherwise it's not
very difficult to avoid problems.

It's very easy to distinguish the two cases: if char is signed, then
CHAR_MIN will be negative; otherwise it will be 0.
char c = -2; in memory 0xfe

As Keith said above, if you're storing a number, you shouldn't use char.
Use signed char or unsigned char. Obviously, you need signed char if you
want to store a value of -2.
 
K

Keith Thompson

James Kuyper said:
Rahul wrote: [...]
2. In what scenarios should I define a character as unsigned?

You shouldn't. If you know that it is a character, use char. Only use
unsigned char if you're storing numbers rather than characters.

Well, mostly. The is*() and to*() functions in <ctype.h> expect
arguments representable as an unsigned char (or the value EOF).
 
B

bartc

Rahul said:
Hi folks,

I have been wondering whether a char can be processed as a negative
character. I mean, if I do something like..

char c= '-2';
printf("%c",c);

The output should be -2 instead of 2.
If the answer is no, then what is the point of the existence of ...

I doubt the output is going to be -2; it will be a single
character at most, and probably something weird (a small square block on my
machine, corresponding to code 254, which has the same bit pattern as -2).

'char' is a misnomer for this type, which is really just a very short
integer (typically 8 bits) that can be signed or unsigned.

For storing actual character data, there are apparently machines where some
character sets use negative codes, but I don't know of any. I only know
ASCII which uses +0 to +127, and various supersets which still have positive
values.
 
B

BGB / cr88192

Rahul said:
Hi folks,

I have been wondering whether a char can be processed as a negative
character. I mean, if I do something like..

char c= '-2';
printf("%c",c);

The output should be -2 instead of 2.
If the answer is no, then what is the point of the existence of ...

signed char, ranging from -128 to 127?
What is that part meant for? Yes, I am talking about -128 to 0, because
there is no way we are going to get a char represented as a negative value.
Why do most of our library functions treat char values as
unsigned char, if by default all characters are unsigned?!
I hope I am able to make the situation clear ..

simple answer:
char is normally signed (granted, not all C compilers agree to this, as a
few older/oddball compilers have made it default to unsigned).


so 'char'=='character' is a misnomer (historical accident?...) since for
most practical uses, ASCII and UTF-8 chars are better treated as unsigned
(we just use 'char' as a matter of tradition, and cast to unsigned char
wherever it matters), and for most other uses (where we want a signed byte),
thinking of 'char' as 'character' is misleading (note that there are many
cases where a signed 8-bit value actually makes some sense).


many other (newer) languages reinterpret things, typically assigning 'char'
to a larger size (most often 16, or sometimes 32 bits) and adding
byte/sbyte/ubyte/... for the 8-bit types (there is some inconsistency as to
whether 'byte' is signed or unsigned for a given language, so it depends
some on the particular language designer).

in my own uses, I typically use typedef to define 'byte' as 'unsigned char'
and 'sbyte' as 'signed char'. I also use 'u8' and 's8' sometimes.


or such...
 
K

Keith Thompson

BGB / cr88192 said:
simple answer:
char is normally signed (granted, not all C compilers agree to this, as a
few older/oddball compilers have made it default to unsigned).
[...]

Most of the compilers I've used have char signed, but I've used
several where it's unsigned (several Cray systems, SGI Irix, IBM AIX).
And char is almost certainly signed on any EBCDIC-based system.

It's safest not to think of either signed or unsigned as "normal".
Use plain char only if you don't *care* whether it's signed or
unsigned.
 
L

Lew Pitcher

BGB / cr88192 said:
simple answer:
char is normally signed (granted, not all C compilers agree to this, as a
few older/oddball compilers have made it default to unsigned).
[...]

Most of the compilers I've used have char signed, but I've used
several where it's unsigned (several Cray systems, SGI Irix, IBM AIX).
And char is almost certainly signed on any EBCDIC-based system.

I'm afraid not. Although I now lack an IBM mainframe to check on, I believe
that char is (by necessity) unsigned on EBCDIC systems.

You see, the characters '0' through '9' are represented by the octets 0xF0
through 0xF9. Given that CHAR_BIT == 8 on EBCDIC systems, and the C standard
(1990, although it should be the same in all versions) states (in section
5.2.1.3) that
"Both the basic source and basic execution character sets shall have the
following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab,
vertical tab, and form feed. The representation of each member of the
source and execution basic character sets shall fit in a byte."
then the octets 0xF0 through 0xF9 are considered to be part of the "basic
source" and/or "basic execution" character sets.

Knowing this, we then consider the effect of section 6.2.5.3, in that
"An object declared as type char is large enough to store any member of
the basic execution character set. If a member of the basic execution
character set is stored in a char object, its value is guaranteed to be
positive."

So, 0xF0 through 0xF9 are guaranteed to be positive.

Since these systems use twos-complement math, and (for octets) values over
0x7f are considered negative in that math, the octets 0xF0 through 0xF9
would be considered to be negative values if char were signed.

It's safest not to think of either signed or unsigned as "normal".
Use plain char only if you don't *care* whether it's signed or
unsigned.

Agreed.

--
Lew Pitcher

Master Codewright & JOAT-in-training | Registered Linux User #112576
http://pitcher.digitalfreehold.ca/ | GPG public key available by request
---------- Slackware - Because I know what I'm doing. ------
 
K

Keith Thompson

Lew Pitcher said:
On September 29, 2009 20:17, in comp.lang.c, Keith Thompson ([email protected])
wrote: [...]
Most of the compilers I've used have char signed, but I've used
several where it's unsigned (several Cray systems, SGI Irix, IBM AIX).
And char is almost certainly signed on any EBCDIC-based system.

I'm afraid not. Although I now lack an IBM mainframe to check on, I believe
that char is (by necessity) unsigned on EBCDIC systems.
[explanation snipped]

Whoops! Yes, I knew that; I thought "unsigned" and typed "signed".
 
P

Phil Carmody

Keith Thompson said:
James Kuyper said:
Rahul wrote: [...]
2. In what scenarios should I define a character as unsigned?

You shouldn't. If you know that it is a character, use char. Only use
unsigned char if you're storing numbers rather than characters.

Well, mostly. The is*() and to*() functions in <ctype.h> expect
arguments representable as an unsigned char (or the value EOF).

Wishy-washy signedness of chars is full of traps - your list isn't exhaustive.

Phil
 
N

Nick Keighley

simple answer:
char is normally signed (granted, not all C compilers agree to this, as a
few older/oddball compilers have made it default to unsigned).

who says it's normally signed? I've seen compilers that made it
optional.
Since it hardly ever matters I don't understand why you care.

so 'char'=='character' is a misnomer (historical accident?...)

it's a historical fact. It's hardly a misnomer, an accident or even an
error.

char is a C type for holding characters. I agree it might have been
a good idea to have a byte type as well.
since for
most practical uses, ASCII and UTF-8 chars are better treated as unsigned

why should ASCII be unsigned? ASCII fits in 7 bits. Even extended
ASCIIs still manage fine as signed values.

(we just use 'char' as a matter of tradition, and cast to unsigned char
wherever it matters),

it never matters with character data. I use unsigned char when I'm
manipulating external representations (bytes or octets)
and for most other uses (where we want a signed byte),

that is, hardly ever. I'm tempted to say "never" as I don't think
I've ever needed tiny little integers. But I can imagine uses
for TLIs.
thinking of 'char' as 'character' is misleading

I disagree
(note that there are many
cases where a signed 8-bit value actually makes some sense).

such as?
many other (newer) languages reinterpret things,

but that doesn't matter

in my own uses, I typically use typedef to define 'byte' as 'unsigned char'

I commonly do this
and 'sbyte' as 'signed char'.

I never do this
I also use 'u8' and 's8' sometimes.

I dislike these. Seems to be letting the metal show through
 
B

bartc

Nick said:
who says it's normally signed? I've seen compilers that made it
optional.
Since it hardly ever matters I don't understand why you care.

why should ASCII be unsigned? ASCII fits in 7 bits. Even extended
ASCIIs still manage fine as signed values.

It would be perverse. Anyone who had to make the decision wouldn't
deliberately choose a signed format for character data. It's just asking for
trouble. (Try creating a histogram of the 256 character codes used in some
text: you will need the character code to index into an array. It's a lot
easier with 0 to 255 rather than -128 to 127.)

There should have been a char type (unsigned, but that doesn't even need
mentioning), and separate signed/unsigned ultra-short integers, i.e.
byte-sized. (All easily added with typedefs, but in practice, no-one
bothers.)
it never matters with character data. I use unsigned char when I'm
manipulating external representations (bytes or octets)


that is, hardly ever. I'm tempted to say "never" as I don't think
I've ever needed tiny little integers. But I can imagine uses
for TLIs.

There's a few, such as ensuring your data just fits into the memory of
your computer, rather than needing double the memory.

I haven't done any research, but I'm guessing that a big chunk of the 'int'
variables in my code only contain values representable in one byte. The
waste can usually be ignored on PCs, but in arrays, arrays of structs, and
so on, it can be significant.
 
E

Eric Sosman

bartc said:
It would be perverse. Anyone who had to make the decision, wouldn't
deliberately choose signed format for character data. It's just asking
for trouble. (Try creating a histogram of the 256 character codes used
in some text, you will need the character code to index into an array.
It's a lot easier with 0 to 255 rather than -128 to 127)

If omitting `-CHAR_MIN' from an array index makes things "a
lot easier" rather than just "a trifle easier," you must use a
different difficulty scale than I'm accustomed to.
There should have been a char type (unsigned, but that doesn't even need
mentioning), and separate signed/unsigned ultra-short integers, i.e.
byte-sized. (All easily added with typedefs, but in practice,
no-one bothers.)

There was a time when I shared this opinion, but I think that
if DMR had specified unsignedness for `char' in original C, the
language would not have become popular. Machines that do most
operations in CPU registers need to fetch that `char' from memory,
and the receiving register is usually wider than the `char' is.
So what happens to the register's extra bits when the `char' is
loaded? I've seen three styles: The extra bits are zeroed, or
are copied from the `char's high-order bit, or are unchanged.
(All these are real behaviors of real machines, by the way: I'm
not talking about the DeathStation product line.)

Had DMR insisted on unsigned `char', machines of the first
type would have been happy but those of the second type would
have incurred the penalty of a full-word AND (or equivalent)
after every `char' fetch. In the days of limited memory and
slow cycles this would have put C at a disadvantage on those
machines, a disadvantage that might well have been crippling.
Remember, too, that the compiler had to run in limited memory
and with slow cycles, and would have been hard-pressed to figure
out when the AND might be avoidable. Instead of being the cradle
of C, the PDP-11 might have been its grave. By leaving the
signedness of `char' unspecified, DMR allowed the PDP-11 and
similar machines to "do what comes naturally" and use code that
was efficient for the architecture.

Machines of the third type -- well, there's a limit to how
far you can allow the language to bend. A language that said
"The value of a `char' variable is unspecified and may change
unpredictably without anything being stored to it, but at least
the low-order bits will remain intact" would not gather much of
a following ... Fortunately, on the only machine of this type
that I've personally used, the operation of zeroing a register
is cheap and can be inserted just before each `char' load without
a huge time or space penalty.
 
B

bartc

Eric Sosman said:
If omitting `-CHAR_MIN' from an array index makes things "a
lot easier" rather than just "a trifle easier," you must use a
different difficulty scale than I'm accustomed to.

Avoiding having to mess about with offsets (and doing a double-take on
whether it's +CHAR_MIN or -CHAR_MIN, knowing the latter is negative), and
just not having to keep possible negativeness of your char values always in
mind, makes it a little more than a trifle easier.
There was a time when I shared this opinion, but I think that
if DMR had specified unsignedness for `char' in original C, the
language would not have become popular. Machines that do most
operations in CPU registers need to fetch that `char' from memory,
and the receiving register is usually wider than the `char' is.
So what happens to the register's extra bits when the `char' is
loaded? I've seen three styles: The extra bits are zeroed, or
are copied from the `char's high-order bit, or are unchanged.
(All these are real behaviors of real machines, by the way: I'm
not talking about the DeathStation product line.)

You're obviously familiar with a lot more machines types than I am.

I've only ever programmed (to machine level), PDP10, Z80, 6800, 8051(?), and
x86 series architectures. Most of these have registers that are the same
width as a character.

I'm not so familiar with PDP11, but I think byte values used the lower half
of each register, with no auto-extend from 8 to 16 bits, and anyway can work
with that lower half independently, effectively giving it byte-wide
registers.

So I don't think a separate, permanently unsigned char type would
have been an issue, *unless* the C language insists on char expressions
being evaluated as ints.

This would require unnecessary widening, and the default signedness of chars
might well depend on whether sign- or zero-extend was fastest. In that case,
*that* becomes the issue.
of C, the PDP-11 might have been its grave. By leaving the
signedness of `char' unspecified, DMR allowed the PDP-11 and
similar machines to "do what comes naturally" and use code that
was efficient for the architecture.
a following ... Fortunately, on the only machine of this type
that I've personally used, the operation of zeroing a register
is cheap and can be inserted just before each `char' load without
a huge time or space penalty.

OK, so which category does PDP11 come into? And what operation allows it to
load a char value into a register that will also sign-extend or clear the
top half?
 
B

bartc

Eric Sosman said:
bartc said:
[...]
I'm not so familiar with PDP11, but I think byte values used the lower
half of each register, with no auto-extend from 8 to 16 bits, and anyway
can work with that lower half independently, effectively giving it
byte-wide registers.

Let's just say that the bulk of this sentence demonstrates the
truth of its opening clause ...

Let's just say it demonstrates the paucity of the instruction set details I
peeked at before posting...

Nothing was said about any sort of widening when a destination was a
register, only that most operations were either 8 or 16 bits.
PDP-11 sign-extends (8-bit) bytes when loading them into (16-bit)
registers. The opcode is MOVB with a register destination (any of
R0..R5; it is unwise to target R6 or R7, aka SP and PC, with MOVB).

(I think I would take issue with DEC for having an instruction that does not
do what it says. So MOVB is 8 bits one end and 16 bits at the other? What
about MOVB R0,R1? INC R0? Or is the -B suffix only relevant for memory?)

If sign-extension was really something you couldn't get away from, then
perhaps it explains a couple of things about C, that no-one was bothered
with at the time because characters fit into 7 bits and it didn't matter.
 
B

bartc

Eric Sosman said:
bartc wrote:

- MOVB R0,R1 fetches the low-order eight bits of R0, places them
in the low-order eight bits of R1, and fills the high-order half
of R1 with copies of bit 7.

OK, thanks.
- INC R0 increments the sixteen-bit quantity in R0, and sets
assorted condition flags depending on the result.

I actually meant INCB, but forget it. I've already seen the instruction set
is not quite as orthogonal as I thought.
around, I found three different instructions that could be used to
return from an ordinary subroutine:
(The third was so slow that it ran longer than a pointless Usenet
thread.)

If you're referring to this one, I don't think investigating the origins of
C's quirky signed char type is such a waste of time.
 
B

BGB / cr88192

simple answer:
char is normally signed (granted, not all C compilers agree to this, as a
few older/oddball compilers have made it default to unsigned).

<--
who says it's normally signed? I've seen compilers that made it
optional.
Since it hardly ever matters I don't understand why you care.
-->

it is normally signed, since this is what a majority of the compilers on a
majority of the common architectures do.

granted, it is not safe to rely on this, and hence I often use an explicit
signed type if it really matters.

so 'char'=='character' is a misnomer (historical accident?...)

<--
its an historical fact. It's hardly a misnomer, an accident or even an
error.

char is a C type for holding characters. I agree it might have been
a good idea to have a byte type as well.
-->

but, to have it hold characters, be of a fixed size, and signed?...

I would have rather had said separate byte type, and have left "char" to be
a machine-dependent type, similar to short or int.

since for
most practical uses, ASCII and UTF-8 chars are better treated as unsigned

<--
why should ASCII be unsigned? ASCII fits in 7 bits. Even extended
ASCIIs
still manage fine as signed values.
-->

errm, not really.

in practice, extended ASCII sets are generally defined as, and assumed to
be, within the 128-255 range...


likewise, signedness will generally not mix well with things like
encoding/decoding UTF-8 chars, ...

so, it is common practice in my case to cast to "unsigned char" when doing
things involving UTF-8, ... but otherwise leave strings as the more
traditional "char *" type.

(we just use 'char' as a matter of tradition, and cast to unsigned char
wherever it matters),

<--
it never matters with character data. I use unsigned char when I'm
manipulating external representations (bytes or octets)
-->

it matters with character data if it happens to be UTF-8.

many simple strategies for working with text may mess up fairly hard if the
text is UTF-8 and things are treated as signed.

and for most other uses (where we want a signed byte),

<--
that is, hardly ever. I'm tempted to say "never" as I don't think
I've ever needed tiny little integers. But I can imagine uses
for TLIs.
-->

there are many cases, especially if one does things involving image
processing or signal processing...

one needs them much like one needs 16-bit floats, although, granted, there
are other, typically more convenient, ways of shoving floating-point values
into 8 or 16 bit quantities, in the absence of a floating point type
(typically revolving around log or sqrt...).

'fixed point' is also sometimes appropriate, but in these cases it really
depends on the data.


memory is not free, hence it matters that it not all be wasted
frivolously...

thinking of 'char' as 'character' is misleading

<--
I disagree
-->

it is misleading if your string happens to be UTF-16...

then, suddenly, char is unable to represent said characters...

even with UTF-8, 'char' is not able to represent a character, only a single
byte which could be part of a multi-byte character.

hence, the issue...

(note that there are many
cases where a signed 8-bit value actually makes some sense).

<--
such as?
-->

signal-processing related numeric functions, small geometric data, ...

you "could" store everything as floats, and then discover that one is eating
up 100s of MB of memory on data which could easily be stored in much less
space (say, 1/4 the space).


many other (newer) languages reinterpret things,

but that doesn't matter

<snip>

<--
in my own uses, I typically use typedef to define 'byte' as 'unsigned
char'

I commonly do this
and 'sbyte' as 'signed char'.

I never do this
I also use 'u8' and 's8' sometimes.

I dislike these. Seems to be letting the metal show through
-->

s8/u8, s16/u16, s32/u32, ...


these are good for defining structures where values are expected to be
specific sizes...

my (newer) x86 interpreter sub-project uses these sorts of types
extensively, mostly as, with x86 machine code, things matter down to the
single bits...


many other tasks may involve similar levels of bit-centric twiddling, and so
the naming may also hint at the possible use of bit-centric logic code...


however, for most more general tasks, I use byte and sbyte instead...
 
