dump the real string

toylet

That's wrong. length() will return the number of characters in the string.
You mean ASCII doesn't work for Chinese?

I was talking about the length(). Where is the connection between
Chinese characters and length()?

All computer data are referenced as 8-bit bytes these days.
 
Martien Verbruggen

[please leave attribution in place]
I was talking about the length(). Where is the connection between
Chinese characters and length()?

length() gives the length of a string in characters. Chinese
characters are not stored in 8-bit bytes.
All computer data are referenced as 8-bit bytes these days.

Nonsense. And what a particular machine/implementation calls a "byte"
has very little to do with characters.

Martien
 
toylet

length() gives the length of a string in characters. Chinese
characters are not stored in 8-bit bytes.

What is a character in Perl's sense?

Under the BIG5 character encoding, each Chinese alphabet (or character)
is stored as two bytes. One byte always equals 8 bits anyway.
Nonsense. And what a particular machine/implementation calls a "byte"
has very little to do with characters.

I think we need to define "character".
 
Martien Verbruggen

What is a character in Perl's sense?

There is no simple and easy answer to that.

I think your question is probably best answered by referring you to
the perluniintro and perlunicode documentation (which come with
Perl); specifically the section titled "Byte and Character
semantics", and to advise you to read up on unicode and the various
encoding schemes that come with it.
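
A minimal sketch of the byte/character distinction those documents describe. It assumes the data arrives as UTF-8 encoded octets and uses the core Encode module; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these six octets are the UTF-8 encoding of the two
# Chinese characters U+4E2D and U+6587.
my $octets = "\xE4\xB8\xAD\xE6\x96\x87";

# Decoding turns the octet string into a character string.
my $chars = decode('UTF-8', $octets);

print length($octets), "\n";   # 6, length() counts octets here
print length($chars),  "\n";   # 2, length() counts characters here
```

The same length() call gives different answers depending on whether the string holds undecoded octets or decoded characters.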
Under the BIG5 character encoding, each Chinese alphabet (or character)
is stored as two bytes. One byte always equals 8 bits anyway.

No, it does not. An octet is 8 bits. The term "byte" is
context-sensitive and fluid. It could be 9 bits, or it could be 16 or
32 bits behind the scenes. It is _generally_ 8 bits, and in certain
contexts it is always 8 bits, but this is certainly not a given in all
contexts. Wherever, for example, a byte refers to the underlying C
type char, it will be whatever the size of that type is.
I think we need to define "character".

See above.

I am assuming your thinking stems from an "in C the char type is a
character" background?

It is important to stop thinking of characters as matching C's char
type, and to stop thinking of C's char type always being 8 bits (even
though a char is always a byte).

Neither is true. Not even in C.

Martien
 
Jürgen Exner

toylet said:
You meant each char in a perl string is not stored as one byte?

I meant that in general it is not possible to store every character in a
single byte. Actually the vast majority of characters in the more commonly
spoken languages typically require at least two bytes to store them.

jue
 
Jürgen Exner

toylet said:
I was talking about the length(). Where is the connection between
Chinese characters and length()?

Maybe that a text in Chinese with 20 characters typically requires 40 bytes
to be stored?
So what do you want to know? The length of the string in characters or the
size of the allocated memory? You were asking for the memory size.
All computer data are referenced as 8-bit bytes these days.

Which means 256 distinct values, which means there is just no way to encode
those tens of thousands of Chinese characters in one single byte.
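
As a rough illustration of the 20 characters versus 40 bytes point, here is a sketch using the core Encode module. It assumes the Big5 variant that Encode calls 'big5-eten' and simply repeats one character (U+4E2D) twenty times as sample text.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text: the character U+4E2D repeated 20 times (illustrative only).
my $text = "\x{4E2D}" x 20;

# Assumption: 'big5-eten' is the Big5 encoding wanted here.
my $big5 = encode('big5-eten', $text);

print length($text), "\n";   # 20 characters
print length($big5), "\n";   # 40 bytes once encoded as Big5
```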

jue
 
toylet

length() gives the length of a string in characters. Chinese
There is no simple and easy answer to that.
I think your question is probably best answered by referring you to
the perluniintro and perlunicode documentation (which come with
Perl); specifically the section titled "Byte and Character
semantics", and to advise you to read up on unicode and the various
encoding schemes that come with it.

You meant length() would react to unicode settings in Perl?
It is important to stop thinking of characters as matching C's char
type, and to stop thinking of C's char type always being 8 bits (even
though a char is always a byte).

I think one byte always equals 8 bits. All computer courses taught
that. 9-bit byte? What machines do that?
 
toylet

Maybe that a text in Chinese with 20 characters typically requires 40 bytes
to be stored?
So what do you want to know? The length of the string in characters or the
size of the allocated memory? You were asking for the memory size.

I didn't expect my question on displaying the bytes in a string would
end up talking about multi-lingual issues.
Which means 256 distinct values which means there is just no way to encode
those tens of thousands of Chinese characters in one single byte.

Of course.
 
Martien Verbruggen

You meant length() would react to unicode settings in Perl?

Have you read the documentation?
I think one byte always equals 8 bits. All computer courses taught
that. 9-bit byte? What machines do that?

Various PDP architectures do that. 36-bit architectures. There are
other architectures that use larger power-of-two bytes.

Use Google, or ask in a usenet group that talks about these sorts of
things all the time. I'm done with this subject.

Martien
 
gnari

toylet said:
it worked as well. So it's the default separator used by split()?

read the docs before asking things like this
perldoc -f split

it is not so much a default separator, as a special case
of the function split with no arguments, that has an
extra functionality.

gnari
 
gnari

toylet said:
I didn't expect my question on displaying the bytes in a string would
end up talking about multi-lingual issues.

it was just that someone corrected you when you implied that length()
would give the number of bytes, which is only valid for some
encodings/systems. the thread only went on because you protested.

gnari
 
Andrew McGregor

toylet said:
hex() should not be relevant as I need to convert from numeric to hex
digits. Already knew about substr().

Then why use a non-numeric example, 'a'?
 
Joe Smith

toylet said:
You meant length() would react to unicode settings in Perl?

Yes. UTF-8 encoding uses 8, 16, 24 or 32 bits per character.
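
A quick sketch of those widths, which also dumps the encoded bytes as hex. It uses the core Encode module; the four code points are just examples picked to land in each size class.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Example code points: ASCII 'A', Latin-1 e-acute, a CJK ideograph,
# and a character outside the Basic Multilingual Plane.
my @code_points = (0x41, 0xE9, 0x4E2D, 0x1F600);

# Number of UTF-8 bytes needed for each character.
my @widths = map { length encode('UTF-8', chr $_) } @code_points;

for my $i (0 .. $#code_points) {
    my $bytes = encode('UTF-8', chr $code_points[$i]);
    # unpack 'H*' shows the raw bytes as hex digits.
    printf "U+%04X: %d byte(s), hex %s\n",
        $code_points[$i], $widths[$i], unpack('H*', $bytes);
}
# The widths come out as 1, 2, 3 and 4 bytes respectively.
```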
I think one byte always equals 8 bits. All computer courses taught
that. 9-bit byte? What machines do that?

Three of the first five computers connected to the ARPANET were
36-bit computers. They used 7-bit ASCII for regular text,
SIXBIT for COBOL data, strings of 5-bit codes for FORTRAN error
messages. When talking to other computers, the PDP-10 used
8-bit bytes, 9-bit bytes, 12-bit bytes, 16-bit bytes and 18-bit bytes.

A byte is defined to be a contiguous set of bits. When talking
about 8-bit bytes, the proper term is "octet".

-Joe http://www.inwap.com/pdp10/