dump the real string

toylet · Feb 26, 2004

Tad said:
What happened when you tried it?

It worked as well. So it's the default separator used by split()?

toylet · Feb 26, 2004

That's wrong. length() will return the number of characters in the string.

You mean ASCII doesn't work for Chinese?

I was talking about the length(). Where is the connection between
Chinese characters and length()?

All computer data are referenced as 8-bit bytes these days.

Martien Verbruggen · Feb 26, 2004

[please leave attribution in place]

I was talking about the length(). Where is the connection between
Chinese characters and length()?

length() gives the length of a string in characters. Chinese
characters are not stored in 8-bit bytes.

All computer data are referenced as 8-bit bytes these days.

Nonsense. And what a particular machine/implementation calls a "byte"
has very little to do with characters.

Martien
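
A minimal sketch of the characters-versus-bytes point, assuming a Perl with the bundled Encode module (5.8 or later). The sample text and the variable names are illustrative only and were not part of the original thread:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two Chinese characters given as code points, so this file itself stays ASCII.
my $chars = "\x{4e2d}\x{6587}";

# Encoding turns the character string into a string of octets.
my $octets = encode('UTF-8', $chars);

print "characters: ", length($chars),  "\n";
print "octets:     ", length($octets), "\n";

# Decoding the octets gives back a character string of the original length.
my $round_trip = decode('UTF-8', $octets);
print "round trip: ", length($round_trip), "\n";
```

On a decoded string length() counts characters; on the encoded form it counts octets, because each element of that string is one octet.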

toylet · Feb 26, 2004

length() gives the length of a string in characters. Chinese
characters are not stored in 8-bit bytes.

What is a character in Perl's sense?

Under the BIG5 character encoding, each Chinese alphabet (or character)
is stored as two bytes. One byte always equals 8 bits anyway.

Nonsense. And what a particular machine/implementation calls a "byte"
has very little to do with characters.

I think we need to define "character".

Martien Verbruggen · Feb 26, 2004

What is a character in Perl's sense?

There is no simple and easy answer to that.

I think your question is probably best answered by referring you to
the perluniintro and perlunicode documentation (which come with
Perl); specifically the section titled "Byte and Character
semantics", and to advise you to read up on unicode and the various
encoding schemes that come with it.

Under the BIG5 character encoding, each Chinese alphabet (or character)
is stored as two bytes. One byte always equals 8 bits anyway.

No, it does not. An octet is 8 bits. The term "byte" is
context-sensitive and fluid. It could be 9 bits, or it could be 16 or
32 bits behind the scenes. It is _generally_ 8 bits, and in certain
contexts it is always 8 bits, but this is certainly not a given in all
contexts. Wherever, for example, a byte refers to the underlying C
type char, it will be whatever the size of that type is.

I think we need to define "character".

See above.

I am assuming your thinking stems from an "in C the char type is a
character" background?

It is important to stop thinking of characters as matching C's char
type, and to stop thinking of C's char type always being 8 bits (even
though a char is always a byte).

Neither is true. Not even in C.

Martien

Jürgen Exner · Feb 26, 2004

toylet said:
You meant each char in a Perl string is not stored as one byte?

I meant that in general it is not possible to store every character in a
single byte. Actually the vast majority of characters in the more commonly
spoken languages typically require at least two bytes to store them.

jue

Jürgen Exner · Feb 26, 2004

toylet said:
I was talking about the length(). Where is the connection between
Chinese characters and length()?

Maybe that a text in Chinese with 20 characters typically requires 40 bytes
to be stored?
So what do you want to know? The length of the string in characters or the
size of the allocated memory? You were asking for the memory size.

All computer data are referenced as 8-bit bytes these days.

Which means 256 distinct values which means there is just no way to encode
those tens of thousands of Chinese characters in one single byte.

jue
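
The "20 characters, 40 bytes" arithmetic can be checked directly. This is a rough sketch that assumes the Encode::TW tables shipped with Perl, where Big5 goes by the name 'big5-eten'; the repeated character is just a convenient sample:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Twenty copies of one Chinese character (U+4E2D), as a character string.
my $text = "\x{4e2d}" x 20;

# Encode the characters into Big5 octets.
my $big5 = encode('big5-eten', $text);

printf "length in characters: %d\n", length $text;
printf "length in Big5 bytes: %d\n", length $big5;
```

Text that mixes in ASCII would come out shorter than two bytes per character, since Big5 stores ASCII in a single byte.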

toylet · Feb 26, 2004

length() gives the length of a string in characters. Chinese

There is no simple and easy answer to that.
I think your question is probably best answered by referring you to
the perluniintro and perlunicode documentation (which come with
Perl); specifically the section titled "Byte and Character
semantics", and to advise you to read up on unicode and the various
encoding schemes that come with it.

You meant length() would react to unicode settings in Perl?

It is important to stop thinking of characters as matching C's char
type, and to stop thinking of C's char type always being 8 bits (even
though a char is always a byte).

I think one byte always equals 8 bits. All computer courses teach
that. 9-bit byte? What machines do that?

toylet · Feb 26, 2004

Maybe that a text in Chinese with 20 characters typically requires 40 bytes
to be stored?
So what do you want to know? The length of the string in characters or the
size of the allocated memory? You were asking for the memory size.

I didn't expect my question on displaying the bytes in a string would
end up talking about multi-lingual issues.

Which means 256 distinct values which means there is just no way to encode
those tens of thousands of Chinese characters in one single byte.

Of course.

Martien Verbruggen · Feb 26, 2004

You meant length() would react to unicode settings in Perl?

Have you read the documentation?

I think one byte always equals 8 bits. All computer courses teach
that. 9-bit byte? What machines do that?

Various PDP architectures do that, as do other 36-bit architectures. There
are also architectures that use larger, power-of-two byte sizes.

Use Google, or ask in a usenet group that talks about these sorts of
things all the time. I'm done with this subject.

Martien

gnari · Feb 26, 2004

toylet said:
It worked as well. So it's the default separator used by split()?

read the docs before asking things like this
perldoc -f split

it is not so much a default separator as a special case
of the function split with no arguments, which has some
extra functionality.

gnari
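
For readers without perldoc at hand, here is a small sketch of the special case documented under perldoc -f split. The sample line and the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $line = "  alpha  beta gamma ";

# Special case: a single-space string as the pattern splits on runs of
# whitespace and skips any leading whitespace first.
my @awk_style = split ' ', $line;

# An ordinary regex matching one space keeps the empty leading fields.
my @literal = split / /, $line;

# With no arguments at all, split works on $_ as if called as split(' ', $_).
my @no_args;
for ($line) {
    @no_args = split;
}

print scalar(@awk_style), " fields: @awk_style\n";
print scalar(@literal),   " fields with / /\n";
print scalar(@no_args),   " fields with no arguments\n";
```

The first and last calls return the three words; the / / version also returns the empty fields produced by the leading and doubled spaces.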

gnari · Feb 26, 2004

toylet said:
I didn't expect my question on displaying the bytes in a string would
end up talking about multi-lingual issues.

it was just that someone corrected you when you implied that length()
would give the number of bytes, which is only valid for some
encodings/systems. the thread only went on because you protested.

gnari

Andrew McGregor · Feb 26, 2004

toylet said:
hex() should not be relevant as I need to convert from numeric to hex
digits. Already knew about substr().

Then why use a non-numeric example, 'a'?
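
To make the direction of each conversion concrete, here is a short sketch. It assumes the string being dumped holds plain bytes (no characters above 255), and the sample values are illustrative only:

```perl
use strict;
use warnings;

# hex() reads hex digits and returns a number.
my $number = hex('61');

# sprintf with %x goes the other way: number to hex digits.
my $digits = sprintf '%02x', 97;

# Dumping a string: take each byte's numeric value and format it.
my $string = 'abc';
my $dump   = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $string;

print "$number $digits\n";
print "$dump\n";
```

ord('a') gives the same 97 that unpack produces for the first byte, which is why a non-numeric example like 'a' needs that extra step before it can be formatted as hex digits.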

Joe Smith · Feb 26, 2004

toylet said:
You meant length() would react to unicode settings in Perl?

Yes. UTF-8 encoding uses 8, 16, 24 or 32 bits per character.

I think one byte always equals 8 bits. All computer courses teach
that. 9-bit byte? What machines do that?

Three of the first five computers connected to the ARPANET were
36-bit computers. They used 7-bit ASCII for regular text,
SIXBIT for COBOL data, strings of 5-bit codes for FORTRAN error
messages. When talking to other computers, the PDP-10 used
8-bit bytes, 9-bit bytes, 12-bit bytes, 16-bit bytes and 18-bit bytes.

A byte is defined to be a contiguous set of bits. When talking
about 8-bit bytes, the proper term is "octet".

-Joe http://www.inwap.com/pdp10/
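
The four UTF-8 widths can be observed with one sample code point per width. This is only a sketch; the code points chosen here are arbitrary examples and the %bits hash is a hypothetical helper for the demonstration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample code point for each UTF-8 width.
my @samples = (0x41, 0xE9, 0x4E2D, 0x1F600);

my %bits;
for my $cp (@samples) {
    # chr() builds a one-character string; encode() turns it into octets.
    my $octets = encode('UTF-8', chr $cp);
    $bits{$cp} = 8 * length $octets;
    printf "U+%04X: %d bits\n", $cp, $bits{$cp};
}
```

Each sample is still a single character to length(); only the encoded form varies in size.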
