byte count unicode string


willie

Marc 'BlackJack' Rintsch:
>That is the correct way.


# Apologies if I'm being dense, but it seems
# unusual that I'd have to make a copy of a
# unicode string, converting it into a byte
# string, before I can determine the size (in bytes)
# of the unicode string. Can someone provide the rationale
# for that or correct my misunderstanding?

# Thanks.
 

John Machin

willie said:
Marc 'BlackJack' Rintsch:




# Apologies if I'm being dense, but it seems
# unusual that I'd have to make a copy of a
# unicode string, converting it into a byte
# string, before I can determine the size (in bytes)
# of the unicode string. Can someone provide the rationale
# for that or correct my misunderstanding?

You initially asked "What's the correct way to get the byte count of a
unicode (UTF-8) string".

It appears you meant "How can I find how many bytes there are in the
UTF-8 representation of a Unicode string without manifesting the UTF-8
representation?".

The answer is, "You can't", and the rationale would have to be that
nobody thought of a use case for counting the length of the UTF-8 form
but not creating the UTF-8 form. What is your use case?

Cheers,
John
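
For readers following along, the approach being discussed can be sketched as below. This is a minimal illustration in Python 3 terms, where the text type is `str` rather than `unicode`; the helper name `utf8_byte_count` and the sample string are made up for the example and do not come from the thread.

```python
def utf8_byte_count(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text.

    This builds the encoded bytes object and measures it, which is
    the "make a copy first" step the original poster asked about.
    """
    return len(text.encode("utf-8"))


sample = "na\u00efve \u20ac"       # the text 'naïve €'
print(len(sample))                 # 7 code points
print(utf8_byte_count(sample))     # 10 bytes: 'ï' takes 2, '€' takes 3
```

The two numbers differ because `len()` on a text string counts code points, while the byte count depends on the encoding chosen.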
 

MonkeeSage

John said:
The answer is, "You can't", and the rationale would have to be that
nobody thought of a use case for counting the length of the UTF-8 form
but not creating the UTF-8 form. What is your use case?

Playing DA here, what if you need to send the byte-count on a server
via a header, but need the utf8 representation for the actual data?

Regards,
Jordan
 

Diez B. Roggisch

MonkeeSage said:
Playing DA here, what if you need to send the byte-count on a server
via a header, but need the utf8 representation for the actual data?

So what - you need it in the end, don't you?

The runtime complexity of the calculation will be the same - you have to
consider each character, so it's O(n).

Of course you will roughly double the memory consumption - the original
unicode being represented as UCS2 or UCS4.

But then - if that really is a problem, how would you work with that
string anyway?

So you have to resort to slicing and computing the size of the parts,
which will remedy that easily.

Diez
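
The slicing idea above might look roughly like this. It is only a sketch, written for Python 3, where a `str` is indexed by code point so a slice boundary cannot split a character; the function name `utf8_byte_count_chunked` and the default `chunk_size` are assumptions for illustration, not something from the thread.

```python
def utf8_byte_count_chunked(text: str, chunk_size: int = 4096) -> int:
    """Sum the UTF-8 sizes of fixed-size slices of text.

    Only one encoded slice is alive at a time, so the extra memory
    is bounded by the chunk size rather than the whole string.
    """
    total = 0
    for start in range(0, len(text), chunk_size):
        # Each code point encodes independently in UTF-8, so the
        # per-slice byte counts add up to the full byte count.
        total += len(text[start:start + chunk_size].encode("utf-8"))
    return total


big = "\u20ac" * 100_000           # 100,000 euro signs, 3 bytes each
print(utf8_byte_count_chunked(big))
```

The work is still O(n); the slicing only trades one large temporary for many small ones.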
 

Duncan Booth

MonkeeSage said:
Playing DA here, what if you need to send the byte-count on a server
via a header, but need the utf8 representation for the actual data?

Then you still need both the data and its length. John asked for an example
where you need only the length and not the data itself.

I guess you could invent something like inserting a string into a database
which has fixed size fields, silently truncates fields which are too long
and stores the strings internally in utf-8 but only accepts ucs-2 in its
interface. Pretty far fetched, but if it exists I suspect that an extra
utf-8 encoding here or there is the least of your problems.
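
If only the length were wanted, as in the hypothetical database case above, it can also be worked out from the code points without building the encoded bytes at all, since the UTF-8 size of a code point depends only on its numeric range. The sketch below assumes Python 3; `utf8_length` is a made-up helper, not a standard library function, and in a pure-Python loop it is likely slower than simply encoding.

```python
def utf8_length(text: str) -> int:
    """Count UTF-8 bytes for text without creating the encoded form."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:          # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:       # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:     # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                  # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


print(utf8_length("a\u00e9\u20ac\U0001F600"))   # 1 + 2 + 3 + 4 = 10
```

One caveat: a lone surrogate is counted here as 3 bytes, whereas `str.encode("utf-8")` would raise an error for it, so the two approaches agree only on well-formed text.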
 

Paul Rubin

Duncan Booth said:
I guess you could invent something like inserting a string into a database
which has fixed size fields, silently truncates fields which are too long
and stores the strings internally in utf-8 but only accepts ucs-2 in its
interface. Pretty far fetched, but if it exists I suspect that an extra
utf-8 encoding here or there is the least of your problems.

More direct would be to add an option to the http parser to return the
utf8 received from the browser as a byte array still in utf8, instead
of decoding it so that it needs to be re-encoded before insertion into
the database. A lot of the time, the application doesn't need to look
at the string anyway.
 

Virgil Dupras

MonkeeSage said:
OK, so the devil always loses. ;P

Regards,
Jordan

Huh? The devil always loses? *turns TV on, watches the news, turns TV
off* Nope, buddy. Quite the contrary.
 
