String exceeding length - Getting absolute string length

  • Thread starter james.w.appleby

james.w.appleby

Hello,

I am having a problem when inputting very long strings into a database.
The application I am writing can use different databases (thanks to
the wonders of JDBC) so this issue has been causing problems on both
Oracle and SQL Server.

Because one of the design objectives was to support any JDBC-compatible
database, a concern was raised about column widths. It was therefore
decided that the maximum column width for a VARCHAR would be a
configurable value. We knew that, in theory, a value could be longer
than a single column allows, so we introduced a sequence number to let
one value span multiple rows. (Don't ask me why we didn't use CLOBs
instead, this is the schema I'm stuck with.)

We now need to store base64 data in the same fields. The problem is
that, in one example, a string that is 4000 characters long according
to the Java String object has a physical size of approximately 4430.
This seems to be because of the amount of mark-up involved, either in
the base64 data or possibly in the text between.

It occurs to me that while a non-ASCII value may be only a single
character in a Unicode string, it is 6 characters in UTF-8. Therefore
I'm looking for a way of calculating the absolute length, rather than a
count of characters.

Is this possible or will I have to change the schema?
 

Hybris

On Tue, 09 Jan 2007 04:34:45 -0800, james.w.appleby wrote:

I'm looking for a way of calculating the absolute length, rather than a
count of characters.

See the String method getBytes.
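
A minimal sketch of that suggestion, assuming the database stores text as UTF-8 and that the column limit is counted in bytes (neither is confirmed in this thread). The class and method names are made up for illustration, and StandardCharsets needs Java 7 or later; on older JVMs getBytes("UTF-8") should do the same job.

```java
import java.nio.charset.StandardCharsets;

final class ByteLengthDemo {
    /** Number of bytes the string occupies when encoded as UTF-8 (assumed database encoding). */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String base64 = "SGVsbG8=";     // base64 text uses only ASCII characters
        String accented = "caf\u00e9";  // the last character, U+00E9, is outside ASCII
        System.out.println(base64.length() + " chars, " + utf8Length(base64) + " bytes");     // 8 chars, 8 bytes
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes"); // 4 chars, 5 bytes
    }
}
```

Since the base64 alphabet is pure ASCII, the base64 portion itself should take one byte per character in UTF-8, so any growth would have to come from other characters in the string.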
 

Ian Wilson

james.w.appleby said:
It occurs to me that while a non-ASCII value may be only a single
character in a Unicode string,

I think you mean the opposite, that an ASCII (not non-ASCII) character
will be represented in UTF-8 using a single *byte*.

james.w.appleby said:
it is 6 characters in UTF-8.

No it isn't. UTF-8 uses a *variable* number of *bytes* for one Unicode
character.

james.w.appleby said:
Therefore
I'm looking for a way of calculating the absolute length, rather than a
count of characters.

String has a getBytes() method for this purpose.
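
If the schema has to stay as it is, one possible approach is to split each value by encoded size rather than by character count before writing the sequence-numbered rows. The following is only a sketch under assumptions: the column limit is taken to be in bytes, the database encoding is taken to be UTF-8, and Utf8Chunker and split are hypothetical names, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8Chunker {
    /**
     * Splits text into pieces of at most maxBytes UTF-8 bytes each,
     * without ever cutting a code point (or surrogate pair) in half.
     */
    static List<String> split(String text, int maxBytes) {
        if (maxBytes < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // UTF-8 width of this code point: 1, 2, 3 or 4 bytes.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (currentBytes + cpBytes > maxBytes) {
                // The current chunk is full; start a new one.
                chunks.add(current.toString());
                current.setLength(0);
                currentBytes = 0;
            }
            current.appendCodePoint(cp);
            currentBytes += cpBytes;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Each element of the returned list would then go into its own row, with the list index serving as the sequence number. Calling split(value, 4000) would be the equivalent of the 4000-character limit mentioned above, if that limit turns out to be a byte limit.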
 

Oliver Wong

Ian Wilson said:
I think you mean the opposite, that an ASCII (not non-ASCII) character
will be represented in UTF-8 using a single *byte*.


No it isn't. UTF-8 uses a *variable* number of *bytes* for one Unicode
character.

And even then, a character in UTF-8 only takes from 1 to 4 octets. The
code point values start at 0x000000 and go to 0x10FFFF.

- Oliver
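
The 1-to-4 range is easy to check by encoding one sample code point from each width class. This is just an illustrative sketch; the class name, method name and sample values are my own choices.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {
    /** UTF-8 byte count for a single code point, measured by actually encoding it. */
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample per width class, plus the highest valid code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, encodedLength(cp));
        }
    }
}
```

The samples come out as 1, 2, 3, 4 and 4 bytes respectively, so even the largest code point stays within four octets.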
 

John W. Kennedy

Oliver said:
And even then, UTF-8 only ranges from 1 to 4 octets. The values start
at 0x000000 and go to 0x10FFFF.

CESU-8 and Java's "Modified UTF-8" use as many as six octets for one
character, because they first encode characters above U+FFFF as a
UTF-16 surrogate pair, and then UTF-8 encode each half of the pair.
"UTF-8", albeit wrongly, is often taken to include one or both of those
schemes, so the incorrect figure of 6 is often encountered.
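
The difference can be seen from Java itself, since DataOutputStream.writeUTF is documented to write modified UTF-8 preceded by a two-byte length. The sketch below compares the two encodings for one character above U+FFFF; the class and method names are hypothetical, and UncheckedIOException assumes Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Demo {
    /** Payload size written by DataOutputStream.writeUTF, excluding its 2-byte length prefix. */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // writeUTF rejects strings whose encoded form exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    /** Size of the same string in standard UTF-8. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One character above U+FFFF, held in Java as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println("standard UTF-8: " + standardUtf8Length(supplementary) + " bytes"); // 4 bytes
        System.out.println("modified UTF-8: " + modifiedUtf8Length(supplementary) + " bytes"); // 6 bytes
    }
}
```

So the figure of six would only matter if the data actually passes through writeUTF or a similar modified encoding; getBytes with the UTF-8 charset produces standard UTF-8.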
 
