String exceeding length - Getting absolute string length

james.w.appleby · Jan 9, 2007

Hello,

I am having a problem when inputting very long strings into a database.
The application I am writing can use different databases (thanks to
the wonders of JDBC) so this issue has been causing problems on both
Oracle and SQL Server.

Because one of the design objectives was to support any JDBC-compatible
database, a concern was raised about text widths. It was therefore
decided that the maximum column width for a VARCHAR would be a
configurable value. We theoretically knew that data could be more than
a single line so we introduced a sequence number to allow multiple
rows. (Don't ask me why we didn't use CLOBs instead, this is the
schema I'm stuck with.)

We now need to store base64 data in the same fields. The problem is
that, in one example, a string of 4000 characters as counted by the Java
String object has a physical size of approximately 4430. This seems to be
because of the amount of mark-up involved, either in the base64 data or
possibly in the text between.

It occurs to me that while a non-ASCII value may be only a single
character in a Unicode string, it is 6 characters in UTF-8. Therefore
I'm looking for a way of calculating the absolute length, rather than a
count of characters.

Is this possible or will I have to change the schema?

Hybris · Jan 9, 2007

On Tue, 09 Jan 2007 04:34:45 -0800, james.w.appleby wrote:

I'm looking for a way of calculating the absolute length, rather than a
count of characters.

See the String method getBytes().
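As a rough illustration of that suggestion, the sketch below compares String.length() with the size of the encoded byte array. It assumes the database stores the text as UTF-8, which the thread does not confirm, and the class and method names (ByteLength, utf8Length) are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class ByteLength {
    // Number of bytes the string occupies once encoded as UTF-8.
    // Assumption: the target column is measured in UTF-8 bytes.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "caf\u00e9"; // ends with U+00E9
        String euro = "\u20ac";        // U+20AC, the euro sign

        // length() counts UTF-16 code units, not encoded bytes.
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");       // 3 chars, 3 bytes
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes"); // 4 chars, 5 bytes
        System.out.println(euro.length() + " chars, " + utf8Length(euro) + " bytes");         // 1 chars, 3 bytes
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which the no-argument getBytes() uses.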

Ian Wilson · Jan 10, 2007

It occurs to me that while a non-ASCII value may be only a single
character in a Unicode string,

I think you mean the opposite, that an ASCII (not non-ASCII) character
will be represented in UTF-8 using a single *byte*.

it is 6 characters in UTF-8.

No it isn't. UTF-8 uses a *variable* number of *bytes* for one Unicode
character.

Therefore
I'm looking for a way of calculating the absolute length, rather than a
count of characters.

String has a getBytes() method for this purpose.
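Since the encoded size is what has to fit in the column, one possible way to fill the multi-row schema described in the question is to split the text by byte budget rather than by character count. The following is only a sketch under assumptions: the column limit is taken to be in UTF-8 bytes, and Utf8Chunker, split and utf8Size are hypothetical names, not part of any existing API.

```java
import java.util.ArrayList;
import java.util.List;

final class Utf8Chunker {
    // Splits text into pieces whose UTF-8 encoding is at most maxBytes each,
    // without cutting a code point (including surrogate pairs) in half.
    static List<String> split(String text, int maxBytes) {
        if (maxBytes < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int cpBytes = utf8Size(cp);
            if (currentBytes + cpBytes > maxBytes) {
                // The next code point would overflow this row, so start a new one.
                chunks.add(current.toString());
                current.setLength(0);
                currentBytes = 0;
            }
            current.appendCodePoint(cp);
            currentBytes += cpBytes;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    // UTF-8 size of a single code point: 1 to 4 bytes.
    static int utf8Size(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```

Each returned piece could then be stored in its own row with the next sequence number. Whether a VARCHAR limit is counted in bytes or in characters depends on the database and its configuration, so that should be checked before relying on a byte budget.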

Manfred Rosenboom · Jan 10, 2007

Hi James,

The following Sun Tech Tip may be worth reading:

Tech Tip #1: How long is your String object?
http://java.sun.com/mailers/techtips/corejava/2006/tt0822.html#1

Best,
Manfred

Oliver Wong · Jan 10, 2007

Ian Wilson said:
I think you mean the opposite, that an ASCII (not non-ASCII) character
will be represented in UTF-8 using a single *byte*.

No it isn't. UTF-8 uses a *variable* number of *bytes* for one Unicode
character.

And even then, UTF-8 only ranges from 1 to 4 octets. The values start
at 0x000000 and go to 0x10FFFF.

- Oliver
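To make the 1 to 4 octet range concrete, here is a small sketch that encodes one sample code point from each size class with the standard UTF-8 charset. The class name Utf8Widths and the choice of sample code points are illustrative only.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {
    // Encodes a single code point as UTF-8 and returns the number of bytes used.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample per size class, plus the highest valid code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Bytes(cp));
        }
        // Prints 1, 2, 3, 4 and 4 bytes respectively.
    }
}
```

Note that the last two samples lie above U+FFFF, so each occupies two char values in a Java String even though it is a single code point.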

John W. Kennedy · Jan 11, 2007

Oliver said:
And even then, UTF-8 only ranges from 1 to 4 octets. The values start
at 0x000000 and go to 0x10FFFF.

CESU-8 and Java's "Modified UTF-8" use as many as six, because they
first encode characters above U+FFFF as UTF-16, and then UTF-8 encode
the result. "UTF-8", albeit wrongly, is often taken to include one or
both of those schemes, so the incorrect figure of 6 is often encountered.
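The difference can be observed from Java itself, because DataOutputStream.writeUTF writes Modified UTF-8 (preceded by a two-byte length field). The sketch below compares it with the standard UTF-8 charset for one character above U+FFFF; the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Demo {
    // Payload size written by DataOutputStream.writeUTF,
    // excluding the 2-byte length prefix it always emits first.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    // Size under the standard UTF-8 charset.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println("standard UTF-8: " + standardUtf8Length(supplementary)); // 4
        System.out.println("modified UTF-8: " + modifiedUtf8Length(supplementary)); // 6
    }
}
```

String.getBytes with the UTF-8 charset produces standard UTF-8, so the six-byte form only matters when data passes through APIs such as writeUTF.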
