Adding an unsigned byte type to the JVM

Roedy Green

I poked around for a short while in the goldfish bowl book to see what
would need to change in the JVM if an unsigned byte type were added to
Java.

Here is what I noticed.

Arithmetic works on the stack with ints and longs. I did not see any op
codes to do arithmetic on byte, char or short.

There is a byte array load (baload), but I did not see a corresponding
load for a single isolated byte (bload). I think single bytes must be
stored as full ints. There is an op code i2b that corrals a value into
the range -128..127. I believe it acts like an & 0xff followed by a
sign extend. I think signed bytes are stored with their high-order
bits already in place! There is thus no need for byte sign extension
on load for isolated bytes.
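
To make that concrete, here is a small sketch in plain Java of that
mask-then-sign-extend behaviour; the (byte) cast is where javac emits
i2b (the printed values are my expectation, worth verifying):

    int v = 0x1A5;                   // 421; the low byte is 0xA5
    byte b = (byte) v;               // javac emits i2b for this cast
    System.out.println(b);           // -91, i.e. 0xA5 sign-extended
    System.out.println(b & 0xff);    // 165, masking recovers the unsigned value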

I suspect char and short work the same way.

So to add unsigned byte support you would need a new unsigned byte
array load (ubload) and a new i2ub, which just does an & 0xff. You
would also need to add a code for unsigned byte in method signatures
in the class files.

You could do it with even fewer changes to the JVM by having the
compiler insert an iand 0xff after every byte array load, and do an
iand 0xff prior to an unsigned byte store, rather than the usual i2b.
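
In source terms that strategy is just the familiar masking idiom,
something like this (a sketch of what the compiler would effectively
emit for a hypothetical ubyte):

    byte[] a = { (byte) 0xC8 };      // the unsigned value 200, stored as -56
    int u = a[0] & 0xff;             // baload followed by iand 0xff: u == 200
    u = u + 5;                       // plain int arithmetic on the stack
    a[0] = (byte) u;                 // keep only the low 8 bits before bastore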

I wonder if anyone more familiar with the JVM internals could confirm
this. If you are a newbie wanting to get your feet wet with
understanding the JVM, you could write some code and disassemble it
with javap to see how arithmetic with bytes, chars, shorts, ints and
longs is compiled.
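
For example, compiling a class like this and running javap -c on it
shows the i2b at work; the bytecode in the comment is what I would
expect to see, not gospel:

    // javac ByteMath.java && javap -c ByteMath
    class ByteMath {
        static byte add(byte a, byte b) {
            // expect: iload_0, iload_1, iadd, i2b, ireturn
            return (byte) (a + b);
        }
    }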

See http://mindprod.com/jgloss/jasm.html
for information on how the JVM works.
 
Mark Bottomley

Roedy Green said:
I poked around for a short while in the goldfish bowl book to see what
would need to change in the JVM if an unsigned byte type were added to
Java. [...]

So to add unsigned byte support you would need a new unsigned byte
array load (ubload) and a new i2ub, which just does an & 0xff. [...]

You're right about the implementation internals. The stack is nominally
32 bits wide (other magic may happen on 64-bit platforms) and the
internal representation of all bytes, shorts and chars is as 32-bit
signed values. These shorter types are also stored in objects as 32-bit
entities by all implementations. The only time they are (possibly)
stored in a packed configuration is in aggregate mono-types (arrays and
strings).

The solution, as you noted, is to use iand to trim the values upon loading.
The existing baload/bastore are sufficient to support unsigned bytes, as
the store only truncates the current 32-bit value (i2b is not necessary).
The load is sign extended for bytes and shorts (not chars), so an iand
with 0xff is sufficient to restore the original representation. You may
also need to do the same trimming if you perform long chains of
arithmetic where overflow into negative values may cause problems. Again,
all internal operations on these values are done as 32 bits, and
individual instance and static fields of these types are typically
implemented as 32-bit slots.
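
A sketch of that pattern in today's Java, with the mask undoing
baload's sign extension inside the loop:

    // treat a byte[] as a sequence of unsigned bytes
    static int unsignedSum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum += b & 0xff;    // iand 0xff undoes the sign extension from baload
        }
        return sum;             // re-mask with & 0xff only if the result must
                                // itself wrap as an unsigned byte
    }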

Mark...
 
Thomas Hawtin

Mark said:
You're right about the implementation internals. The stack is nominally
32 bits wide (other magic may happen on 64-bit platforms) and the
internal representation of all bytes, shorts and chars is as 32-bit
signed values. These shorter types are also stored in objects as 32-bit
entities by all implementations. The only time they are (possibly)
stored in a packed configuration is in aggregate mono-types (arrays and
strings).

You are obviously far more knowledgeable in this area than I am, but in
the interests of having my misapprehensions corrected, and at the risk
of making myself look a fool:

The JVM spec is clearly very 32-bit-centric. I was under the impression
that for the Sun J2SE 1.4.2 JVM the 64-bit handling was integrated into
the source core. As part of that work instance fields were (lightly)
packed, in a deterministic way. That was my fallible memory, anyway.

I'm interested to see you mention storing strings in a packed
configuration. It seems for such a common object that giving it a
special array-like structure could reduce memory requirements and
improve performance. Do many J2ME implementations do that? IIRC, JStar
can use the modified-UTF-8 data within class files directly.



If unsigned integers were added to the language (far too much
complication to the type system, IMO), then I would imagine the
specification would be changed using similar tricks to generics. You
don't actually need to be so invasive into bytecode.

Obviously you can handle unsigned values inelegantly in Java today. Most
operations could compile to the same bytecode. There is no need for a
1-1 mapping from Java operations to bytecode instructions.
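
For instance, two's-complement add and multiply already produce the
correct low-order bits whether you read the operands as signed or
unsigned, so a hypothetical ubyte could reuse the plain int
instructions (a sketch, not anything spec-blessed):

    int a = 200, b = 100;            // unsigned byte values held in ints
    int sum  = (a + b) & 0xff;       // 300 wraps to 44, the right unsigned result
    int prod = (a * b) & 0xff;       // 20000 & 0xff == 32, also right
    // comparison and division are the operations that genuinely differ
    // by signedness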

Methods are awkward. Either add extra letters to the signature
vocabulary, or take the generics route and overload based on the erased
type with extra compiler/reflection information in class file attributes.

newarray is awkward. Superficially, creation of unsigned arrays could be
delegated to an API function with a native implementation. However,
disruptive changes would be needed to the JVM fundamentals in order to
support run-time type identification of the new primitive array types.
Again, the issue could be ignored by using a form of erasure.

Tom Hawtin
 
Mark Bottomley

Thomas Hawtin said:
The JVM spec is clearly very 32-bit-centric. I was under the impression
that for the Sun J2SE 1.4.2 JVM the 64-bit handling was integrated into
the source core. [...]

I'm interested to see you mention storing strings in a packed
configuration. [...] Do many J2ME implementations do that? [...]

Obviously you can handle unsigned values inelegantly in Java today. [...]

Tom:

I cannot speak to what Sun does; I'm more familiar with IBM's most
recent VMs. The 64-bit-ness usually means that the stacks are 64 bits
wide and both 32- and 64-bit entries now take only one stack slot each.
Local variables are usually implemented the same way (but with no
savings found): on 32-bit a double/long occupies locals n and n+1,
while on 64-bit it occupies only local n, and local n+1 is unused. This
is transparent to the programmer, as verification makes sure that
doubles/longs are not accessed in a split manner, so code can't even
find out it is being done. Trimming of 32-bit entries stored in 64-bit
slots is not generally necessary, except for the output routines and
the sign extend on array loads (iaload).

The packing of 32-bit entries into 64-bit chunks of an object is
implementation-dependent - a space optimization.

As for strings, I may have confused you. They arrive in the class files
as UTF-8s. Implementations may keep string caches to remove redundancy
across class files ("interning"): the signature "()V" will occur in
many class files, as will "java/lang/Object", so creating a common
store is usually a win. The UTF-8 strings are converted to arrays of
16-bit Unicode characters when they are manipulated at the Java level.
When resolving methods and fields, the comparisons are usually carried
out on the UTF-8s. There is probably more to be gained by common
sub-string removal: "java/lang/Object" and "java/lang/Throwable" have
"java/lang/" in common. Some obfuscators and optimizers do common
sub-string compression to take advantage of this, but that is a
different problem.
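
The same idea is visible at the Java level through String.intern(),
which folds equal strings into one shared copy:

    String a = new String("java/lang/Object");   // a fresh object on the heap
    String b = "java/lang/Object";               // the pooled literal
    System.out.println(a == b);                  // false: two distinct objects
    System.out.println(a.intern() == b);         // true: both now the common copy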

As for unsigned usage, you can get most of the features by performing
the previously mentioned "anding" with a constant of the desired size
while performing the math in a larger size: unsigned 32-bit values
could be created and manipulated in a signed 64-bit memory slot and
stored/loaded in a 32-bit array, with anding upon load. You would need
to add a method to output the unsigned numbers from the signed
representation. For really big numbers, IIRC there is a BigInteger
class, but I don't know the details.
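
A sketch of that scheme for unsigned 32-bit values carried in signed
longs (the helper names are made up for illustration):

    static long loadU32(int[] a, int i) {
        return a[i] & 0xFFFFFFFFL;   // mask off the sign extension from iaload
    }
    static void storeU32(int[] a, int i, long v) {
        a[i] = (int) v;              // truncation keeps the low 32 bits
    }
    static String printU32(int v) {  // the output method mentioned above
        return Long.toString(v & 0xFFFFFFFFL);
    }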

This stuff could be done as a Java compiler modification; the difficulty
would be the conditionals, as they assume signed numbers. It would
require some bytecode extensions to be as efficient as native signed
32-bit.
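
The standard workaround for the signed conditionals is to flip the sign
bit before comparing, which maps unsigned order onto the signed
if_icmplt family at the cost of two extra xors per compare:

    static boolean unsignedLess(int a, int b) {
        return (a ^ Integer.MIN_VALUE) < (b ^ Integer.MIN_VALUE);
    }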

In conclusion, it is possible to support unsigned numbers, but it would
not be very efficient and may not add great value (though there are ~53
unassigned byte codes in the JVM spec ;-) )

Mark...
 
Thomas Hawtin

Mark said:
This stuff could be done as a Java compiler modification; the difficulty
would be the conditionals, as they assume signed numbers. It would
require some bytecode extensions to be as efficient as native signed
32-bit.

That only really counts when bytecode is interpreted. Even then it can
be rewritten betwixt downloading and execution.

In conclusion, it is possible to support unsigned numbers, but it would
not be very efficient and may not add great value (though there are ~53
unassigned byte codes in the JVM spec ;-) )

And some of those get used internally. An interesting one just added to
Mustang puts a finalisable object straight on the finaliser queue,
entirely skipping the garbage collector and reference handler.

Tom Hawtin
 
