Tiny VM in C

BartC

Maxwell Bernstein said:
On 2/20/14, 3:51 AM, BartC wrote:
OK, I've just looked at your project. I assume you're referring to this:

typedef union carp_argument {
unsigned int i;
carp_register r;
long long ll;
char s[CARP_NAME_LENGTH];
} carp_argument;

typedef struct carp_command {
carp_instruction instr;
carp_argument args[CARP_NUM_ARGS];
} carp_command;

That's fine if it's what you want to do. Provided you realise that the
size of the carp_argument union will be that of the largest item, in
this case the 32-byte name 's'.

And since you have 3 of these in carp_command, each instruction
will be about 100 bytes long.

I know I said, since I think each command was originally packed into a
16-bit value, that it wasn't necessary to pack things that tightly and
that you could make life a bit easier. But this is going to the other extreme!

I see what you mean; things are rather large.
But also, this is now unlike any real machine. I'm not clear what 's' is
for, but references to strings are usually handled by a pointer to a
region of memory that contains the string. (And if 's' is meant to be
the name of a variable, then in a real machine, names aren't used at
all, they are symbols in the source code (of assembler etc) that are
replaced in the machine code by addresses or offsets of the area in
memory where the variable resides.)

How would you recommend I deal with pointers? How would that be different?

For strings and names, you can just use char *s instead of char s[...]. Then
the size of that pointer will be (most likely with a 32-bit compiler) 4
bytes. The name will then be stored elsewhere. But the initialisation of the
string can be the same. (With unions, it's a bit uncertain how the thing is
initialised anyway; I think it uses the type of the first member of the
union.)

But being a VM, it can implement strings and names as it likes, including
using, instead, an index into a table of names (a table of char* in
reality). Then you don't really need 's', but can just use 'i'. A char* will
do however.

(Since this doesn't appear to relate to a real machine any more, I'll
briefly describe a VM I use, which implements the byte-code of a language.
The bytecode is represented as a linear array of int values, and might look
something like this:

C C X C C X Y C X Y Z ....

Where C is a command (an opcode), and X, Y and Z are the first, second and
third operands. So 'instructions' are variable length, but each part fits
into a 32-bit int value.

Each operand can be one of several kinds: when it is a 32-bit int, then it
is directly stored in X, Y or Z. Otherwise it will be an index into a table,
or sometimes even a pointer direct to a variable for example.

To deal with the different interpretations of X, Y or Z, C casts are
used as necessary (rather than unions; but if using pointers, you need to be
sure they will fit into an int! So indices are the best bet.)

The handler of each C opcode will of course need to know what type each
operand is; also how many operands there are, to allow it to properly step
the PC to the next bytecode. This is common also in real processors where
instructions are of varying length.)
 
Maxwell Bernstein

Looks like I can assign the string to the pointer *after the fact*, but
not assign it inline with {curly brackets}.
 
Maxwell Bernstein

For strings and names, you can just use char *s instead of char s[...]. Then
the size of that pointer will be (most likely with a 32-bit compiler) 4
bytes. The name will then be stored elsewhere. But the initialisation of the
string can be the same. (With unions, it's a bit uncertain how the thing is
initialised anyway; I think it uses the type of the first member of the
union.)

When I am initializing the union, I get the following warning
(attached), and then a segfault when I run the code.

λ chaos tidbits → gcc pointer_union.c
pointer_union.c:10:19: warning: incompatible pointer to integer conversion
initializing 'int' with an expression of type 'char [6]'
[-Wint-conversion]
union Data d = {"hello"};
^~~~~~~
1 warning generated.
But being a VM, it can implement strings and names as it likes, including
using, instead, an index into a table of names (a table of char* in
reality). Then you don't really need 's', but can just use 'i'. A char* will
do however.

I have a hash table... hmm.
 
James Kuyper

Looks like I can assign the string to the pointer *after the fact*, but
not assign it inline with {curly brackets}.

char *p = "This is initialization";
p = "This is assignment.";

Curly brackets come in when initializing aggregate types (arrays or
structures):

char array1[] = {'H', 'e', 'l', 'l', 'o', ' ', '\0'};

But for character types it's equivalent, and simpler, to use strings:

char array2[] = "world!\n";
 
BartC

(With unions, it's a bit uncertain how the thing is
initialised anyway; I think it uses the type of the first member of the
union.)

When I am initializing the union, I get the following warning
(attached), and then a segfault when I run the code.

λ chaos tidbits → gcc pointer_union.c
pointer_union.c:10:19: warning: incompatible pointer to integer conversion
initializing 'int' with an expression of type 'char [6]'
[-Wint-conversion]
union Data d = {"hello"};
^~~~~~~
1 warning generated.

This was what I meant. It's looking for an int value (the first member of
the union). If you still have your original union, then it might store some
letters from "hello", without a zero terminator, which may well cause the
segfault.

A char* here would be easier to initialise; but you might have to put an
(int) cast in front of it. However, this will fail on a system where a char*
is wider than an int. It gets messy! Another way is to have the long long
member first, then cast to that.

Unions don't work well when you want to initialise certain fields.
 
James Kuyper

For strings and names, you can just use char *s instead of char s[...]. Then
the size of that pointer will be (most likely with a 32-bit compiler) 4
bytes. The name will then be stored elsewhere. But the initialisation of the
string can be the same. (With unions, it's a bit uncertain how the thing is
initialised anyway; I think it uses the type of the first member of the
union.)

There's no uncertainty: the provided initializer is used to initialize
the first member of the union.
When I am initializing the union, I get the following warning
(attached), and then a segfault when I run the code.

λ chaos tidbits → gcc pointer_union.c
pointer_union.c:10:19: warning: incompatible pointer to integer conversion
initializing 'int' with an expression of type 'char [6]'
[-Wint-conversion]
union Data d = {"hello"};
^~~~~~~

It would appear that the first member of "union Data" has the type
'int'. Since that is not an array of character type, the string gets
converted to a pointer to the first element of the array, and there's no
implicit conversion from that pointer to int (though you could put one
in explicitly by using a cast, if that were what you wanted to do).

In C90, you should put the field you want to initialize most often at
the beginning of the union. To initialize any other member of the union,
you'd have to do so by assignment, not by initialization.

In C99, designated initializers were added. Among other things, these
allow you to initialize members of a union other than the first one:

union Data d = {.greeting = "hello"};

If greeting is a pointer, this is equivalent to

union Data d;
d.greeting = "hello";

If greeting is an array of char, it's more accurately equivalent to

union Data d;
strncpy(d.greeting, "hello", sizeof d.greeting);
 
BartC

James Kuyper said:
For strings and names, you can just use char *s instead of char s[...]. Then
the size of that pointer will be (most likely with a 32-bit compiler) 4
bytes. The name will then be stored elsewhere. But the initialisation of the
string can be the same. (With unions, it's a bit uncertain how the thing is
initialised anyway; I think it uses the type of the first member of the
union.)

There's no uncertainty: the provided initializer is used to initialize
the first member of the union.
When I am initializing the union, I get the following warning
(attached), and then a segfault when I run the code.

λ chaos tidbits → gcc pointer_union.c
pointer_union.c:10:19: warning: incompatible pointer to integer conversion
initializing 'int' with an expression of type 'char [6]'
[-Wint-conversion]
union Data d = {"hello"};
^~~~~~~

It would appear that the first member of "union Data" has the type
'int'. Since that is not an array of character type, the string gets
converted to a pointer to the first element of the array, and there's no
implicit conversion from that pointer to int (though you could put one
in explicitly by using a cast, if that were what you wanted to do).

So in the first few bytes of the char array, it gets a pointer to the string
(instead of the first few letters)?

Suppose the char array was the first member; how easy would it be to
initialise the other members through casts?
 
James Kuyper

James Kuyper said:
On 02/20/2014 04:11 PM, Maxwell Bernstein wrote: ....
pointer_union.c:10:19: warning: incompatible pointer to integer conversion
initializing 'int' with an expression of type 'char [6]'
[-Wint-conversion]
union Data d = {"hello"};
^~~~~~~

It would appear that the first member of "union Data" has the type
'int'. Since that is not an array of character type, the string gets
converted to a pointer to the first element of the array, and there's no
implicit conversion from that pointer to int (though you could put one
in explicitly by using a cast, if that were what you wanted to do).

So in the first few bytes of the char array, it gets a pointer to the string
(instead of the first few letters)?


If the char array you're referring to is another member of the same union,
and sufficiently long, and if 'int' is large enough to store the
complete representation of a pointer, and if the conversion from pointer
to int is defined to be representation-conserving, then what you say
would be true. The first two conditions are matters under your control.
The last two are under the implementation's control; and it's probably
not uncommon for both to be true - but the standard imposes no such
requirement.

I certainly wouldn't recommend writing code to rely upon those
assumptions. If you wanted to do something like that, put either an
intptr_t member or an actual pointer member in the union, and use it
directly.
Suppose the char array was the first member; how easy would it be to
initialise the other members through casts?

Without designated initializers, you can't initialize anything other
than the first member. However, by initialization of that member with a
carefully chosen character string, you can guarantee the value that
would be read if any of the other members were accessed - but this
requires use of implementation-specific information about how that
member is represented, and the resulting code would not be portable
anywhere where that information didn't apply.

For instance, assuming CHAR_BIT == 8 and sizeof(unsigned long)==4,

union {
char c4[4];
unsigned long ul;
} data = {"\001\002\003\004"};

is likely to give data.ul a value of either 0x01020304 or 0x04030201,
though there have been popular machines where other values might be
seen: 0x02010403 and 0x03040102 being two of the most popular alternatives.

I would NOT recommend this approach, due to being both obscure and
non-portable. However, I've known a fair number of people who would like
the idea a lot better than I do.
 
Keith Thompson

Maxwell Bernstein said:
When I am initializing the union, I get the following warning
(attached), and then a segfault when I run the code.

λ chaos tidbits → gcc pointer_union.c
pointer_union.c:10:19: warning: incompatible pointer to integer conversion
initializing 'int' with an expression of type 'char [6]'
[-Wint-conversion]
union Data d = {"hello"};
^~~~~~~
1 warning generated.

A union initializer that doesn't name a member defaults to the first
member, which apparently is of type int in your case.

As of C99, you can use a designated initializer:

union Data d = { .foo = "hello" };
 
Maxwell Bernstein

char *p = "This is initialization";
p = "This is assignment.";

Sorry, I meant initialize.
Curly brackets come in when initializing aggregate types (arrays or
structures):

char array1[] = {'H', 'e', 'l', 'l', 'o', ' ', '\0'};

But for character types it's equivalent, and simpler, to use strings:

char array2[] = "world!\n";

I have it in a union; that is giving me trouble.
 
Maxwell Bernstein

This was what I meant. It's looking for an int value (the first member of
the union). If you still have your original union, then it might store some
letters from "hello", without a zero terminator, which may well cause
the segfault.

A char* here would be easier to initialise; but you might have to put an
(int) cast in front of it. However, this will fail on a system where a char*
is wider than an int. It gets messy! Another way is to have the long long
member first, then cast to that.

Unions don't work well when you want to initialise certain fields.

Ah, I see. I do have a char* but it is freaking out :-/ Darn.
 
Maxwell Bernstein

In C99, designated initializers were added. Among other things, these
allow you to initialize members of a union other than the first one:

union Data d = {.greeting = "hello"};

If greeting is a pointer, this is equivalent to

union Data d;
d.greeting = "hello";

If greeting is an array of char, it's more accurately equivalent to

union Data d;
strncpy(d.greeting, "hello", sizeof d.greeting);

Ah, so is there no easy way to "not care" about the type on initialization?
 
Maxwell Bernstein

It would appear that the first member of "union Data" has the type
So in the first few bytes of the char array, it gets a pointer to the
string (instead of the first few letters)?

Suppose the char array was the first member; how easy would it be to
initialise the other members through casts?

That would be interesting.
 
BartC

Maxwell Bernstein said:
Ah, so is there no easy way to "not care" about the type on
initialization?

The C99 method given at the top will work; the other examples are what it is
equivalent to, not what you have to code.

In your case, it would be {.s="Hello"}, which isn't too bad.

Alternatively, you could separate the string argument from the other three,
which can still be in a union. It will add four bytes or so to an argument,
but then you were earlier using 32 bytes for one.

The argument becomes roughly struct {union {i, r, ll}; s}, and its
initialisation might be: {0, "Hello"} for a string (and {N, NULL} for
numeric). The other three all being int types of some kind, initialising any
of them with an int constant is simpler (but I'd put ll first, to avoid
problems with setting the top half of ll).

But there are many different ways of doing this. I understand the problem is
finding a painless way of writing code sequences by hand, using C data
initialisation.
 
Keith Thompson

Maxwell Bernstein said:
That would be interesting.

[Please don't delete attribution lines when you post a followup.]

Using casts to initialize other union members would almost certainly be
a bad idea.

Suppose you have a union like:

union u {
int n;
char *s;
};

You can write:

union u obj = { 42 };

and it will initialize obj.n to 42, because n is the first member.
You can't use the same syntax to initialize obj.s.

I think BartC is suggesting something like:

union u obj = { (int)"hello" };

That *might* appear to work. It takes a char* pointer value, converts
it to int, and stores it in the int member of the union. Can you then
access the char* member of the same union and expect it to point to the
string "hello"? The C standard certainly doesn't guarantee that.

A cast converts a value from one type to another; it doesn't just
reinterpret the representation. Conversions between one pointer type and
another, or between a pointer and an integer of the same size,
*commonly* do just that, but you shouldn't depend on it.

A clearer example: if a union contains an integer and a floating-point
value, then a cast *certainly* doesn't just reinterpret the
representation; it converts a number from one representation to another.

There's no need to use a cast like that. If your compiler supports
designated initializers (standard since C99), you can directly
initialize any member you want. If not, you can just assign to that
member.
 
Maxwell Bernstein

But there are many different ways of doing this. I understand the
problem is finding a painless way of writing code sequences by hand,
using C data initialisation.

What's the best and painless way to go about this? I'd like to have the
simplest code possible.
 
BartC

Maxwell Bernstein said:
What's the best and painless way to go about this? I'd like to have the
simplest code possible.

The approach using the C99 method seems reasonable enough (and you seem to
have adopted this method in your latest code).

But if you're going to be writing a lot of code for this VM, you might want
to start looking at some sort of assembly language for it. Then the code
will change from looking like this:

{CARP_INSTR_LOADI, {{.r=CARP_REG0}, {.ll=0}}}, // loadi r0 0

to just this:

loadi r0,0

Clearly much shorter and simpler, and you don't need the comment! A language
will also take care of labels (I think at present you have to count
instructions and insert the index, that makes mods much harder).

The trouble is, an assembler is a *lot* of work, probably bigger than your
project at the moment. So forgetting that for the time being, you can at
least shorten some of the names; the above line can be:

{CI_LOADI, {{.r=REG0}, {.ll=0}}}, // loadi r0 0

(I'm not familiar with how the different arguments are used, but I might
lose the distinction between .i and .r at least, maybe also .ll; just have a
single long long integer argument (as well as the char*), and have this
first in the union, so that most lines don't need a prefix, and this example
becomes:

{CI_LOADI, {{REG0}, {0}}},

This is now clear enough that you can dispense with the comment! A shame
about the inner {,}, that's because these are still unions.)

For labels, it's not clear what you do now, but I might introduce a new
command to define a label:

{CI_LABEL, {{.s="loop"}}}, // loop:

and then you can use this label as:

{CI_JUMP, {{.s="loop"}}}, // jump loop

Some simple pre-processing can then convert label references such as "loop",
to the index of the CI_LABEL command. (And perhaps can also remove the
CI_LABEL, which is otherwise a NOP, although this is tricky as all indices
will change too).

(I've used strings for the labels, but you can just use integers too, and
the pre-processing is simpler:

{CI_LABEL, {{100}}}, // L100:
{CI_JUMP, {{100}}}, // jump L100 )
 
Ben Bacarisse

BartC said:
The approach using the C99 method seems reasonable enough (and you seem to
have adopted this method in your latest code).

But if you're going to be writing a lot of code for this VM, you might want
to start looking at some sort of assembly language for it. Then the code
will change from looking like this:

{CARP_INSTR_LOADI, {{.r=CARP_REG0}, {.ll=0}}}, // loadi r0 0

An alternative is to use the pre-processor. You can use the ## token
joining operator to write this:

#define I2(op,rn,v) \
{CARP_INSTR_##op, {{.r=CARP_REG##rn}, {.ll=(v)}}}

I2(LOADI, 0, 0)

You'd need variations on this theme depending on what sort of
instruction is being generated. Probably good enough for simple
testing.

<snip>
 
Maxwell Bernstein

The approach using the C99 method seems reasonable enough (and you seem to
have adopted this method in your latest code).

Yeah, I took your suggestion.
Then the code will change from looking like this:

{CARP_INSTR_LOADI, {{.r=CARP_REG0}, {.ll=0}}}, // loadi r0 0

to just this:

loadi r0,0

Indeed. I plan on writing a lexer & parser soon. The comments are for
other people at the moment, or future me.
Clearly much shorter and simpler, and you don't need the comment! A language
will also take care of labels (I think at present you have to count
instructions and insert the index, that makes mods much harder).

At the moment, I just use the array index.
The trouble is, an assembler is a *lot* of work, probably bigger than your
project at the moment. So forgetting that for the time being, you can at
least shorten some of the names; the above line can be:

{CI_LOADI, {{.r=REG0}, {.ll=0}}}, // loadi r0 0

(I'm not familiar with how the different arguments are used, but I might
lose the distinction between .i and .r at least, maybe also .ll; just have a
single long long integer argument (as well as the char*), and have this
first in the union, so that most lines don't need a prefix, and this example
becomes:

{CI_LOADI, {{REG0}, {0}}},

Yes and no; I'd like to keep the prefixes so that the namespace is
decently clean. I'm not bothered by the length so much. The difference
between .r and .i is that .i is used explicitly for integer operations.
I have since removed its only use, and the field itself, from the union. I use the .r to
differentiate between "register" and "value". As far as I can tell, it
does not take up any more space in the union.
This is now clear enough that you can dispense with the comment! A shame
about the inner {,}, that's because these are still unions.)

For labels, it's not clear what you do now, but I might introduce a new
command to define a label:

{CI_LABEL, {{.s="loop"}}}, // loop:

and then you can use this label as:

{CI_JUMP, {{.s="loop"}}}, // jump loop

I do like this idea, though my first thoughts are:
a) why not just have a hash table?
b) why not just use numbers?
Some simple pre-processing can then convert label references such as "loop",
to the index of the CI_LABEL command. (And perhaps can also remove the
CI_LABEL, which is otherwise a NOP, although this is tricky as all indices
will change too).

(I've used strings for the labels, but you can just use integers too, and
the pre-processing is simpler:

{CI_LABEL, {{100}}}, // L100:
{CI_JUMP, {{100}}}, // jump L100 )

Aha.
 
Maxwell Bernstein

An alternative is to use the pre-processor. You can use the ## token
joining operator to write this:

#define I2(op,rn,v) \
{CARP_INSTR_##op, {{.r=CARP_REG##rn}, {.ll=(v)}}}

I2(LOADI, 0, 0)

You'd need variations on this theme depending on what sort of
instruction is being generated. Probably good enough for simple
testing.

<snip>

Well damn, I didn't think of that.
 
