Bytecode Instructions
Bytecode Definition
A bytecode is a sequence of bytes. Each instruction consists of one opcode that is followed by zero or more operands.
Let's take a look at a simple example: we push two bytes onto the stack and add them together.
The code would look like this:
# Stack []
bpush 1
# Stack [1]
bpush 2
# Stack [1, 2]
badd
# Stack [3]
The bytecode would look like this (byte values are written in hexadecimal):
01 01 01 02 10
Let's group the bytes to make them more readable:
01 01
01 02
10
We start the interpretation at the first byte. It is the opcode bpush. The interpreter knows that the opcode bpush is always followed by one operand, so it reads the next byte and interprets it as the operand. The instruction tells it to push this byte onto the stack. The pointer is now at the third byte. Since the first instruction is finished, the next byte is interpreted as an opcode again.
(The next instruction is bpush again, so the interpreter does basically the same as for the first instruction.)
After the second bpush instruction, the pointer is at the fifth byte. This byte is the opcode badd. The interpreter knows that the opcode badd is not followed by any operand. The instruction tells it to add the two bytes on top of the stack together and push the result onto the stack. The pointer is incremented again. This bytecode is incomplete: the pointer is now at the end of the bytecode, but there is no RET instruction, so the interpreter will throw an error. For this example, however, that is enough.
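The walkthrough above can be sketched as a small fetch-and-decode loop. This is only an illustrative sketch, not the actual interpreter: the opcode values 0x01 (bpush) and 0x10 (badd) come from the example bytes, while the function name run, the wrap-around to 8 bits on overflow, and the choice to simply stop at the end of the bytecode (instead of raising an error for the missing RET) are assumptions made for this sketch.

```python
BPUSH = 0x01  # followed by one operand byte (value taken from the example)
BADD = 0x10   # followed by no operands (value taken from the example)


def run(code: bytes) -> list[int]:
    """Interpret `code` and return the final stack (hypothetical helper)."""
    stack: list[int] = []
    pc = 0  # the "pointer": index of the next byte to interpret
    while pc < len(code):
        opcode = code[pc]
        pc += 1
        if opcode == BPUSH:
            if pc >= len(code):
                raise ValueError("bpush is missing its operand")
            stack.append(code[pc])  # the operand byte is pushed as-is
            pc += 1
        elif opcode == BADD:
            if len(stack) < 2:
                raise ValueError("badd needs two bytes on the stack")
            right = stack.pop()
            left = stack.pop()
            # Assumption: results wrap around to 8 bits.
            stack.append((left + right) & 0xFF)
        else:
            raise ValueError(f"unknown opcode 0x{opcode:02x} at offset {pc - 1}")
    # Assumption: the sketch stops here instead of demanding a RET instruction.
    return stack


print(run(bytes.fromhex("01 01 01 02 10")))  # [3]
```

The important detail is that the opcode alone decides how many of the following bytes are consumed as operands before the next opcode is read.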
Keep in mind that mistakes in the bytecode can lead to undefined behavior. If a byte is missing, value bytes can be interpreted as opcodes and vice versa. This is a serious problem and can lead to security issues, especially when operand bytes can be modified at runtime. So be careful when you write your own bytecode (or use tools instead of writing the bytes by hand).
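To make the misalignment problem concrete, here is a small sketch of a decoder that only lists instructions. It is a hypothetical helper (disassemble and the two lookup tables are names invented for this illustration); only the opcode values and operand counts are taken from the example above. Dropping the operand of the second bpush makes the decoder silently treat the badd opcode as a value.

```python
# Opcode values and operand counts as used in the example above.
OPERAND_COUNT = {0x01: 1, 0x10: 0}
NAMES = {0x01: "bpush", 0x10: "badd"}


def disassemble(code: bytes) -> list[str]:
    """List the instructions in `code` without executing them."""
    out = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode not in OPERAND_COUNT:
            out.append(f"?? 0x{opcode:02x}")  # byte is not a known opcode
            pc += 1
            continue
        n = OPERAND_COUNT[opcode]
        operands = code[pc + 1 : pc + 1 + n]
        out.append(" ".join([NAMES[opcode], *map(str, operands)]))
        pc += 1 + n
    return out


good = bytes.fromhex("01 01 01 02 10")
broken = bytes.fromhex("01 01 01 10")  # operand of the second bpush is missing

print(disassemble(good))    # ['bpush 1', 'bpush 2', 'badd']
print(disassemble(broken))  # ['bpush 1', 'bpush 16']
```

In the broken sequence no error is reported at all: the byte 0x10 is consumed as the operand 16, and the addition never happens.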
In the following sections we will define the opcodes and the operands that follow them.
Stack and variable manipulation
The stack here refers to a stack of 8-bit values. We can only put an 8-bit value (further referred to as a byte, not to be confused with the data type byte) on top of the stack, and we can remove the topmost byte from the stack. We will refer to the topmost byte as the top/head of the stack. The stack is a LIFO (last in, first out) data structure.
The stack is our main tool to manipulate data. We can use an instruction to push a constant onto the stack and then push another constant. Now we can use an add instruction to add the two constants together. The stack then no longer contains the two constants, only the result of the addition. This looks like this:
# Stack []
bpush 1
# Stack [1]
bpush 2
# Stack [1, 2]
badd
# Stack [3]
Be aware that the stack is a stack of 8-bit values. A short consists of 16 bits, so pushing a short puts 2 bytes on the stack. Here is the above example with shorts:
# Stack []
spush 1
# Stack [0, 1]
spush 2
# Stack [0, 1, 0, 2]
sadd
# Stack [0, 3]
In this way we could also combine two bytes into a short, or we could use only one spush to push two bytes onto the stack. This example performs the same operation with bytes, but using spush instead of bpush:
# Stack []
spush 0x0102
# Stack [1, 2]
badd
# Stack [3]
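The relationship between shorts and bytes on the stack can be sketched as follows. The byte order (high byte pushed first) is read off the stack traces above, where spush 1 yields [0, 1]. The Python function names, the helper spop, and the wrap-around on overflow are assumptions made for this sketch and are not part of the instruction set definition.

```python
def spush(stack: list[int], value: int) -> None:
    """Push a 16-bit value as two bytes, high byte first."""
    stack.append((value >> 8) & 0xFF)
    stack.append(value & 0xFF)


def spop(stack: list[int]) -> int:
    """Pop two bytes and combine them into one 16-bit value (hypothetical helper)."""
    low = stack.pop()
    high = stack.pop()
    return (high << 8) | low


def sadd(stack: list[int]) -> None:
    right = spop(stack)
    left = spop(stack)
    spush(stack, (left + right) & 0xFFFF)  # assumption: wrap to 16 bits


def badd(stack: list[int]) -> None:
    right = stack.pop()
    left = stack.pop()
    stack.append((left + right) & 0xFF)  # assumption: wrap to 8 bits


stack: list[int] = []
spush(stack, 1)
spush(stack, 2)
print(stack)  # [0, 1, 0, 2]
sadd(stack)
print(stack)  # [0, 3]

stack = []
spush(stack, 0x0102)  # one spush leaves two separate bytes on the stack
print(stack)  # [1, 2]
badd(stack)
print(stack)  # [3]
```

The stack itself carries no type information: whether two adjacent bytes are one short or two independent bytes depends only on the instruction that consumes them.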
Now we will look at the local variable table. The local variable table is a table of 8-bit values. We can store a byte in the local variable table and we can load a byte from the local variable table. Because it is a table of bytes, each position holds exactly one byte. If we want to store a 16-bit value, we have to store it in two indices. So a short occupies two bytes, an int four, and so on.
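As a rough illustration of how a 16-bit value spreads over two indices, consider the sketch below. Everything in it is hypothetical: the helper names, the table size of 8, and the byte order inside the table (high byte at the lower index, mirroring the stack layout shown earlier) are assumptions, since the text above does not fix them.

```python
def store_byte(table: list[int], index: int, value: int) -> None:
    """Store one byte at one position of the table."""
    table[index] = value & 0xFF


def store_short(table: list[int], index: int, value: int) -> None:
    """Store a 16-bit value in two consecutive positions (assumed: high byte first)."""
    table[index] = (value >> 8) & 0xFF
    table[index + 1] = value & 0xFF


def load_short(table: list[int], index: int) -> int:
    """Rebuild a 16-bit value from two consecutive positions."""
    return (table[index] << 8) | table[index + 1]


local_vars = [0] * 8  # assumed table size, for illustration only

store_byte(local_vars, 0, 7)          # occupies index 0
store_short(local_vars, 1, 0x1234)    # occupies indices 1 and 2
print(local_vars[:4])                 # [7, 18, 52, 0]
print(hex(load_short(local_vars, 1))) # 0x1234
```

A consequence of this layout is that the next free index after a short is two positions further on, and after an int it would be four.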