
X-Date: 2023-07-28T21:00:00Z
X-Note-Id: 67c55b15-462a-4bfd-8b6c-277535615938
Subject: Simple VM for dynamic languages
X-Slug: simple_dynamic_language_vm

A few weeks ago, I started working on a Lisp interpreter. I had already written a few Lisp implementations in different
languages, but those were mostly just recursive evaluators. This time, it's a bit more serious.

Instead of writing the implementation top-to-bottom, I started with a virtual machine. Most scripting languages run on
virtual machines, since bytecode is compact and fast to evaluate. Compared to a tree-walker,
bytecode VMs are better for branch prediction and friendlier to the CPU cache.

Because the VM is very barebones at the moment, and no language is written on top of it yet, I created a very simple assembly
language. Here's an example that computes a factorial in it:

```
li r1, 10
li r2, 1
li r0, 1

factorial:
    mul r0, r0, r2
    addi r2, r2, 1
    jle r2, r1, factorial
```

This code computes `10!` and returns it in the `r0` register.
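
To make the semantics concrete, here's a minimal Python sketch of an interpreter for just the four instructions used above. It is not the actual VM: it assumes the destination register comes first, that `jle` means "jump if less than or equal", and it uses plain Python integers with no tagging or fixed widths. The `run` function and the `FACTORIAL` constant are names made up for this illustration.

```python
def run(program):
    """Interpret a tiny subset of the assembly above: li, mul, addi, jle."""
    # First pass: strip labels and remember which instruction each one points to.
    labels, instrs = {}, []
    for line in program.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            labels[line[:-1]] = len(instrs)
            continue
        op, _, rest = line.partition(" ")
        instrs.append((op, [arg.strip() for arg in rest.split(",")]))

    # Second pass: execute. Assumed semantics, destination register first.
    regs = {f"r{i}": 0 for i in range(32)}
    pc = 0
    while pc < len(instrs):
        op, args = instrs[pc]
        pc += 1
        if op == "li":        # li rd, imm        -> rd = imm
            regs[args[0]] = int(args[1])
        elif op == "mul":     # mul rd, ra, rb    -> rd = ra * rb
            regs[args[0]] = regs[args[1]] * regs[args[2]]
        elif op == "addi":    # addi rd, ra, imm  -> rd = ra + imm
            regs[args[0]] = regs[args[1]] + int(args[2])
        elif op == "jle":     # jle ra, rb, label -> jump if ra <= rb
            if regs[args[0]] <= regs[args[1]]:
                pc = labels[args[2]]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


FACTORIAL = """
li r1, 10
li r2, 1
li r0, 1

factorial:
    mul r0, r0, r2
    addi r2, r2, 1
    jle r2, r1, factorial
"""

print(run(FACTORIAL)["r0"])  # 3628800, i.e. 10!
```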

The architecture of the VM is "load/store", meaning that computation (addition, multiplication, comparisons, etc.) can only
be performed on registers. Data is moved between memory and registers with separate instructions. This contrasts
with how some VMs are implemented: for some reason, many of them don't use registers at all and instead rely only on the stack.
For example, this is what a factorial function looks like in Python bytecode:

```
  2     >>    0 LOAD_FAST                0 (N)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               2 (==)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_FAST               1 (result)
             15 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (N)
             19 LOAD_CONST               1 (1)
             22 BINARY_SUBTRACT
             23 LOAD_FAST                0 (N)
             26 LOAD_FAST                1 (result)
             29 BINARY_MULTIPLY
             30 STORE_FAST               1 (result)
             33 STORE_FAST               0 (N)
             36 JUMP_ABSOLUTE            0
```

If you look carefully, you'll notice that there are no registers here. This is because operations that produce a value
usually push it onto the top of the stack.
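
To show what "everything goes through the stack" means in practice, here's a minimal Python sketch of a stack interpreter. It only handles the five opcodes from the loop body in the listing above, and it simplifies their arguments: real CPython refers to names and constants by index, while this sketch uses the name or the value directly. `run_stack` and `LOOP_BODY` are hypothetical names for this illustration.

```python
def run_stack(code, local_vars):
    """Execute a straight-line sequence of (opcode, argument) pairs."""
    stack = []
    for op, arg in code:
        if op == "LOAD_FAST":          # push a local variable
            stack.append(local_vars[arg])
        elif op == "LOAD_CONST":       # push a constant
            stack.append(arg)
        elif op == "BINARY_SUBTRACT":  # pop two operands, push lhs - rhs
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs - rhs)
        elif op == "BINARY_MULTIPLY":  # pop two operands, push lhs * rhs
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs * rhs)
        elif op == "STORE_FAST":       # pop the top of the stack into a local
            local_vars[arg] = stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    return local_vars


# The loop body from the listing: N, result = N - 1, N * result
LOOP_BODY = [
    ("LOAD_FAST", "N"),
    ("LOAD_CONST", 1),
    ("BINARY_SUBTRACT", None),
    ("LOAD_FAST", "N"),
    ("LOAD_FAST", "result"),
    ("BINARY_MULTIPLY", None),
    ("STORE_FAST", "result"),
    ("STORE_FAST", "N"),
]

print(run_stack(LOOP_BODY, {"N": 5, "result": 1}))  # {'N': 4, 'result': 5}
```

Note that none of the arithmetic instructions name their operands. To know what `BINARY_MULTIPLY` multiplies, you have to replay the pushes and pops that came before it.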

There are a few problems I find with the stack machines:

- It is unnatural to read the disassembly (you have to keep track of changing stack offsets instead of register names)
- Either instructions waste space, or we deal with variable-width instructions (as in the Python bytecode above, where instructions take 1 or 3 bytes)
- Some potential for optimizations is wasted

In my personal opinion, a good virtual machine for a dynamic language should also be a suitable target for compiling regular
expression state machines.

So instead, I opted for a more traditional approach that is similar to a RISC CPU:

- 32-bit constant-width instructions
- flexible stack
- 32 registers, most of which are general-purpose, except for the frame pointer, instruction pointer, etc.
- 64-bit width for both registers and stack entries
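
As an illustration of why 32 registers fit comfortably into a fixed 32-bit instruction, here's a small Python sketch of one possible encoding. The field layout is purely an assumption for this example (8-bit opcode, three 5-bit register fields, remaining bits unused) and is not taken from the actual VM. The `MUL` opcode number is made up as well.

```python
OPCODE_BITS = 8  # assumed: up to 256 opcodes
REG_BITS = 5     # 2**5 == 32, enough to address 32 registers


def encode(opcode, rd, ra, rb):
    """Pack an opcode and three register numbers into one 32-bit word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    for reg in (rd, ra, rb):
        if not 0 <= reg < (1 << REG_BITS):
            raise ValueError("register out of range")
    return (
        opcode
        | (rd << OPCODE_BITS)
        | (ra << (OPCODE_BITS + REG_BITS))
        | (rb << (OPCODE_BITS + 2 * REG_BITS))
    )


def decode(word):
    """Unpack a 32-bit word into (opcode, rd, ra, rb)."""
    reg_mask = (1 << REG_BITS) - 1
    opcode = word & ((1 << OPCODE_BITS) - 1)
    rd = (word >> OPCODE_BITS) & reg_mask
    ra = (word >> (OPCODE_BITS + REG_BITS)) & reg_mask
    rb = (word >> (OPCODE_BITS + 2 * REG_BITS)) & reg_mask
    return opcode, rd, ra, rb


MUL = 0x03  # hypothetical opcode number

word = encode(MUL, 3, 2, 1)  # mul r3, r2, r1
print(decode(word))          # (3, 3, 2, 1)
```

With this layout, a three-register instruction uses 23 of the 32 bits. That leaves room for flags or a small immediate, and decoding is a handful of shifts and masks.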

The only twist that I've added compared to "normal" CPUs is register and stack tagging. With physical processors,
it is often the case that software is written in statically typed languages, where data types are known at compile time,
and thus the compiler can generate specific instructions for handling, say, `int32` vs `int64`.

Consider the following code:

```
mul r3, r2, r1
```

It is essentially equivalent to `r3 = r2 * r1`. But what types do `r1` and `r2` have? Well, in the case of my virtual machine,
registers and stack entries "know" their type. So if you attempt to multiply an `int32` by a `uint64`, you'd get standard
type promotion, and the result would be tagged as `uint64`. Because of this, you don't have to perform checks at the bytecode
level.
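
Here's a rough Python sketch of what tagged multiplication could look like. It is an illustration, not the VM's actual implementation: the `Tagged` class, the `TYPES` table, and the promotion rule (the wider type wins, and unsigned wins at equal width) are assumptions modeled loosely on C-style integer promotion.

```python
from dataclasses import dataclass

# Assumed set of tags: name -> (bit width, is signed)
TYPES = {
    "int32": (32, True),
    "uint32": (32, False),
    "int64": (64, True),
    "uint64": (64, False),
}


@dataclass(frozen=True)
class Tagged:
    """A register or stack entry that carries its type tag with it."""
    tag: str
    value: int


def promote(a, b):
    """Pick the result tag: wider type wins, unsigned wins at equal width."""
    (width_a, signed_a), (width_b, _) = TYPES[a], TYPES[b]
    if width_a != width_b:
        return a if width_a > width_b else b
    return a if not signed_a else b


def wrap(tag, value):
    """Reduce an arbitrary Python int to the range of the given tag."""
    bits, signed = TYPES[tag]
    value &= (1 << bits) - 1
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return Tagged(tag, value)


def mul(a, b):
    """mul rd, ra, rb: the instruction itself carries no type information."""
    return wrap(promote(a.tag, b.tag), a.value * b.value)


r1 = Tagged("int32", 6)
r2 = Tagged("uint64", 7)
print(mul(r2, r1))  # Tagged(tag='uint64', value=42)
```

The same `mul` works for any pair of operand types, so a compiler targeting the VM doesn't need a separate opcode per type.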

So, how fast is this approach? In my preliminary tests, a simple loop from 1 to 100 million, with a multiplication inside,
takes 0.7 seconds to complete. That is plenty fast, considering that the VM implementation is naive and has never been
seriously optimized.

The code for the experimental version can be found [here](https://git.knazarov.com/knazarov/valeri).