X-Date: 2023-07-28T21:00:00Z
X-Note-Id: 67c55b15-462a-4bfd-8b6c-277535615938
Subject: Simple VM for dynamic languages
X-Slug: simple_dynamic_language_vm

A few weeks ago, I started working on a Lisp interpreter. I had already written a few Lisp implementations in different
languages, but those were mostly just recursive evaluators. This time, it's a bit more serious.

Instead of writing the implementation top-to-bottom, I started with a virtual machine. Most scripting languages are
executed by virtual machines, since bytecode is more compact and faster to evaluate than a syntax tree. Compared to a
tree-walker, bytecode VMs are easier on the branch predictor and friendlier to the CPU cache.

Because the VM is very barebones at the moment, and no language has been built on top of it yet, I created a very simple
assembly language. Here's an example that computes a factorial in it:

```
li r1, 10
li r2, 1
li r0, 1

factorial:
mul r0, r0, r2
addi r2, r2, 1
jle r2, r1, factorial
```

This code computes `10!` and returns it in the `r0` register.

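To make the semantics of that listing concrete, here is a minimal sketch of an interpreter for it in Python. The meaning of each instruction (`li` loads an immediate, `mul` and `addi` write to their first operand, `jle` jumps when the first register is less than or equal to the second) is inferred from the example and is an assumption, not the VM's documented behaviour. The `assemble` and `run` helpers are hypothetical names, and Python integers stand in for the VM's registers.

```python
# Hypothetical mini-interpreter for the assembly listing above.
# Instruction semantics are inferred from the example, not taken from the real VM.

PROGRAM = """
li r1, 10
li r2, 1
li r0, 1

factorial:
mul r0, r0, r2
addi r2, r2, 1
jle r2, r1, factorial
"""


def assemble(source):
    """Turn source text into a list of (opcode, operands) plus a label table."""
    code, labels = [], {}
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            labels[line[:-1]] = len(code)  # label points at the next instruction
            continue
        op, _, rest = line.partition(" ")
        code.append((op, [arg.strip() for arg in rest.split(",")]))
    return code, labels


def run(source):
    """Execute the program and return the final register file."""
    code, labels = assemble(source)
    regs = {f"r{i}": 0 for i in range(32)}
    pc = 0
    while pc < len(code):
        op, args = code[pc]
        pc += 1
        if op == "li":      # rd = immediate
            regs[args[0]] = int(args[1])
        elif op == "mul":   # rd = ra * rb
            regs[args[0]] = regs[args[1]] * regs[args[2]]
        elif op == "addi":  # rd = ra + immediate
            regs[args[0]] = regs[args[1]] + int(args[2])
        elif op == "jle":   # jump to the label if ra <= rb
            if regs[args[0]] <= regs[args[1]]:
                pc = labels[args[2]]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


print(run(PROGRAM)["r0"])  # 3628800
```
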
The architecture of the VM is "load/store", meaning that computation (addition, multiplication, comparisons, etc.) can only
be performed on registers. Data has to be moved between memory and registers with separate instructions. This contrasts a bit
with how some VMs are implemented: for some reason, many of them don't use registers at all and instead rely only on the stack.
For example, this is what a factorial function would look like in Python bytecode:

```
  2     >>    0 LOAD_FAST                0 (N)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               2 (==)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_FAST                1 (result)
             15 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (N)
             19 LOAD_CONST               1 (1)
             22 BINARY_SUBTRACT
             23 LOAD_FAST                0 (N)
             26 LOAD_FAST                1 (result)
             29 BINARY_MULTIPLY
             30 STORE_FAST               1 (result)
             33 STORE_FAST               0 (N)
             36 JUMP_ABSOLUTE            0
```

If you look carefully, you'll notice that there are no registers here. This is because operations that produce a value
usually push it onto the top of the stack.

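As an illustration of that push/pop discipline, here is a toy stack machine in Python. The opcode names are borrowed from the listing above, but the semantics are simplified and `run_stack` is a hypothetical helper, not CPython's evaluation loop. It executes `result = N * result`, which takes four stack instructions where a register machine needs a single `mul`.

```python
# Toy stack machine. Opcode names follow the Python listing above,
# but the behaviour is a simplified assumption, not CPython's real loop.

def run_stack(code, local_vars):
    stack = []
    for op, arg in code:
        if op == "LOAD_FAST":          # push a local variable
            stack.append(local_vars[arg])
        elif op == "LOAD_CONST":       # push a constant
            stack.append(arg)
        elif op == "BINARY_MULTIPLY":  # pop two operands, push their product
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif op == "STORE_FAST":       # pop the top of the stack into a local
            local_vars[arg] = stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    return local_vars


# result = N * result
code = [
    ("LOAD_FAST", "N"),
    ("LOAD_FAST", "result"),
    ("BINARY_MULTIPLY", None),
    ("STORE_FAST", "result"),
]
print(run_stack(code, {"N": 5, "result": 24}))  # {'N': 5, 'result': 120}
```
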
I find a few problems with stack machines:

- The disassembly is unnatural to read (you have to keep track of changing stack offsets instead of register names)
- Either instructions waste space, or we have to deal with variable-width instructions (as is the case for Python)
- Some potential for optimization is wasted

In my personal opinion, a good virtual machine for a dynamic language should also be a suitable target for compiling regular
expression state machines.

So instead, I opted for a more traditional approach that is similar to a RISC CPU:

- fixed-width 32-bit instructions
- flexible stack
- 32 registers, most of which are general-purpose, except for the frame pointer, instruction pointer, etc.
- 64-bit width for both registers and stack entries

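With 32 registers, a register index fits in 5 bits, so three register operands and an opcode fit comfortably in one 32-bit word. The sketch below shows one possible packing in Python. The field layout, the `OP_MUL` opcode number and the `encode`/`decode` helpers are all assumptions made up for illustration; the post does not describe the VM's real encoding.

```python
# Hypothetical 32-bit instruction layout (NOT the VM's real encoding):
#   bits 31..24  opcode
#   bits 23..19  destination register
#   bits 18..14  first source register
#   bits 13..9   second source register
#   bits  8..0   unused

OP_MUL = 0x03     # made-up opcode number
REG_MASK = 0x1F   # 5 bits are enough to address 32 registers


def encode(opcode, rd, ra, rb):
    """Pack an opcode and three register indices into one 32-bit word."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError(f"opcode out of range: {opcode}")
    for reg in (rd, ra, rb):
        if not 0 <= reg <= REG_MASK:
            raise ValueError(f"register index out of range: {reg}")
    return (opcode << 24) | (rd << 19) | (ra << 14) | (rb << 9)


def decode(word):
    """Unpack a 32-bit word into (opcode, rd, ra, rb)."""
    return (
        (word >> 24) & 0xFF,
        (word >> 19) & REG_MASK,
        (word >> 14) & REG_MASK,
        (word >> 9) & REG_MASK,
    )


word = encode(OP_MUL, 3, 2, 1)  # mul r3, r2, r1
print(f"{word:#010x}")          # 0x03188200
print(decode(word))             # (3, 3, 2, 1)
```
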
The only twist that I've added compared to "normal" CPUs is register and stack tagging. With physical processors,
it is often the case that software is written in statically typed languages, where data types are known at compile time,
and thus the compiler can generate specific instructions for handling, say, `int32` vs `int64`.

Consider the following code:

```
mul r3, r2, r1
```

It is essentially equivalent to `r3 = r2 * r1`. But what types do `r1` and `r2` have? Well, in the case of my virtual machine,
registers and stack entries "know" their type. So if you attempt to multiply an `int32` by a `uint64`, you'd get standard
type promotion, and the result would be tagged as `uint64`. Because of this, you don't have to perform type checks at the
bytecode level.

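A rough model of tagged values in Python might look like the sketch below. The `Value` class, the `TAGS` table and the promotion rule (the wider type wins, and unsigned wins at equal width, as in C's usual arithmetic conversions) are assumptions for illustration. The post only states that `int32` times `uint64` yields `uint64`, so the VM's actual rules may differ.

```python
from dataclasses import dataclass

# Hypothetical tag table: tag name -> (bit width, is signed).
TAGS = {
    "int32": (32, True),
    "uint32": (32, False),
    "int64": (64, True),
    "uint64": (64, False),
}


@dataclass(frozen=True)
class Value:
    tag: str    # type tag carried by the register or stack entry
    value: int  # numeric payload, always within the range of the tag


def promote(a, b):
    """Pick the result tag: the wider type wins, unsigned wins a tie."""
    (width_a, signed_a), (width_b, _) = TAGS[a], TAGS[b]
    if width_a != width_b:
        return a if width_a > width_b else b
    return a if not signed_a else b


def wrap(n, tag):
    """Reduce a Python integer to the range of the given tag."""
    width, signed = TAGS[tag]
    n &= (1 << width) - 1
    if signed and n >= 1 << (width - 1):
        n -= 1 << width
    return n


def mul(a, b):
    """Multiply two tagged values; the result carries the promoted tag."""
    tag = promote(a.tag, b.tag)
    return Value(tag, wrap(a.value * b.value, tag))


print(mul(Value("int32", -2), Value("uint64", 3)))
# Value(tag='uint64', value=18446744073709551610)
```
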
So, how fast is this approach? In my preliminary tests, a simple loop from 1 to 100 million, with a multiplication inside,
takes 0.7 seconds to complete, which is plenty fast, considering that the VM implementation is naive and has never been
seriously optimized.

The code for the experimental version can be found [here](https://git.knazarov.com/knazarov/valeri).