X-Date: 2023-07-28T21:00:00Z
X-Note-Id: 67c55b15-462a-4bfd-8b6c-277535615938
Subject: Simple VM for dynamic languages
X-Slug: simple_dynamic_language_vm

A few weeks ago, I started working on a lisp interpreter. I had already done a few implementations of lisp in different languages, but those were mostly just recursive evaluators. This time, it's a bit more serious.

Instead of writing the implementation top-to-bottom, I started with a virtual machine. Virtual machines are used to execute most scripting languages, since bytecode is more compact and faster to evaluate. Compared to a tree-walking interpreter, bytecode VMs are better for branch prediction and friendlier to the CPU cache.

Because the VM is very barebones at the moment, and no language runs on top of it yet, I created a very simple assembly language. Here's an example that computes a factorial in it:

```
li r1, 10
li r2, 1
li r0, 1

factorial:
mul r0, r0, r2
addi r2, r2, 1
jle r2, r1, factorial
```

This code computes `10!` and returns it in the `r0` register.
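
For comparison, here is the same loop written out in C, with each line of the assembly above mapped onto a statement (the register names are kept as variable names; this is just an illustration, not part of the VM):

```c
#include <stdint.h>

/* Equivalent of the assembly above: r1 = limit, r2 = counter, r0 = accumulator. */
uint64_t factorial_loop(uint64_t n) {
    uint64_t r1 = n;    /* li r1, 10 */
    uint64_t r2 = 1;    /* li r2, 1  */
    uint64_t r0 = 1;    /* li r0, 1  */
    do {                /* factorial:            */
        r0 = r0 * r2;   /* mul r0, r0, r2        */
        r2 = r2 + 1;    /* addi r2, r2, 1        */
    } while (r2 <= r1); /* jle r2, r1, factorial */
    return r0;
}
```

With `n = 10`, the loop multiplies the accumulator by 1 through 10, giving `3628800`.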

The architecture of the VM is "load/store", meaning that computation (addition, multiplication, comparisons, etc.) can only be performed on registers; data is loaded from memory into registers with separate instructions. This contrasts a bit with how some VMs are implemented: many of them don't use registers at all, and instead rely only on the stack. For example, this is what a factorial function looks like in Python bytecode:

```
  2     >>    0 LOAD_FAST                0 (N)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               2 (==)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_FAST                1 (result)
             15 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (N)
             19 LOAD_CONST               1 (1)
             22 BINARY_SUBTRACT
             23 LOAD_FAST                0 (N)
             26 LOAD_FAST                1 (result)
             29 BINARY_MULTIPLY
             30 STORE_FAST               1 (result)
             33 STORE_FAST               0 (N)
             36 JUMP_ABSOLUTE            0
```

If you look carefully, you'll notice that there are no registers here. That's because every operation that writes something does so to the top of the stack.

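To make the difference concrete, here is a minimal stack-machine dispatch loop (the opcodes are hypothetical, not Python's): every instruction pops its operands from the top of the stack and pushes its result back, so no instruction ever names a register.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcodes for a tiny stack machine. */
enum op { PUSH, MUL, SUB, HALT };

int64_t run(const int64_t *code) {
    int64_t stack[64];
    size_t sp = 0;                    /* points one past the top of the stack */
    for (size_t ip = 0;; ip++) {
        switch (code[ip]) {
        case PUSH:
            stack[sp++] = code[++ip]; /* immediate operand follows the opcode */
            break;
        case MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case SUB:
            sp--;
            stack[sp - 1] -= stack[sp];
            break;
        case HALT:
            return stack[sp - 1];     /* result is whatever is on top */
        }
    }
}
```

Note that even in this sketch, `PUSH` carries an inline operand while `MUL` doesn't, so instructions already have variable widths, which is exactly one of the problems discussed below.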
There are a few problems I find with stack machines:

- It is unnatural to read the disassembly (you have to keep track of changing stack offsets instead of register names)
- Either instructions waste space, or we deal with variable-width instructions (as is the case for Python)
- Some potential for optimization is wasted, since values are repeatedly pushed and popped instead of staying in registers

In my personal opinion, a good virtual machine for a dynamic language should also be a suitable target for compiling regular expression state machines.

So instead, I opted for a more traditional approach, similar to a RISC CPU:

- 32-bit constant-width instructions
- a flexible stack
- 32 registers, most of which are general-purpose, except for the frame pointer/instruction pointer/etc.
- 64-bit width for both register and stack entries

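To show how 32 constant-width bits can comfortably fit a three-register instruction like `mul r0, r0, r2`, here is a sketch of an encoder/decoder pair. The field layout is hypothetical (the post doesn't specify one): an 8-bit opcode plus three 5-bit register indices, which is enough to address all 32 registers.

```c
#include <stdint.h>

/* Hypothetical 32-bit instruction layout:
   bits 0..7  opcode, bits 8..12 rd, bits 13..17 rs1, bits 18..22 rs2. */
typedef struct {
    uint8_t opcode;
    uint8_t rd, rs1, rs2; /* destination and two source registers, 0..31 */
} insn_t;

uint32_t encode(uint8_t opcode, uint8_t rd, uint8_t rs1, uint8_t rs2) {
    return (uint32_t)opcode
         | ((uint32_t)(rd  & 0x1f) << 8)
         | ((uint32_t)(rs1 & 0x1f) << 13)
         | ((uint32_t)(rs2 & 0x1f) << 18);
}

insn_t decode(uint32_t word) {
    insn_t i;
    i.opcode = word & 0xff;
    i.rd  = (word >> 8)  & 0x1f;
    i.rs1 = (word >> 13) & 0x1f;
    i.rs2 = (word >> 18) & 0x1f;
    return i;
}
```

Because every instruction is exactly one 32-bit word, the dispatch loop can fetch and decode without ever looking at the opcode first, unlike in the variable-width case.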
The only twist I've added compared to "normal" CPUs is register and stack tagging. With physical processors, it is often the case that software is written in strongly-typed languages, where data types are known at compile time, and thus the compiler can generate specific instructions for handling, say, `int32` vs `int64`.

Consider the following code:

```
mul r3, r2, r1
```

It is essentially equivalent to `r3 = r2 * r1`. But what types do `r1` and `r2` have? Well, in the case of my virtual machine, registers and stack entries "know" their type. So if you attempt to multiply an `int32` by a `uint64`, you get standard type promotion, and the result is tagged as `uint64`. Because of this, you don't have to perform type checks at the bytecode level.

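A tagged `mul` could be sketched like this (the names and the two-tag universe are illustrative, not the actual implementation): each register value carries a type tag, and the operation promotes narrower operands before multiplying.

```c
#include <stdint.h>

/* Illustrative tags; the real VM would have more of them. */
typedef enum { TAG_INT32, TAG_UINT64 } tag_t;

typedef struct {
    tag_t tag;
    union { int32_t i32; uint64_t u64; } v;
} reg_t;

static uint64_t widen(reg_t r) {
    return r.tag == TAG_INT32 ? (uint64_t)r.v.i32 : r.v.u64;
}

/* r3 = r2 * r1 with standard promotion: if either operand is
   tagged uint64, the result is tagged uint64. */
reg_t mul_tagged(reg_t a, reg_t b) {
    reg_t out;
    if (a.tag == TAG_UINT64 || b.tag == TAG_UINT64) {
        out.tag = TAG_UINT64;
        out.v.u64 = widen(a) * widen(b);
    } else {
        out.tag = TAG_INT32;
        out.v.i32 = a.v.i32 * b.v.i32;
    }
    return out;
}
```

The promotion decision lives inside the instruction handler, which is why a single `mul` opcode suffices and the bytecode itself stays type-agnostic.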
So, how fast is this approach? In my preliminary tests, a simple loop from 1 to 100 million, with a multiplication inside, takes 0.7 seconds to complete. That is plenty fast, considering that the VM implementation is naive and has never been seriously optimized.

The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).