X-Date: 2023-07-28T21:00:00Z
X-Note-Id: 67c55b15-462a-4bfd-8b6c-277535615938
Subject: Simple VM for dynamic languages
X-Slug: simple_dynamic_language_vm

A few weeks ago, I started working on a lisp interpreter. I have already written a few lisp implementations in different
languages, but those were mostly simple recursive evaluators. This time, it's a bit more serious.

Instead of writing the implementation top-to-bottom, I started with a virtual machine. Most scripting languages are
executed by a virtual machine, since bytecode is more compact and faster to evaluate than a syntax tree. Compared to a
tree-walking interpreter, a bytecode VM is better for branch prediction and friendlier to the CPU cache.

Because the VM is very barebones at the moment and no language sits on top of it yet, I created a very simple assembly
language for it. Here's an example that computes a factorial:

```
li r1, 10
li r2, 1
li r0, 1

factorial:
    mul r0, r0, r2
    addi r2, r2, 1
    jle r2, r1, factorial
```

This code computes `10!` and returns it in the `r0` register.

The architecture of the VM is "load/store", meaning that computation (addition, multiplication, comparisons, and so on)
can only be performed on registers; data is moved between memory and registers with separate instructions. This contrasts
with how many VMs are implemented: a lot of them don't use registers at all and instead rely only on a stack.
For example, this is what a factorial function looks like in Python bytecode:

```
  2     >>    0 LOAD_FAST                0 (N)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               2 (==)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_FAST                1 (result)
             15 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (N)
             19 LOAD_CONST               1 (1)
             22 BINARY_SUBTRACT
             23 LOAD_FAST                0 (N)
             26 LOAD_FAST                1 (result)
             29 BINARY_MULTIPLY
             30 STORE_FAST               1 (result)
             33 STORE_FAST               0 (N)
             36 JUMP_ABSOLUTE            0
```

If you look carefully, you'll notice that there are no registers here: every operation that produces a value pushes it
onto the top of the stack.

There are a few problems I find with stack machines:

- The disassembly is unnatural to read (you have to keep track of changing stack offsets instead of register names)
- Either instructions waste space, or you deal with variable-width instructions (as is the case for Python)
- Some potential for optimization is wasted

In my opinion, a good virtual machine for a dynamic language should also be a suitable compilation target for regular
expression state machines.

So instead, I opted for a more traditional approach, similar to a RISC CPU (a rough sketch of what such an encoding could
look like follows the list):

- 32-bit constant-width instructions
- flexible stack
- 32 registers, most of which are general-purpose, with a few reserved for the frame pointer, instruction pointer, and so on
- 64-bit width for both register and stack entries
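The post doesn't show the VM's actual encoding, so here is a minimal sketch in C of what a fixed-width 32-bit instruction
format and a register-based dispatch loop could look like. The opcode set, the 8-bit operand fields, the small signed
immediates, and the `ENC` helper are assumptions made up for this illustration, not the real design:

```
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-width encoding: 8-bit opcode plus three 8-bit operand
 * fields (register indices or a small immediate). Illustration only. */
enum { OP_LI, OP_MUL, OP_ADDI, OP_JLE, OP_HALT };

typedef uint32_t insn_t;

#define ENC(op, a, b, c) \
    ((insn_t)(op) | ((insn_t)(a) << 8) | ((insn_t)(b) << 16) | ((insn_t)(c) << 24))

static int64_t regs[32];

static void run(const insn_t *code) {
    for (size_t pc = 0;;) {
        insn_t i = code[pc++];
        uint8_t op = i & 0xff, a = i >> 8, b = i >> 16, c = i >> 24;
        switch (op) {
        case OP_LI:   regs[a] = (int8_t)b; break;            /* li   ra, imm       */
        case OP_MUL:  regs[a] = regs[b] * regs[c]; break;    /* mul  ra, rb, rc    */
        case OP_ADDI: regs[a] = regs[b] + (int8_t)c; break;  /* addi ra, rb, imm   */
        case OP_JLE:  if (regs[a] <= regs[b]) pc = c; break; /* jle  ra, rb, target */
        case OP_HALT: return;
        }
    }
}

int main(void) {
    /* The factorial example from above, hand-assembled into the toy encoding. */
    const insn_t code[] = {
        ENC(OP_LI,   1, 10, 0),  /* li r1, 10                 */
        ENC(OP_LI,   2, 1,  0),  /* li r2, 1                  */
        ENC(OP_LI,   0, 1,  0),  /* li r0, 1                  */
        ENC(OP_MUL,  0, 0,  2),  /* factorial: mul r0, r0, r2 */
        ENC(OP_ADDI, 2, 2,  1),  /* addi r2, r2, 1            */
        ENC(OP_JLE,  2, 1,  3),  /* jle r2, r1, factorial     */
        ENC(OP_HALT, 0, 0,  0),
    };
    run(code);
    printf("10! = %lld\n", (long long)regs[0]); /* prints 3628800 */
    return 0;
}
```

The point of the sketch is that with constant-width instructions every operand sits at a fixed bit offset, so decoding is
a couple of shifts and masks rather than a variable-length parse.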
The only twist I've added compared to "normal" CPUs is register and stack tagging. On physical processors, software is
usually written in statically typed languages, where data types are known at compile time, so the compiler can emit
specific instructions for handling, say, `int32` vs `int64`.

Consider the following code:

```
mul r3, r2, r1
```

It is essentially equivalent to `r3 = r2 * r1`. But what types do `r1` and `r2` have? In the case of my virtual machine,
registers and stack entries "know" their type. So if you attempt to multiply an `int32` by a `uint64`, you get standard
type promotion, and the result is tagged as `uint64`. Because of this, you don't have to encode type checks at the
bytecode level.

So, how fast is this approach? In my preliminary tests, a simple loop from 1 to 100 million, with a multiplication inside,
takes 0.7 seconds to complete, which is plenty fast considering that the VM implementation is naive and has never been
seriously optimized.

The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).
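To make the tagging idea above a bit more concrete, here is a minimal sketch of a tagged register value and a `mul`
handler that promotes `int32` × `uint64` to `uint64`. The tag set, the `value_t` layout, and the promotion rule are
assumptions for the sake of the example, not the actual implementation from the repository linked above:

```
#include <stdint.h>
#include <stdio.h>

/* Each register (or stack slot) carries a type tag next to its 64-bit payload.
 * Tags are ordered so that the "wider" type wins during promotion. */
typedef enum { TAG_INT32, TAG_INT64, TAG_UINT64 } tag_t;

typedef struct {
    tag_t tag;
    union { int32_t i32; int64_t i64; uint64_t u64; } as;
} value_t;

/* Promote both operands to the wider tag, multiply, and tag the result.
 * Real promotion rules would be more involved (floats, overflow, bignums, ...). */
static value_t vm_mul(value_t a, value_t b) {
    tag_t t = a.tag > b.tag ? a.tag : b.tag;
    value_t r = { .tag = t };
    if (t == TAG_UINT64) {
        uint64_t x = a.tag == TAG_UINT64 ? a.as.u64 : (uint64_t)(a.tag == TAG_INT64 ? a.as.i64 : a.as.i32);
        uint64_t y = b.tag == TAG_UINT64 ? b.as.u64 : (uint64_t)(b.tag == TAG_INT64 ? b.as.i64 : b.as.i32);
        r.as.u64 = x * y;
    } else if (t == TAG_INT64) {
        int64_t x = a.tag == TAG_INT64 ? a.as.i64 : a.as.i32;
        int64_t y = b.tag == TAG_INT64 ? b.as.i64 : b.as.i32;
        r.as.i64 = x * y;
    } else {
        r.as.i32 = a.as.i32 * b.as.i32;
    }
    return r;
}

int main(void) {
    /* mul r3, r2, r1 where r2 holds an int32 and r1 a uint64:
     * the result comes out tagged as uint64. */
    value_t r1 = { .tag = TAG_UINT64, .as.u64 = 7 };
    value_t r2 = { .tag = TAG_INT32,  .as.i32 = 6 };
    value_t r3 = vm_mul(r2, r1);
    printf("tag=%d value=%llu\n", (int)r3.tag, (unsigned long long)r3.as.u64);
    return 0;
}
```

Because the tag travels with the value, the bytecode itself stays type-agnostic: a single `mul` opcode covers every
numeric combination, and the promotion rules live in one place inside the VM rather than in the compiled program.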