From 8b2335f3f17569352baa2c62644faa73815dc18f Mon Sep 17 00:00:00 2001
From: Konstantin Nazarov
Date: Sat, 5 Aug 2023 22:33:58 +0100
Subject: [PATCH] Publish a post on frozen data types in the VM

---
 .../note.md | 84 +++++++++++++++++++
 1 file changed, 84 insertions(+)
 create mode 100644 content/posts/vm_progress_update_frozen_data_types/note.md

diff --git a/content/posts/vm_progress_update_frozen_data_types/note.md b/content/posts/vm_progress_update_frozen_data_types/note.md
new file mode 100644
index 0000000..fe90315
--- /dev/null
+++ b/content/posts/vm_progress_update_frozen_data_types/note.md
@@ -0,0 +1,84 @@
X-Date: 2023-08-05T21:16:00Z
X-Note-Id: cc2f05b8-6a95-452a-875b-887fde269c35
Subject: VM progress update: frozen data types
X-Slug: vm_progress_update_frozen_data_types

One of the things I want from the virtual machine runtime is to be able to easily work with serialized data.
In most programming languages, what you send over the network or write to files is different from the
in-memory representation. The reason is simple: CPUs can have different
[endianness](https://en.wikipedia.org/wiki/Endianness), so the way your program represents numbers in memory will
differ between, say, x86 and ARM. The serialized form's binary encoding is therefore usually "universal".

Additionally, the serialized representation is usually packed sequentially. That's why people invented universal
serialization formats such as [Protocol Buffers](https://github.com/protocolbuffers/protobuf),
[Thrift](https://thrift.apache.org/), or XML/JSON (which are text-based, for human readability).

I'm specifically designing the runtime to be "shared-nothing": separate worker threads cannot share any state and
only communicate through message passing. These messages are sent either within the same process or across the
network/IPC, so they have to be serialized at least to some degree.

If I approached this in the "standard" way, I'd need explicit deserialization functions that receive incoming data,
unpack it into a set of native data structures, and then work with those. But there's actually a better way: the
approach taken by [Cap’n Proto](https://capnproto.org/) avoids the need for any deserialization at all. The data
structures it receives can be mapped directly into memory, and the language runtime then figures out how to access
individual fields.

The binary format of Cap'n Proto is slightly less space-efficient than Thrift or Protocol Buffers, but it makes up
for that by being very fast. And being fast is exactly what I need for cross-thread interaction.

I can probably also mitigate the second drawback of Cap'n Proto: the need for code generation. Because Cap'n Proto
is language-agnostic, it has to provide code generators that wrap the received data structures and expose their
fields through a high-level object interface. But since I'm writing my own runtime, I can simply build all the
required data structures into the runtime itself.

This is what I would call "frozen data structures". From the point of view of the VM, they look like normal data
structures, except that they are laid out sequentially in a single contiguous block of memory and cannot be mutated.
Frozen data structures can only point into the same contiguous frozen memory space, and those pointers are not
absolute addresses but plain integer offsets.
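
To make that concrete, here's a rough C sketch of the idea (purely illustrative; this is not the actual layout the
VM uses): a frozen block is one contiguous buffer, and a reference inside it is just a byte offset from the start
of that buffer, so resolving it is a single addition.

```
/* Purely illustrative sketch, not the VM's actual layout: a frozen block
 * is one contiguous buffer, and references inside it are byte offsets
 * from the start of that buffer rather than absolute pointers. */
#include <stdint.h>

typedef struct {
    uint64_t elements[2]; /* offsets of the two child values, in bytes */
} frozen_pair_t;

/* Resolving an offset is just "block start + offset", so the block can be
 * copied or mapped anywhere without fixing anything up. */
static inline void *frozen_resolve(uint8_t *block_start, uint64_t offset) {
    return block_start + offset;
}
```

The important property is that nothing inside the block depends on where the block itself ends up in memory.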

Here's some pseudocode that shows how this looks from the language side:

```
# Somewhere in thread1
dict = {"foo": "bar", "baz": [5,6,7,8]}
frozen_dict = freeze(dict)
assert frozen_dict["baz"][2] == 7
send(thread2, frozen_dict)

# Somewhere in thread2
frozen_dict = receive()
assert frozen_dict["baz"][2] == 7
```

When `freeze()` is called on a data structure, the VM walks it and returns its representation as a contiguous
immutable block. This block can then be worked with in the same way as a regular data structure.

The second thread calls `receive()` and gets the frozen data structure, which has been copied into its memory.
Ideally, this is as simple as a `memcpy()` plus a few bounds checks to verify that nothing in the data structure
points outside of its memory region. Then we can continue to work with it normally.

Now, a few words on how this is represented in the virtual machine right now. The VM mostly has primitive data
types such as integers of various widths (both signed and unsigned) and arrays. Arrays have both a mutable and an
immutable representation, so when the VM holds a pointer to an array, it can tell whether mutation is prohibited.

The more interesting part is pointers: as I said above, objects laid out in a contiguous region only point to each
other with integer offsets. So you can't point directly into the middle of the region (not least because of garbage
collection issues that I'll explain in a separate post).

So, pointers within the VM are represented roughly as:

```
struct pointer_t {
    uint8_t tag;           // Type tag; load/store uses it to tell frozen pointers apart
    uint64_t offset : 56;  // Use only 56 bits so the whole pointer fits into 16 bytes
    void* ptr;             // Points to the beginning of the contiguous memory block
};
```

This makes it possible to assign pointers to registers and do all the normal operations with them. For example,
load/store operations check whether the pointer tag corresponds to a frozen data type, and if so, the resulting
object is also returned as a "frozen pointer", just with an adjusted offset from the start of the region.

So far, I've implemented most of the required data structures, but haven't completely finished the access layer,
so there are no transparent loads/stores yet. Stay tuned for the updates.

The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).
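
To make the load path above a bit more concrete, here's a small C sketch of what resolving a field through a frozen
pointer could look like. This is not the actual implementation from the repository; `FROZEN_TAG`,
`frozen_load_field()`, the field layout, and the bounds checks are all made up for illustration:

```
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FROZEN_TAG 0x01   /* hypothetical tag value for frozen objects */

struct pointer_t {
    uint8_t tag;
    uint64_t offset : 56;
    void* ptr;            /* beginning of the contiguous frozen block */
};

/* Read a field stored `field_offset` bytes into the object this pointer
 * refers to. The field holds an offset into the same block, so the result
 * is returned as another frozen pointer with an adjusted offset. */
static struct pointer_t frozen_load_field(struct pointer_t p,
                                          uint64_t field_offset,
                                          uint64_t block_size) {
    assert(p.tag == FROZEN_TAG);

    uint64_t field_pos = p.offset + field_offset;
    assert(field_pos + sizeof(uint64_t) <= block_size);

    uint64_t target_offset;
    memcpy(&target_offset, (uint8_t*)p.ptr + field_pos, sizeof(target_offset));
    assert(target_offset < block_size);   /* must stay inside the block */

    struct pointer_t result = { .tag = FROZEN_TAG, .offset = target_offset, .ptr = p.ptr };
    return result;
}
```

The base pointer and the tag stay the same; only the offset changes.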