From 8b2335f3f17569352baa2c62644faa73815dc18f Mon Sep 17 00:00:00 2001
From: Konstantin Nazarov
Date: Sat, 5 Aug 2023 22:33:58 +0100
Subject: [PATCH] Publish a post on frozen data types in the VM

---
 .../note.md | 84 +++++++++++++++++++
 1 file changed, 84 insertions(+)
 create mode 100644 content/posts/vm_progress_update_frozen_data_types/note.md

diff --git a/content/posts/vm_progress_update_frozen_data_types/note.md b/content/posts/vm_progress_update_frozen_data_types/note.md
new file mode 100644
index 0000000..fe90315
--- /dev/null
+++ b/content/posts/vm_progress_update_frozen_data_types/note.md
@@ -0,0 +1,84 @@
X-Date: 2023-08-05T21:16:00Z
X-Note-Id: cc2f05b8-6a95-452a-875b-887fde269c35
Subject: VM progress update: frozen data types
X-Slug: vm_progress_update_frozen_data_types

One of the things I want from the virtual machine runtime is to be able to easily work with serialized data.
In most programming languages, what you send over the network or write to files is different from the
in-memory representation. The reason is simple: CPUs can have different
[endianness](https://en.wikipedia.org/wiki/Endianness), so the way your program represents numbers in memory will
differ between, say, x86 and ARM. The serialized form's binary encoding is therefore usually "universal".

Additionally, the serialized representation is usually packed sequentially. That's why people invented universal
serialization formats such as [Protocol Buffers](https://github.com/protocolbuffers/protobuf),
[Thrift](https://thrift.apache.org/), or XML/JSON (which are text-based, for human readability).

I'm specifically designing the runtime to be "shared-nothing": separate worker threads cannot share any state and
only communicate through message passing. These messages are sent either within the same process or across the
network/IPC, so they have to be serialized at least to some degree.

If I approached this in the "standard" way, I'd need explicit deserialization functions that receive incoming data,
unpack it into a set of native data structures, and then work with those. But there's actually a better way: the
approach taken by [Cap’n Proto](https://capnproto.org/) avoids the need for any deserialization at all. The data
structures it receives can be mapped directly into memory, and the language runtime then figures out how to access
individual fields.

The binary format of Cap'n Proto is slightly less space-efficient than Thrift or Protocol Buffers, but it makes up
for that by being very fast. And being fast is exactly what I need for cross-thread interaction.

I can probably also mitigate the second drawback of Cap'n Proto: the need for code generation. Because Cap'n Proto
is language-agnostic, it has to provide code generators that wrap the received data structures and expose their
fields through a high-level object interface. But since I'm writing my own runtime, I can simply build all the
required data structures into the runtime itself.

This is what I would call "frozen data structures". From the point of view of the VM, they look like normal data
structures, except that they are laid out sequentially in a single contiguous block of memory and cannot be mutated.
Frozen data structures can only point into the same contiguous frozen memory space, and those pointers are not
absolute addresses but plain integer offsets.
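
To make that concrete, here's a rough C sketch of the idea (purely illustrative; this is not the actual layout the
VM uses): a frozen block is one contiguous buffer, and a reference inside it is just a byte offset from the start
of that buffer, so resolving it is a single addition.

```
/* Purely illustrative sketch, not the VM's actual layout: a frozen block
 * is one contiguous buffer, and references inside it are byte offsets
 * from the start of that buffer rather than absolute pointers. */
#include <stdint.h>

typedef struct {
    uint64_t elements[2]; /* offsets of the two child values, in bytes */
} frozen_pair_t;

/* Resolving an offset is just "block start + offset", so the block can be
 * copied or mapped anywhere without fixing anything up. */
static inline void *frozen_resolve(uint8_t *block_start, uint64_t offset) {
    return block_start + offset;
}
```

The important property is that nothing inside the block depends on where the block itself ends up in memory.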

Here's some pseudocode that shows how this looks from the language side:

```
# Somewhere in thread1
dict = {"foo": "bar", "baz": [5,6,7,8]}
frozen_dict = freeze(dict)
assert frozen_dict["baz"][2] == 7
send(thread2, frozen_dict)

# Somewhere in thread2
frozen_dict = receive()
assert frozen_dict["baz"][2] == 7
```

When `freeze()` is called on a data structure, the VM walks it and returns its representation as a contiguous
immutable block. This block can then be worked with in the same way as a regular data structure.

The second thread calls `receive()` and gets the frozen data structure, which has been copied into its memory.
Ideally, this is as simple as a `memcpy()` plus a few bounds checks to verify that nothing in the data structure
points outside of its memory region. Then we can continue to work with it normally.

Now, a few words on how this is represented in the virtual machine right now. The VM mostly has primitive data
types such as integers of various widths (both signed and unsigned) and arrays. Arrays have both a mutable and an
immutable representation, so when the VM holds a pointer to an array, it can tell whether mutation is prohibited.

The more interesting part is pointers: as I said above, objects laid out in a contiguous region only point to each
other with integer offsets. So you can't point directly into the middle of the region (not least because of garbage
collection issues that I'll explain in a separate post).

So, pointers within the VM are represented roughly as:

```
struct pointer_t {
    uint8_t tag;           // Type tag; load/store uses it to tell frozen pointers apart
    uint64_t offset : 56;  // Use only 56 bits so the whole pointer fits into 16 bytes
    void* ptr;             // Points to the beginning of the contiguous memory block
};
```

This makes it possible to assign pointers to registers and do all the normal operations with them. For example,
load/store operations check whether the pointer tag corresponds to a frozen data type, and if so, the resulting
object is also returned as a "frozen pointer", just with an adjusted offset from the start of the region.

So far, I've implemented most of the required data structures, but haven't completely finished the access layer,
so there are no transparent loads/stores yet. Stay tuned for the updates.

The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).
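
To make the load path above a bit more concrete, here's a small C sketch of what resolving a field through a frozen
pointer could look like. This is not the actual implementation from the repository; `FROZEN_TAG`,
`frozen_load_field()`, the field layout, and the bounds checks are all made up for illustration:

```
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FROZEN_TAG 0x01   /* hypothetical tag value for frozen objects */

struct pointer_t {
    uint8_t tag;
    uint64_t offset : 56;
    void* ptr;            /* beginning of the contiguous frozen block */
};

/* Read a field stored `field_offset` bytes into the object this pointer
 * refers to. The field holds an offset into the same block, so the result
 * is returned as another frozen pointer with an adjusted offset. */
static struct pointer_t frozen_load_field(struct pointer_t p,
                                          uint64_t field_offset,
                                          uint64_t block_size) {
    assert(p.tag == FROZEN_TAG);

    uint64_t field_pos = p.offset + field_offset;
    assert(field_pos + sizeof(uint64_t) <= block_size);

    uint64_t target_offset;
    memcpy(&target_offset, (uint8_t*)p.ptr + field_pos, sizeof(target_offset));
    assert(target_offset < block_size);   /* must stay inside the block */

    struct pointer_t result = { .tag = FROZEN_TAG, .offset = target_offset, .ptr = p.ptr };
    return result;
}
```

The base pointer and the tag stay the same; only the offset changes.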