Publish a post on frozen data types in the VM

2023-08-05 22:33:58 +01:00 · 2023-08-05 22:33:58 +01:00 · 8b2335f3f1
commit 8b2335f3f1
parent 99715f69cb
1 changed files with 84 additions and 0 deletions
--- a/content/posts/vm_progress_update_frozen_data_types/note.md
+++ b/content/posts/vm_progress_update_frozen_data_types/note.md
@ -0,0 +1,84 @@
 X-Date: 2023-08-05T21:16:00Z
 X-Note-Id: cc2f05b8-6a95-452a-875b-887fde269c35
 Subject: VM progress update: frozen data types
 X-Slug: vm_progress_update_frozen_data_types
 One of the things I want from the virtual machine runtime is to be able to easily work with serialized data.
 Usually in typical programming languages, what you send over the network or write to files is different from the
 in-memory representation. The reason for this is simple: your CPU may have different
 [endianness](https://en.wikipedia.org/wiki/Endianness), so the way your program represents numbers will differ
 between, say, x86 and arm. So, the serialized form's binary encoding is usually "universal".
 Additionally, serialized representation is usually sequentially packed. So, people invented universal serialization
 formats such as [Protocol Buffers](https://github.com/protocolbuffers/protobuf), [Thrift](https://thrift.apache.org/),
 or XML/JSON (these are text ones for human readability).
 I specifically design the runtime to be "shared-nothing", where separate worker threads cannot share any state, and only
 communicate through message passing. These messages would be sent either within the same process, or across network/IPC.
 So they have to be serialized at least to some degree.
 If I approach this in a "standard" way, I'd have to have explicit deserialization functions that receive incoming data,
 call a function to unpack it to a set of native data structures, and then work with those. But there's actually a better
 way: the approach taken by [Cap’n Proto](https://capnproto.org/) completely evades the need to have any type of deserialization.
 The data structures it receives can just be directly mapped into memory, and language runtime then figures out how to access
 individual fields.
 The binary format of Cap'n Proto is very slightly inefficient, because it needs more space than Thrift or Protocol Buffers.
 But it makes up for that by being very fast. And being fast is exactly what I need for cross-thread interaction.
 And I can probably also mitigate the second problem in Cap'n Proto: the need for code generation. Because Cap'n Proto is
 language-agnostic, it has to provide code generators that would wrap the received data structures and allow access to their
 fields through high-level object interface. But since I'm writing my own runtime, I can just build all required data structures
 into the runtime.
 And this is what I would call "frozen data structures". From the point of view of the VM, they look like normal data structures,
 except they are laid out in memory sequentially in one block, and you can't mutate them. Frozen data structures can only have
 pointers to the same contiguous frozen memory space, and those pointers are not absolute, but just integer offsets.
 For example, here's some pseudocode that illustrates this:
 ```
 # Somewhere in thread1
 dict = {"foo": "bar", "baz": [5,6,7,8]}
 frozen_dict = freeze(dict)
 assert frozen_dict["baz"][2] == 7
 send(thread2, frozen_dict)
 # Somewhere in thread2
 frozen_dict = receive()
 assert frozen_dict["baz"][2] == 7
 ```
 When `freeze()` is called on a data structure, the VM walks the data structure and returns its representation
 as a contiguous immutable block. This block can then be worked with in the same way as regular data structures.
 The second thread calls `receive()` and gets the frozen data structure, that's been copied into its memory.
 Ideally, this is as simple as a `memcpy()` plus a few bounds checks to see that nothing in that data structure
 points outside of its memory region. Then we can continue to work with it normally.
 Now, a few words on how it is represented in the virtual machine right now. The virtual machine mostly has primitive
 data types such as various types of integers (both signed and unsigned) and arrays. The arrays have both mutable and
 immutable representation, so when a VM holds a pointer to the array, it can determine whether mutation is prohibited.
 The more interesting part is pointers: as I was saying above, the objects laid out in a contiguous region, only
 point to each other with integer offsets. So you can't directly point to the inside of the region (at least because
 of garbage collection issues that I would explain in a separate post).
 So, the pointers within the VM are represented roughly as:
 ```
 struct pointer_t {
  uint8_t tag;
  uint64_t offset : 56; // Use only 56 bits to fit the whole pointer into 16 bytes
  void* ptr; // Points to the beginning of contiguous memory block
 }
 ```
 This allows to assign pointers to the registers, and do all normal operations with them. For example, load/store
 operations would check if the pointer tag is related to the frozen data type and if so, the resulting object would
 also be returned as a "frozen pointer", just with an adjusted offset from the start of the region.
 So far, I've implemented most of the required data structures, but haven't finished the access layer completely,
 so no transparent load/stores yet. Stay tuned for the updates.
 The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).