Publish a post on frozen data types in the VM

This commit is contained in:
Konstantin Nazarov 2023-08-05 22:33:58 +01:00
parent 99715f69cb
commit 8b2335f3f1
Signed by: knazarov
GPG key ID: 4CFE0A42FA409C22

View file

@ -0,0 +1,84 @@
X-Date: 2023-08-05T21:16:00Z
X-Note-Id: cc2f05b8-6a95-452a-875b-887fde269c35
Subject: VM progress update: frozen data types
X-Slug: vm_progress_update_frozen_data_types
One of the things I want from the virtual machine runtime is to be able to easily work with serialized data.
Usually in typical programming languages, what you send over the network or write to files is different from the
in-memory representation. The reason for this is simple: your CPU may have different
[endianness](https://en.wikipedia.org/wiki/Endianness), so the way your program represents numbers will differ
between, say, x86 and arm. So, the serialized form's binary encoding is usually "universal".
Additionally, serialized representation is usually sequentially packed. So, people invented universal serialization
formats such as [Protocol Buffers](https://github.com/protocolbuffers/protobuf), [Thrift](https://thrift.apache.org/),
or XML/JSON (these are text ones for human readability).
I specifically design the runtime to be "shared-nothing", where separate worker threads cannot share any state, and only
communicate through message passing. These messages would be sent either within the same process, or across network/IPC.
So they have to be serialized at least to some degree.
If I approach this in a "standard" way, I'd have to have explicit deserialization functions that receive incoming data,
call a function to unpack it to a set of native data structures, and then work with those. But there's actually a better
way: the approach taken by [Capn Proto](https://capnproto.org/) completely evades the need to have any type of deserialization.
The data structures it receives can just be directly mapped into memory, and language runtime then figures out how to access
individual fields.
The binary format of Cap'n Proto is very slightly inefficient, because it needs more space than Thrift or Protocol Buffers.
But it makes up for that by being very fast. And being fast is exactly what I need for cross-thread interaction.
And I can probably also mitigate the second problem in Cap'n Proto: the need for code generation. Because Cap'n Proto is
language-agnostic, it has to provide code generators that would wrap the received data structures and allow access to their
fields through high-level object interface. But since I'm writing my own runtime, I can just build all required data structures
into the runtime.
And this is what I would call "frozen data structures". From the point of view of the VM, they look like normal data structures,
except they are laid out in memory sequentially in one block, and you can't mutate them. Frozen data structures can only have
pointers to the same contiguous frozen memory space, and those pointers are not absolute, but just integer offsets.
For example, here's some pseudocode that illustrates this:
```
# Somewhere in thread1
dict = {"foo": "bar", "baz": [5,6,7,8]}
frozen_dict = freeze(dict)
assert frozen_dict["baz"][2] == 7
send(thread2, frozen_dict)
# Somewhere in thread2
frozen_dict = receive()
assert frozen_dict["baz"][2] == 7
```
When `freeze()` is called on a data structure, the VM walks the data structure and returns its representation
as a contiguous immutable block. This block can then be worked with in the same way as regular data structures.
The second thread calls `receive()` and gets the frozen data structure, that's been copied into its memory.
Ideally, this is as simple as a `memcpy()` plus a few bounds checks to see that nothing in that data structure
points outside of its memory region. Then we can continue to work with it normally.
Now, a few words on how it is represented in the virtual machine right now. The virtual machine mostly has primitive
data types such as various types of integers (both signed and unsigned) and arrays. The arrays have both mutable and
immutable representation, so when a VM holds a pointer to the array, it can determine whether mutation is prohibited.
The more interesting part is pointers: as I was saying above, the objects laid out in a contiguous region, only
point to each other with integer offsets. So you can't directly point to the inside of the region (at least because
of garbage collection issues that I would explain in a separate post).
So, the pointers within the VM are represented roughly as:
```
struct pointer_t {
uint8_t tag;
uint64_t offset : 56; // Use only 56 bits to fit the whole pointer into 16 bytes
void* ptr; // Points to the beginning of contiguous memory block
}
```
This allows to assign pointers to the registers, and do all normal operations with them. For example, load/store
operations would check if the pointer tag is related to the frozen data type and if so, the resulting object would
also be returned as a "frozen pointer", just with an adjusted offset from the start of the region.
So far, I've implemented most of the required data structures, but haven't finished the access layer completely,
so no transparent load/stores yet. Stay tuned for the updates.
The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).