Publish a post on frozen data types in the VM
parent 99715f69cb
commit 8b2335f3f1
1 changed files with 84 additions and 0 deletions
84
content/posts/vm_progress_update_frozen_data_types/note.md
Normal file
@@ -0,0 +1,84 @@
X-Date: 2023-08-05T21:16:00Z
X-Note-Id: cc2f05b8-6a95-452a-875b-887fde269c35
Subject: VM progress update: frozen data types
X-Slug: vm_progress_update_frozen_data_types
One of the things I want from the virtual machine runtime is to be able to easily work with serialized data.
In most programming languages, what you send over the network or write to files differs from the
in-memory representation. The reason is simple: CPUs may have different
[endianness](https://en.wikipedia.org/wiki/Endianness), so the way your program represents numbers can differ
between, say, little-endian x86 and a big-endian CPU. So the serialized form's binary encoding is usually "universal".
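
To make the byte-order point concrete, here is a small Python sketch (an illustration only, not code from the VM). It packs the same 32-bit number in both byte orders and shows what happens when the bytes are read back with the wrong one:

```python
import struct

value = 0x01020304

little = struct.pack("<I", value)  # little-endian, as on x86
big = struct.pack(">I", value)     # big-endian ("network" byte order)

print(little.hex())  # 04030201
print(big.hex())     # 01020304

# Reading the bytes with the wrong byte order yields a different number.
misread = struct.unpack(">I", little)[0]
print(hex(misread))  # 0x4030201
```

A "universal" format simply fixes one of these orders in its specification, so both sides agree regardless of the CPU.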
Additionally, the serialized representation is usually packed sequentially. So people invented universal serialization
formats such as [Protocol Buffers](https://github.com/protocolbuffers/protobuf), [Thrift](https://thrift.apache.org/),
or XML/JSON (the latter two are text formats, chosen for human readability).
I'm specifically designing the runtime to be "shared-nothing": separate worker threads cannot share any state and only
communicate through message passing. These messages are sent either within the same process or across the network/IPC,
so they have to be serialized at least to some degree.
If I approached this in the "standard" way, I'd need explicit deserialization functions that receive incoming data,
unpack it into a set of native data structures, and then work with those. But there's actually a better
way: the approach taken by [Cap’n Proto](https://capnproto.org/) avoids the need for a separate deserialization step altogether.
The data structures it receives can be mapped directly into memory, and the language runtime then figures out how to access
individual fields.
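
The idea of reading fields straight out of the received bytes can be sketched in Python as follows. The record layout, the `PointView` class, and its field names are made up for illustration; this is not Cap'n Proto's real wire format, only the general "no unpacking step" technique:

```python
import struct

# Hypothetical fixed layout (NOT Cap'n Proto's real wire format):
#   offset 0: uint32 id
#   offset 4: uint16 x
#   offset 6: uint16 y
ID_OFFSET, X_OFFSET, Y_OFFSET = 0, 4, 6


class PointView:
    """Reads fields lazily from the received bytes instead of unpacking them up front."""

    def __init__(self, buffer):
        self._buf = memoryview(buffer)  # a view, not a copy of the bytes

    @property
    def id(self):
        return struct.unpack_from("<I", self._buf, ID_OFFSET)[0]

    @property
    def x(self):
        return struct.unpack_from("<H", self._buf, X_OFFSET)[0]

    @property
    def y(self):
        return struct.unpack_from("<H", self._buf, Y_OFFSET)[0]


received = struct.pack("<IHH", 42, 10, 20)  # stands in for bytes read from a socket
point = PointView(received)
print(point.id, point.x, point.y)  # 42 10 20
```

Each field access costs one small read at a known offset, and fields that are never touched cost nothing.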
The binary format of Cap'n Proto is slightly less compact than that of Thrift or Protocol Buffers, so it needs more space.
But it makes up for that by being very fast. And being fast is exactly what I need for cross-thread interaction.
I can probably also mitigate the second problem with Cap'n Proto: the need for code generation. Because Cap'n Proto is
language-agnostic, it has to provide code generators that wrap the received data structures and allow access to their
fields through a high-level object interface. But since I'm writing my own runtime, I can just build all the required data structures
into the runtime.
And this is what I would call "frozen data structures". From the point of view of the VM, they look like normal data structures,
except they are laid out in memory sequentially in one block, and you can't mutate them. Frozen data structures can only have
pointers to the same contiguous frozen memory space, and those pointers are not absolute, but just integer offsets.
For example, here's some pseudocode that illustrates this:
```
# Somewhere in thread1
dict = {"foo": "bar", "baz": [5, 6, 7, 8]}
frozen_dict = freeze(dict)
assert frozen_dict["baz"][2] == 7
send(thread2, frozen_dict)

# Somewhere in thread2
frozen_dict = receive()
assert frozen_dict["baz"][2] == 7
```
When `freeze()` is called on a data structure, the VM walks the data structure and returns its representation
as a contiguous immutable block. This block can then be worked with in the same way as regular data structures.
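
As a rough illustration of that walk, here is a toy Python version. Everything in it is an assumption made for the sketch: the type tags, the byte layout, and the names `_write`, `read`, `root` and `FrozenList` are hypothetical and do not come from the VM. It handles only integers, strings and lists, and stores child references as integer offsets from the start of the block:

```python
import struct

TAG_INT, TAG_STR, TAG_LIST = 0, 1, 2  # hypothetical type tags


def _write(out, value):
    """Append `value` to `out` and return the offset where it starts."""
    if isinstance(value, int):
        offset = len(out)
        out += struct.pack("<Bq", TAG_INT, value)
    elif isinstance(value, str):
        data = value.encode("utf-8")
        offset = len(out)
        out += struct.pack("<BI", TAG_STR, len(data)) + data
    elif isinstance(value, list):
        # Children are written first, so the list can refer to them by offset.
        children = [_write(out, item) for item in value]
        offset = len(out)
        out += struct.pack("<BI", TAG_LIST, len(children))
        out += struct.pack(f"<{len(children)}I", *children)
    else:
        raise TypeError(f"cannot freeze {type(value).__name__}")
    return offset


def freeze(value):
    """Return an immutable block: a 4-byte root offset followed by the objects."""
    out = bytearray(4)  # placeholder for the root offset
    struct.pack_into("<I", out, 0, _write(out, value))
    return bytes(out)


class FrozenList:
    """Read-only view over a list stored inside a frozen block."""

    def __init__(self, block, offset):
        self._block = block
        (self._count,) = struct.unpack_from("<I", block, offset + 1)
        self._slots = offset + 5  # start of the table of child offsets

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        (child,) = struct.unpack_from("<I", self._block, self._slots + 4 * index)
        return read(self._block, child)


def read(block, offset):
    """Interpret the object stored at `offset` without copying the block."""
    tag = block[offset]
    if tag == TAG_INT:
        return struct.unpack_from("<q", block, offset + 1)[0]
    if tag == TAG_STR:
        (length,) = struct.unpack_from("<I", block, offset + 1)
        return block[offset + 5 : offset + 5 + length].decode("utf-8")
    if tag == TAG_LIST:
        return FrozenList(block, offset)
    raise ValueError(f"unknown tag {tag}")


def root(block):
    """Return the top-level object of a frozen block."""
    (offset,) = struct.unpack_from("<I", block, 0)
    return read(block, offset)


block = freeze(["bar", [5, 6, 7, 8]])
frozen = root(block)
print(frozen[0], frozen[1][2])  # bar 7
```

Because every reference is an offset into the same `bytes` object, the block stays valid wherever it is copied, which is the property the frozen representation relies on.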
The second thread calls `receive()` and gets the frozen data structure, which has been copied into its memory.
Ideally, this is as simple as a `memcpy()` plus a few bounds checks to see that nothing in that data structure
points outside of its memory region. Then we can continue to work with it normally.
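
The "copy, then bounds-check" step might look roughly like this in Python. The block layout here is a deliberately tiny, hypothetical one (a count, a table of offsets, and length-prefixed payloads), and `validate` and this `receive` are illustrative names rather than the VM's actual functions:

```python
import struct


def validate(block):
    """Check that every offset in the block stays inside the block.

    Hypothetical layout: uint32 count, then `count` uint32 offsets, each
    pointing at a uint32 length followed by that many payload bytes.
    """
    size = len(block)
    if size < 4:
        return False
    (count,) = struct.unpack_from("<I", block, 0)
    if 4 + 4 * count > size:
        return False  # the offset table itself does not fit
    for i in range(count):
        (offset,) = struct.unpack_from("<I", block, 4 + 4 * i)
        if offset + 4 > size:
            return False  # the length prefix would be out of bounds
        (length,) = struct.unpack_from("<I", block, offset)
        if offset + 4 + length > size:
            return False  # the payload would run past the end
    return True


def receive(incoming):
    """Copy the message into local memory, then reject it if any offset escapes."""
    block = bytes(incoming)  # stands in for memcpy() into the receiver's memory
    if not validate(block):
        raise ValueError("frozen block points outside its own memory region")
    return block


good = struct.pack("<III", 1, 8, 3) + b"abc"
bad = struct.pack("<III", 1, 100, 3) + b"abc"  # offset 100 is past the end

print(validate(good))  # True
print(validate(bad))   # False
```

The check is a single pass over the offsets and never interprets the payloads, so it stays cheap compared with a full deserialization.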
Now, a few words on how this is represented in the virtual machine right now. The virtual machine mostly has primitive
data types, such as various kinds of integers (both signed and unsigned) and arrays. Arrays have both a mutable and
an immutable representation, so when the VM holds a pointer to an array, it can determine whether mutation is prohibited.
The more interesting part is pointers: as I said above, objects laid out in a contiguous region only
point to each other with integer offsets. So you can't point directly into the inside of the region (if only because
of garbage collection issues that I will explain in a separate post).
So, the pointers within the VM are represented roughly as:
```
#include <stdint.h>

struct pointer_t {
    uint8_t tag;          // Kind of pointer, e.g. frozen vs. regular
    uint64_t offset : 56; // Use only 56 bits to fit the whole pointer into 16 bytes
    void* ptr;            // Points to the beginning of the contiguous memory block
};
```
This makes it possible to store pointers in registers and do all the normal operations with them. For example, load/store
operations check whether the pointer's tag denotes a frozen data type and, if so, the resulting object is
also returned as a "frozen pointer", just with an adjusted offset from the start of the region.
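
A minimal Python model of that behaviour, under stated assumptions: the tag value `TAG_FROZEN`, the `Pointer` class, and the `load` helper are invented for this sketch, and packing the tag into the top 8 bits of a 64-bit word is just one possible layout, not necessarily what the VM or a C compiler does:

```python
from dataclasses import dataclass

TAG_FROZEN = 1  # hypothetical tag value for frozen objects
OFFSET_BITS = 56
OFFSET_MASK = (1 << OFFSET_BITS) - 1


@dataclass(frozen=True)
class Pointer:
    tag: int     # fits in 8 bits
    offset: int  # fits in 56 bits, relative to the start of `base`
    base: bytes  # the contiguous memory block

    def pack_header(self):
        """Pack tag and offset into one 64-bit word (one possible layout)."""
        return (self.tag << OFFSET_BITS) | (self.offset & OFFSET_MASK)


def load(pointer, delta):
    """Follow a frozen pointer: same tag and base, offset moved by `delta`."""
    if pointer.tag != TAG_FROZEN:
        raise NotImplementedError("only frozen pointers are sketched here")
    new_offset = pointer.offset + delta
    if not 0 <= new_offset < len(pointer.base):
        raise IndexError("offset leaves the frozen region")
    return Pointer(pointer.tag, new_offset, pointer.base)


region = bytes(range(16))
p = Pointer(TAG_FROZEN, 0, region)
q = load(p, 5)
print(q.offset, q.base[q.offset])  # 5 5
```

The result of `load` still refers to the start of the region plus an offset, so nothing ever holds a raw address pointing into the middle of the block.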
So far, I've implemented most of the required data structures, but haven't finished the access layer completely,
so there are no transparent loads/stores yet. Stay tuned for updates.
The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).