Publish a post on frozen data types in the VM
parent 99715f69cb
commit 8b2335f3f1
1 changed file with 84 additions and 0 deletions
content/posts/vm_progress_update_frozen_data_types/note.md
Normal file
@@ -0,0 +1,84 @@
X-Date: 2023-08-05T21:16:00Z
X-Note-Id: cc2f05b8-6a95-452a-875b-887fde269c35
Subject: VM progress update: frozen data types
X-Slug: vm_progress_update_frozen_data_types

One of the things I want from the virtual machine runtime is the ability to work easily with serialized data.
In most programming languages, what you send over the network or write to files differs from the
in-memory representation. The reason is simple: CPUs may differ in
[endianness](https://en.wikipedia.org/wiki/Endianness), so the way your program represents numbers in memory can differ
between, say, little-endian x86 and a big-endian machine. So, the serialized form's binary encoding is usually "universal".
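
To make the byte-order point concrete, here is a minimal Python sketch using the standard `struct` module. The value and variable names are only illustrative; the snippet just shows that the same 32-bit integer has two different byte sequences, and that reading bytes with the wrong order silently gives a different number.

```python
import struct

value = 0x01020304

# The same 32-bit unsigned integer in the two byte orders
little = struct.pack("<I", value)  # little-endian, as used natively by x86
big = struct.pack(">I", value)     # big-endian ("network" byte order)

# Interpreting little-endian bytes as big-endian yields a different number
misread = int.from_bytes(little, "big")
```

A "universal" encoding simply fixes one of these orders in the format, so both sides agree regardless of their CPU.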

Additionally, the serialized representation is usually packed sequentially. So, people invented universal serialization
formats such as [Protocol Buffers](https://github.com/protocolbuffers/protobuf), [Thrift](https://thrift.apache.org/),
or XML/JSON (the latter two are text formats, for human readability).

I'm specifically designing the runtime to be "shared-nothing": separate worker threads cannot share any state and
communicate only through message passing. These messages are sent either within the same process or across the network/IPC,
so they have to be serialized at least to some degree.

If I approached this in the "standard" way, I'd need explicit deserialization functions that receive incoming data,
unpack it into a set of native data structures, and then work with those. But there's actually a better
way: the approach taken by [Cap’n Proto](https://capnproto.org/) avoids a separate deserialization step altogether.
The data structures it receives can be mapped directly into memory, and the language runtime then figures out how to access
individual fields.
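
As a rough illustration of the idea (not of Cap'n Proto's actual wire format), the Python sketch below reads fields straight out of a received buffer. The record layout, the `MessageView` class, and the field names are all assumptions made up for this example.

```python
import struct

# Hypothetical fixed layout (little-endian):
#   bytes 0..3   u32 id
#   bytes 4..5   u16 port
#   bytes 6..7   padding
#   bytes 8..15  f64 load
message = struct.pack("<IH2xd", 42, 8080, 0.75)


class MessageView:
    """Reads fields directly from the buffer; nothing is unpacked up front."""

    def __init__(self, buf):
        self._buf = memoryview(buf)  # no copy of the underlying bytes

    @property
    def id(self):
        return struct.unpack_from("<I", self._buf, 0)[0]

    @property
    def port(self):
        return struct.unpack_from("<H", self._buf, 4)[0]

    @property
    def load(self):
        return struct.unpack_from("<d", self._buf, 8)[0]


view = MessageView(message)
```

Each field is decoded only when it is accessed, so a receiver that needs one field never pays for the others.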

The binary format of Cap'n Proto is slightly less space-efficient: it needs more space than Thrift or Protocol Buffers.
But it makes up for that by being very fast, and being fast is exactly what I need for cross-thread interaction.

I can probably also mitigate the second problem with Cap'n Proto: the need for code generation. Because Cap'n Proto is
language-agnostic, it has to provide code generators that wrap the received data structures and expose their
fields through a high-level object interface. But since I'm writing my own runtime, I can just build all the required
data structures into the runtime.

This is what I call "frozen data structures". From the point of view of the VM, they look like normal data structures,
except they are laid out sequentially in one block of memory, and you can't mutate them. Frozen data structures can only
contain pointers into the same contiguous frozen memory space, and those pointers are not absolute addresses but integer offsets.

For example, here's some pseudocode that illustrates this:

```
# Somewhere in thread1
dict = {"foo": "bar", "baz": [5,6,7,8]}
frozen_dict = freeze(dict)
assert frozen_dict["baz"][2] == 7
send(thread2, frozen_dict)

# Somewhere in thread2
frozen_dict = receive()
assert frozen_dict["baz"][2] == 7
```

When `freeze()` is called on a data structure, the VM walks it and returns its representation
as a contiguous immutable block. This block can then be used in the same way as a regular data structure.
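
The walk described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the VM's real layout: the tags, the record encoding, and the helper names (`_write`, `root_offset`, `load`) are invented for illustration, and only integers, strings, and lists are handled (dictionaries are left out to keep it short).

```python
import struct

TAG_INT, TAG_STR, TAG_LIST = 0, 1, 2  # hypothetical type tags


def _write(buf, value):
    """Append `value` to `buf` and return the offset where its record starts."""
    if isinstance(value, int):
        off = len(buf)
        buf += struct.pack("<Bq", TAG_INT, value)
    elif isinstance(value, str):
        data = value.encode("utf-8")
        off = len(buf)
        buf += struct.pack("<BI", TAG_STR, len(data)) + data
    elif isinstance(value, list):
        # Children are written first, so their offsets are known
        children = [_write(buf, item) for item in value]
        off = len(buf)
        buf += struct.pack("<BI", TAG_LIST, len(children))
        buf += struct.pack(f"<{len(children)}I", *children)
    else:
        raise TypeError(f"cannot freeze {type(value).__name__}")
    return off


def freeze(value):
    """Return one immutable block; bytes 0..3 hold the offset of the root record."""
    buf = bytearray(4)
    struct.pack_into("<I", buf, 0, _write(buf, value))
    return bytes(buf)


def root_offset(block):
    return struct.unpack_from("<I", block, 0)[0]


def load(block, off):
    """Decode the record at `off`; a list is returned as its child offsets."""
    tag = block[off]
    if tag == TAG_INT:
        return struct.unpack_from("<q", block, off + 1)[0]
    if tag == TAG_STR:
        (n,) = struct.unpack_from("<I", block, off + 1)
        return block[off + 5 : off + 5 + n].decode("utf-8")
    if tag == TAG_LIST:
        (n,) = struct.unpack_from("<I", block, off + 1)
        return list(struct.unpack_from(f"<{n}I", block, off + 5))
    raise ValueError(f"unknown tag {tag}")


block = freeze(["bar", [5, 6, 7, 8]])
items = load(block, root_offset(block))
inner = load(block, items[1])
```

Here `load(block, inner[2])` follows two offsets inside the same block to reach the third element of the nested list, without ever rebuilding the original Python objects.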

The second thread calls `receive()` and gets the frozen data structure, which has been copied into its memory.
Ideally, this is as simple as a `memcpy()` plus a few bounds checks to make sure that nothing in the data structure
points outside of its memory region. Then we can continue to work with it normally.
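
A minimal Python sketch of such a check is shown below. It assumes a made-up layout (a tag byte followed by either an 8-byte integer or a count plus 32-bit child offsets, with the root record at offset 0); `validate` and `receive` here are hypothetical helpers, not the VM's functions.

```python
import struct

TAG_INT, TAG_LIST = 0, 2  # hypothetical type tags


def validate(block, off=0, seen=None):
    """Return True if every record reachable from `off` lies inside `block`."""
    seen = set() if seen is None else seen
    if off in seen:
        return True  # already being checked; this also stops offset cycles
    seen.add(off)
    if off < 0 or off >= len(block):
        return False
    tag = block[off]
    if tag == TAG_INT:
        return off + 9 <= len(block)  # tag byte + 8-byte integer
    if tag == TAG_LIST:
        if off + 5 > len(block):
            return False
        (n,) = struct.unpack_from("<I", block, off + 1)
        if off + 5 + 4 * n > len(block):
            return False
        children = struct.unpack_from(f"<{n}I", block, off + 5)
        return all(validate(block, child, seen) for child in children)
    return False  # unknown tag


def receive(raw):
    """Copy the incoming bytes (stand-in for memcpy) and check them before use."""
    block = bytes(raw)
    if not validate(block):
        raise ValueError("frozen block points outside of its region")
    return block


ints = struct.pack("<Bq", TAG_INT, 5) + struct.pack("<Bq", TAG_INT, 6)
good = struct.pack("<BI2I", TAG_LIST, 2, 13, 22) + ints
bad = struct.pack("<BI2I", TAG_LIST, 2, 13, 1000) + ints
```

In this sketch `receive(good)` returns the block, while `receive(bad)` raises because the second child offset points past the end of the region.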

Now, a few words on how this is represented in the virtual machine right now. The virtual machine mostly has primitive
data types, such as various integer types (both signed and unsigned), and arrays. Arrays have both mutable and
immutable representations, so when the VM holds a pointer to an array, it can determine whether mutation is prohibited.

The more interesting part is pointers: as I said above, objects laid out in a contiguous region point to each other
only through integer offsets. So you can't point directly into the inside of the region (not least because
of garbage collection issues that I'll explain in a separate post).

So, the pointers within the VM are represented roughly as:

```
#include <stdint.h>

struct pointer_t {
    uint8_t tag;
    uint64_t offset : 56; // Use only 56 bits to fit the whole pointer into 16 bytes
    void* ptr; // Points to the beginning of contiguous memory block
};
```

This makes it possible to keep pointers in registers and perform all the normal operations on them. For example, load/store
operations check whether the pointer tag denotes a frozen data type, and if so, the resulting object is
also returned as a "frozen pointer", just with an adjusted offset from the start of the region.

So far, I've implemented most of the required data structures, but haven't finished the access layer,
so there are no transparent loads/stores yet. Stay tuned for updates.

The code for the experimental version can be found [here](https://git.sr.ht/~knazarov/lisp.experimental).