Add a post about immutability

This commit is contained in:
Konstantin Nazarov 2024-07-29 01:51:27 +01:00
parent a4320cf1b3
commit 8e08e2ec07
Signed by: knazarov
GPG key ID: 4CFE0A42FA409C22

X-Date: 2024-07-28T23:57:33Z
X-Note-Id: f61289e2-7914-4903-a19f-3ae7e59f3d29
Subject: Why I'm going for immutability
X-Slug: going_for_immutability
It's time to admit that I've failed with the first iteration of my programming language.
And the reason for that is mutability. I set myself the goal of having independent
virtual machines that can exchange serialized objects, but that turned out to be more
complex than anticipated.
## Cyclic data structures
In a real program, objects may have loops. Imagine something like this in Python:
```
d = {"a": 1, "b": 2}
d["c"] = d
print(d)  # <- ???
```
Here, `d` turns out to be self-referential. Python's built-in `print` actually survives this,
because `repr` does its own cycle check and prints `{...}` for the inner reference; but try to
serialize such an object, say with `json.dumps`, and you get a circular-reference error. The same
problem hits any serializer: you either have to error out past a certain recursion depth, or
implement a complex algorithm that detects loops and deals with them by inserting special references.
Some Lisp implementations (notably, Common Lisp) actually do the latter. If you try to print
a data structure with cycles, they would intelligently handle that and insert user-friendly
references with a special form. This is how it looks:
```
#1=(1 2 3 . #1#)
```
It shows a list that contains itself as the last element (`#1#`).
This is a nice trick, but a pretty complex one, and it requires tons of additional code, probably
comparable to half of the implementation of my language. It is also not limited to printing: the
same problem occurs in other places, like doing a deep comparison of two objects.
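To give a flavor of what that machinery involves, here's a minimal sketch in Python of the same idea: track visited containers by identity and emit `#N#`-style back-references. The names and the exact output format here are illustrative, not taken from any real implementation:

```
# Cycle-aware serializer sketch: containers are tracked by id(), and a
# container seen twice is replaced by a "#N#" reference, with "#N=" marking
# the original occurrence (mirroring the Common Lisp notation above).
def serialize(obj, seen=None, labels=None):
    if seen is None:
        seen, labels = {}, {}
    oid = id(obj)
    if isinstance(obj, list):
        if oid in seen:
            labels[oid] = True          # the cycle is real; label the original
            return f"#{seen[oid]}#"
        seen[oid] = len(seen) + 1       # assign the next label number
        body = "(" + " ".join(serialize(x, seen, labels) for x in obj) + ")"
        if labels.get(oid):
            return f"#{seen[oid]}={body}"
        return body
    return repr(obj)

cyc = [1, 2, 3]
cyc.append(cyc)                          # the list now contains itself
print(serialize(cyc))                    # -> #1=(1 2 3 #1#)
```

Even this toy version has to thread bookkeeping through every recursive call, which is exactly the kind of code that multiplies across printing, comparison, and serialization.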
## Different representations of objects
I also made a second design choice that in retrospect seems suboptimal. For normal runtime
data structures, there's a "mutable" implementation (so you can update values in arrays and dictionaries),
but for data structures that arrive from the network or another VM, there is a special "frozen" set of
types.
These data types ended up creating lots of branches in the code (especially around garbage collection).
Comparing frozen data structures with normal ones required additional code and an overuse of
the "visitor" pattern.
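As a rough illustration of the kind of branching this causes, here's a Python sketch. The class names mirror the ones discussed later in this post, but the internals are made up: a frozen string is a slice into a shared buffer, a mutable one owns its characters, and even a plain equality check has to dispatch on the representation before it can touch a single character:

```
class MutableString:
    def __init__(self, chars):
        self.chars = list(chars)                   # owns its own storage

class FrozenString:
    def __init__(self, buf, off, n):
        self.buf, self.off, self.n = buf, off, n   # slice of a shared blob

def chars_of(s):
    # Every operation needs a branch like this before doing any real work.
    if isinstance(s, MutableString):
        return "".join(s.chars)
    return s.buf[s.off:s.off + s.n]

def string_equal(a, b):
    return chars_of(a) == chars_of(b)

blob = "hello world"
print(string_equal(MutableString("hello"), FrozenString(blob, 0, 5)))  # True
```

Multiply this by every operation and every pair of types, and the appeal of a single representation becomes obvious.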
Garbage collection turned out to be especially tricky. Since a "frozen" data structure arrives in one
contiguous blob, you can only "collect" it all at once. And if you hold a reference to anything inside
it (say, to a string inside a frozen list), then you hold the whole data structure in memory.
Asking the user to care about the quirks of such "network" objects is probably fine if you're dealing
with Protobuf where you're expected to get rid of the original message as fast as possible and convert
it to native objects. My implementation, however, was designed to make this process transparent, and
thus suffered from weird, hard-to-debug memory bloat in some cases.
## Tagging quirks
In many cases, writing C++ code around Lisp objects became tricky. If you have a generic "String" object,
it could be either a MutableString or a FrozenString. Because their implementations were different,
you couldn't devirtualize access to individual characters and always had to go via a generic
interface, which, of course, doesn't make it easy for the optimizer to generate concise machine code.
Doing a virtual function call, pushing parameters onto the stack, and returning, just to read one
character at a time, is definitely going to cost at least a factor of 10 in speed. Any potential
vectorization opportunities are lost.
## Unstable iterators
My dictionaries were based on B-trees, so it was rather simple to add an iterator that performs an
efficient in-order walk across the tree. But I started hitting problems when a tree was mutated at
the same time as an iterator was walking across it. Designing a tree that can withstand simultaneous
mutation and iteration turned out to require tricks that I wasn't ready to spend time on.
Same story with arrays: removing items means the iterator potentially skips ahead by the number of
deleted elements.
And finally, at some point I wanted to introduce custom iterators that users would be able to write
for their own data structures. Implementing an iterator that is resilient to modifications of the
underlying data structure seems like a major pitfall that I'm sure 99% of users would just miss.
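Python itself shows how hostile mutation-during-iteration is; its built-in dict iterators don't even try to cope and simply raise:

```
d = {"a": 1, "b": 2}
try:
    for k in d:
        d["c"] = 3        # mutate while an iterator is live
except RuntimeError as e:
    print(e)              # dictionary changed size during iteration
```

That's a reasonable escape hatch for a mature language, but detecting the mutation still requires version counters in every container, which is yet more bookkeeping.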
## Garbage collection
Even though "normal" garbage collection algorithms handle cycles well, I did a bit of a read-up on
how incremental collectors work. Object mutability is one of the big chunks of work such an
implementation has to tackle: if you want to run your garbage collector in parallel with the
program, you need to catch mutations that create pointers back into the part of the heap the
collector has already scanned. This very quickly blows up the codebase, and I'm pretty sure a
one-man project can hardly afford it.
## Solution: immutability
So yeah, this is where I'm at now. I decided to change direction radically and not allow mutation
of objects at all. Once you create a dictionary, the only way to insert a value is to create a new
dictionary with that value in place. So, this essentially becomes:
```
dict = {"a": 1, "b": 2}
dict2 = dict.insert("c", 3)
```
Here, `dict.insert()` will return the changed version of the dictionary, while the value that `dict` holds will
remain intact. Even if you try to do this:
```
dict = {"a": 1, "b": 2}
dict2 = dict.insert("c", dict)
```
You won't end up with a cyclic data structure, because `dict` references the old version. You may
ask "how expensive is it to create new objects on every change?", and that's a fair question. With
a naive implementation it is expensive. But fortunately, humanity has invented tree-based data
structures that let you reuse the parts of the previous version that didn't change. For
dictionaries, you can use immutable red-black trees or immutable B-trees. And for arrays and
strings you can use the "rope" data structure.
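The trick behind all of these is structural sharing: on insert you copy only the path from the root to the changed node, and every untouched subtree is shared with the old version. A minimal sketch with a plain (unbalanced) binary search tree, just to show the sharing; a real implementation would use a balanced tree:

```
class Node:
    __slots__ = ("key", "val", "left", "right")
    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right

def insert(node, key, val):
    # Returns a new root. Only the nodes on the path to `key` are copied;
    # every other subtree is shared, untouched, with the old tree.
    if node is None:
        return Node(key, val)
    if key < node.key:
        return Node(node.key, node.val, insert(node.left, key, val), node.right)
    if key > node.key:
        return Node(node.key, node.val, node.left, insert(node.right, key, val))
    return Node(key, val, node.left, node.right)   # replace existing key

t1 = insert(insert(insert(None, "b", 2), "a", 1), "c", 3)
t2 = insert(t1, "d", 4)
print(t2.left is t1.left)  # True: the untouched subtree is shared
```

So an insert into a tree of n elements copies O(log n) nodes instead of all of them, and old versions stay valid for free.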
In the last two days I've hand-rolled another implementation in C++ that is very vaguely based on
the previous one, but contains only a subset of the data structures and is way simpler. It already
has a "reader" and a "writer" (essentially, a parser and a pretty-printer). I'll publish the code
in a week or so.