Add a post about immutability
This commit is contained in:
parent
a4320cf1b3
commit
8e08e2ec07
1 changed files with 122 additions and 0 deletions
122
content/posts/going_for_immutability/note.md
Normal file
122
content/posts/going_for_immutability/note.md
Normal file
|
@ -0,0 +1,122 @@
|
||||||
|
X-Date: 2024-07-28T23:57:33Z
|
||||||
|
X-Note-Id: f61289e2-7914-4903-a19f-3ae7e59f3d29
|
||||||
|
Subject: Why I'm going for immutability
|
||||||
|
X-Slug: going_for_immutability
|
||||||
|
|
||||||
|
It's time to admit that I've failed with a first iteration of my programming language.
|
||||||
|
And the reason for that is mutability. I've set myself a goal to have independent
|
||||||
|
virtual machines that can exchange serialized objects, but that turned out to be more
|
||||||
|
complex than anticipated.
|
||||||
|
|
||||||
|
## Cyclic data structures
|
||||||
|
|
||||||
|
In a real program, objects may have loops. Imagine something like this in Python:
|
||||||
|
|
||||||
|
```
|
||||||
|
dict = {"a": 1, "b": 2}
|
||||||
|
dict["c"] = dict
|
||||||
|
print(dict) # <- ???
|
||||||
|
```
|
||||||
|
|
||||||
|
Here, `dict` turns out to be self-referential. If you try to print it in Python, you'd get
|
||||||
|
a recursion error. Same thing happens if you try to serialize objects with loops. To solve
|
||||||
|
this you need to either error-out on reaching a certain recursion depth, or implement a
|
||||||
|
complex algorithm for detecting loops and dealing with them by inserting special references.
|
||||||
|
|
||||||
|
Some Lisp implementations (notably, Common Lisp) actually do the latter. If you try to print
|
||||||
|
a data structure with cycles, they would intelligently handle that and insert user-friendly
|
||||||
|
references with a special form. This is how it looks like:
|
||||||
|
|
||||||
|
```
|
||||||
|
#1=(1 2 3 . #1#)
|
||||||
|
```
|
||||||
|
|
||||||
|
It shows a list that contains itself as the last element (`#1#`).
|
||||||
|
|
||||||
|
This is a nice trick, but a pretty complex one and requires tons of additional code, probably
|
||||||
|
comparable with half of the implementation of my language. It is also not limited to just printing:
|
||||||
|
same problem occurs in other places, like doing a deep comparison of two objects.
|
||||||
|
|
||||||
|
|
||||||
|
## Different representations of objects
|
||||||
|
|
||||||
|
I've also made a second design choice which in retrospect seems suboptimal. For normal runtime
|
||||||
|
data structures, there's a "mutable" implementation (so you can update values in arrays and dictionaries),
|
||||||
|
but for data structures that arrive from the network or another VM, there is a special "frozen" set of
|
||||||
|
types.
|
||||||
|
|
||||||
|
These data types ended up creating lots of branches in the code (especially around garbage collection).
|
||||||
|
Comparison between frozen data structures and normal data structures would require additional code,
|
||||||
|
and overuse of the "visitor" pattern.
|
||||||
|
|
||||||
|
Garbage collection turned out to be especially tricky. Since a "frozen" data structure arrives in one
|
||||||
|
contiguous blob, you can only "collect" it all at once. And if you hold a reference to anything inside
|
||||||
|
it (say, to a string inside a frozen list), then you hold the whole data structure in memory.
|
||||||
|
|
||||||
|
Asking the user to care about the quirks of such "network" objects is probably fine if you're dealing
|
||||||
|
with Protobuf where you're expected to get rid of the original message as fast as possible and convert
|
||||||
|
it to native objects. My implementation, however, was designed to make this process transparent, and
|
||||||
|
thus sufferend from weird hard to debug memory bloat in some cases.
|
||||||
|
|
||||||
|
## Tagging quirks
|
||||||
|
|
||||||
|
In many cases, writing C++ code around Lisp objects became tricky. If you have a generic "String" object,
|
||||||
|
it could either be a MutableString or FrozenString. Because their implementations were different, you
|
||||||
|
can't devirtualize access to individual characters, and always have to go via a generic interface. Which,
|
||||||
|
of course, doesn't make it especially simple for the optimizer to generate concise machine code.
|
||||||
|
|
||||||
|
Doing a virtual function call, passing parameters to the stack, and returning back, just for reading one
|
||||||
|
character at a time is definitely going to reduce speed by at least a factor of 10. Any potential
|
||||||
|
vectorization opportunities will be lost.
|
||||||
|
|
||||||
|
## Unstable iterators
|
||||||
|
|
||||||
|
My dictionaries were based on b-trees, so it was rather simple to add an iterator that performs an efficient
|
||||||
|
in-order walk across the tree. But I started hitting problems in cases where trees were mutated at the same
|
||||||
|
time as an interator was walking across one. Designing a tree that can withstand mutation and iteration at
|
||||||
|
the same time turned out to require tricks that I wasn't ready to spend time on.
|
||||||
|
|
||||||
|
Same story with arrays. Removing an item from an array means that the iterator potentially shifts further
|
||||||
|
by the number of deleted elements.
|
||||||
|
|
||||||
|
And finally, at some point in time I wanted to introduce custom iterators that users will be able to write
|
||||||
|
for their data structures. Implementing an iterator that is resilient to modifications of the underlying
|
||||||
|
data structures seems a major pitfall that I'm sure 99% of users will just miss.
|
||||||
|
|
||||||
|
## Garbage collection
|
||||||
|
|
||||||
|
Even though "normal" algorithms for garbage collection work well with cycles, I did a bit of a read-up on
|
||||||
|
how incremental collectors work. Object mutability is one of the big chunks of work that an implementation
|
||||||
|
should tackle. If you want to run your garbage collector in parallel with the program, you need to make sure
|
||||||
|
to prevent objects from pointing to the back at the active heap. It very quickly blows up the codebase and
|
||||||
|
I'm pretty sure the one-man project can hardly afford this.
|
||||||
|
|
||||||
|
## Solution: immutability
|
||||||
|
|
||||||
|
So yeah, this is where I'm at now. I decided to change the direction in a radical way, and to not allow
|
||||||
|
mutation of objects at all. As soon as you create a dictionary, the only way to insert a value there is
|
||||||
|
to create a new dictionary with this value in place. So, this essentially becomes:
|
||||||
|
|
||||||
|
```
|
||||||
|
dict = {"a": 1, "b": 2}
|
||||||
|
dict2 = dict.insert("c", 3)
|
||||||
|
```
|
||||||
|
|
||||||
|
Here, `dict.insert()` will return the changed version of the dictionary, while the value that `dict` holds will
|
||||||
|
remain intact. Even if you try to do this:
|
||||||
|
|
||||||
|
|
||||||
|
```
|
||||||
|
dict = {"a": 1, "b": 2}
|
||||||
|
dict2 = dict.insert("c", dict)
|
||||||
|
```
|
||||||
|
|
||||||
|
You won't end up with a cyclic data structure, because `dict` references the old version. You may ask "how
|
||||||
|
expensive it is to create new objects all the time on any change?" and you'll be right. Under a naive implementation
|
||||||
|
this is expensive. But fortunately, humanity has invented tree-based data structures, where you can reuse part of
|
||||||
|
the previous data structure (given that it doesn't change). For dictionaries, you can use immutable red-black trees
|
||||||
|
or immutable B-trees. And for arrays and strings you can use the "rope" data structure.
|
||||||
|
|
||||||
|
In the last 2 days I've hand-rolled another implementation in C++ which is very vaguely based on the previous one,
|
||||||
|
but only contains a subset of data structures, and is way simpler. It already contains a "reader" and a "writer"
|
||||||
|
(essentially - parser and pretty-printer). I'll publish the code in a week or so.
|
Loading…
Reference in a new issue