Add a post about immutability

This commit is contained in:
Konstantin Nazarov 2024-07-29 01:51:27 +01:00
parent a4320cf1b3
commit 8e08e2ec07
Signed by: knazarov
GPG key ID: 4CFE0A42FA409C22

X-Date: 2024-07-28T23:57:33Z
X-Note-Id: f61289e2-7914-4903-a19f-3ae7e59f3d29
Subject: Why I'm going for immutability
X-Slug: going_for_immutability
It's time to admit that I've failed with the first iteration of my programming language.
And the reason for that is mutability. I set myself the goal of having independent
virtual machines that can exchange serialized objects, but that turned out to be more
complex than anticipated.
## Cyclic data structures
In a real program, objects may have loops. Imagine something like this in Python:
```
d = {"a": 1, "b": 2}
d["c"] = d
print(d)  # <- ???
```
Here, `d` turns out to be self-referential. Python's built-in `print` actually survives this,
because `repr` does its own cycle check and prints `{...}` for the inner reference; but try to
serialize such an object, say with `json.dumps`, and you get a circular-reference error. The same
problem hits any serializer: you either have to error out past a certain recursion depth, or
implement a complex algorithm that detects loops and deals with them by inserting special references.
Some Lisp implementations (notably, Common Lisp) actually do the latter. If you try to print
a data structure with cycles, they would intelligently handle that and insert user-friendly
references with a special form. This is how it looks:
```
#1=(1 2 3 . #1#)
```
It shows a list that contains itself as the last element (`#1#`).
This is a nice trick, but a pretty complex one, and it requires tons of additional code, probably
comparable to half of the implementation of my language. It is also not limited to printing: the
same problem occurs in other places, like doing a deep comparison of two objects.
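To give a flavor of what that machinery involves, here's a minimal sketch in Python of the same idea: track visited containers by identity and emit `#N#`-style back-references. The names and the exact output format here are illustrative, not taken from any real implementation:

```
# Cycle-aware serializer sketch: containers are tracked by id(), and a
# container seen twice is replaced by a "#N#" reference, with "#N=" marking
# the original occurrence (mirroring the Common Lisp notation above).
def serialize(obj, seen=None, labels=None):
    if seen is None:
        seen, labels = {}, {}
    oid = id(obj)
    if isinstance(obj, list):
        if oid in seen:
            labels[oid] = True          # the cycle is real; label the original
            return f"#{seen[oid]}#"
        seen[oid] = len(seen) + 1       # assign the next label number
        body = "(" + " ".join(serialize(x, seen, labels) for x in obj) + ")"
        if labels.get(oid):
            return f"#{seen[oid]}={body}"
        return body
    return repr(obj)

cyc = [1, 2, 3]
cyc.append(cyc)                          # the list now contains itself
print(serialize(cyc))                    # -> #1=(1 2 3 #1#)
```

Even this toy version has to thread bookkeeping through every recursive call, which is exactly the kind of code that multiplies across printing, comparison, and serialization.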
## Different representations of objects
I also made a second design choice that in retrospect seems suboptimal. For normal runtime
data structures, there's a "mutable" implementation (so you can update values in arrays and dictionaries),
but for data structures that arrive from the network or another VM, there is a special "frozen" set of
types.
These data types ended up creating lots of branches in the code (especially around garbage collection).
Comparing frozen data structures with normal ones required additional code and an overuse of
the "visitor" pattern.
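As a rough illustration of the kind of branching this causes, here's a Python sketch. The class names mirror the ones discussed later in this post, but the internals are made up: a frozen string is a slice into a shared buffer, a mutable one owns its characters, and even a plain equality check has to dispatch on the representation before it can touch a single character:

```
class MutableString:
    def __init__(self, chars):
        self.chars = list(chars)                   # owns its own storage

class FrozenString:
    def __init__(self, buf, off, n):
        self.buf, self.off, self.n = buf, off, n   # slice of a shared blob

def chars_of(s):
    # Every operation needs a branch like this before doing any real work.
    if isinstance(s, MutableString):
        return "".join(s.chars)
    return s.buf[s.off:s.off + s.n]

def string_equal(a, b):
    return chars_of(a) == chars_of(b)

blob = "hello world"
print(string_equal(MutableString("hello"), FrozenString(blob, 0, 5)))  # True
```

Multiply this by every operation and every pair of types, and the appeal of a single representation becomes obvious.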
Garbage collection turned out to be especially tricky. Since a "frozen" data structure arrives in one
contiguous blob, you can only "collect" it all at once. And if you hold a reference to anything inside
it (say, to a string inside a frozen list), then you hold the whole data structure in memory.
Asking the user to care about the quirks of such "network" objects is probably fine if you're dealing
with Protobuf where you're expected to get rid of the original message as fast as possible and convert
it to native objects. My implementation, however, was designed to make this process transparent, and
thus suffered from weird, hard-to-debug memory bloat in some cases.
## Tagging quirks
In many cases, writing C++ code around Lisp objects became tricky. If you have a generic "String" object,
it could be either a MutableString or a FrozenString. Because their implementations were different,
you couldn't devirtualize access to individual characters and always had to go via a generic
interface, which, of course, doesn't make it easy for the optimizer to generate concise machine code.
Doing a virtual function call, pushing parameters onto the stack, and returning, just to read one
character at a time, is definitely going to cost at least a factor of 10 in speed. Any potential
vectorization opportunities are lost.
## Unstable iterators
My dictionaries were based on B-trees, so it was rather simple to add an iterator that performs an
efficient in-order walk across the tree. But I started hitting problems when a tree was mutated at
the same time as an iterator was walking across it. Designing a tree that can withstand simultaneous
mutation and iteration turned out to require tricks that I wasn't ready to spend time on.
Same story with arrays: removing items means the iterator potentially skips ahead by the number of
deleted elements.
And finally, at some point I wanted to introduce custom iterators that users would be able to write
for their own data structures. Implementing an iterator that is resilient to modifications of the
underlying data structure seems like a major pitfall that I'm sure 99% of users would just miss.
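Python itself shows how hostile mutation-during-iteration is; its built-in dict iterators don't even try to cope and simply raise:

```
d = {"a": 1, "b": 2}
try:
    for k in d:
        d["c"] = 3        # mutate while an iterator is live
except RuntimeError as e:
    print(e)              # dictionary changed size during iteration
```

That's a reasonable escape hatch for a mature language, but detecting the mutation still requires version counters in every container, which is yet more bookkeeping.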
## Garbage collection
Even though "normal" garbage collection algorithms handle cycles well, I did a bit of a read-up on
how incremental collectors work. Object mutability is one of the big chunks of work such an
implementation has to tackle: if you want to run your garbage collector in parallel with the
program, you need to catch mutations that create pointers back into the part of the heap the
collector has already scanned. This very quickly blows up the codebase, and I'm pretty sure a
one-man project can hardly afford it.
## Solution: immutability
So yeah, this is where I'm at now. I decided to change direction radically and not allow mutation
of objects at all. Once you create a dictionary, the only way to insert a value is to create a new
dictionary with that value in place. So, this essentially becomes:
```
dict = {"a": 1, "b": 2}
dict2 = dict.insert("c", 3)
```
Here, `dict.insert()` will return the changed version of the dictionary, while the value that `dict` holds will
remain intact. Even if you try to do this:
```
dict = {"a": 1, "b": 2}
dict2 = dict.insert("c", dict)
```
You won't end up with a cyclic data structure, because `dict` references the old version. You may
ask "how expensive is it to create new objects on every change?", and that's a fair question. With
a naive implementation it is expensive. But fortunately, humanity has invented tree-based data
structures that let you reuse the parts of the previous version that didn't change. For
dictionaries, you can use immutable red-black trees or immutable B-trees. And for arrays and
strings you can use the "rope" data structure.
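The trick behind all of these is structural sharing: on insert you copy only the path from the root to the changed node, and every untouched subtree is shared with the old version. A minimal sketch with a plain (unbalanced) binary search tree, just to show the sharing; a real implementation would use a balanced tree:

```
class Node:
    __slots__ = ("key", "val", "left", "right")
    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right

def insert(node, key, val):
    # Returns a new root. Only the nodes on the path to `key` are copied;
    # every other subtree is shared, untouched, with the old tree.
    if node is None:
        return Node(key, val)
    if key < node.key:
        return Node(node.key, node.val, insert(node.left, key, val), node.right)
    if key > node.key:
        return Node(node.key, node.val, node.left, insert(node.right, key, val))
    return Node(key, val, node.left, node.right)   # replace existing key

t1 = insert(insert(insert(None, "b", 2), "a", 1), "c", 3)
t2 = insert(t1, "d", 4)
print(t2.left is t1.left)  # True: the untouched subtree is shared
```

So an insert into a tree of n elements copies O(log n) nodes instead of all of them, and old versions stay valid for free.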
In the last two days I've hand-rolled another implementation in C++ that is very vaguely based on
the previous one, but contains only a subset of the data structures and is way simpler. It already
has a "reader" and a "writer" (essentially, a parser and a pretty-printer). I'll publish the code
in a week or so.