X-Date: 2024-07-28T23:57:33Z
X-Note-Id: f61289e2-7914-4903-a19f-3ae7e59f3d29
Subject: Why I'm going for immutability
X-Slug: going_for_immutability

It's time to admit that I've failed with the first iteration of my programming language, and the reason for that is mutability. I set myself the goal of having independent virtual machines that can exchange serialized objects, but that turned out to be more complex than anticipated.

## Cyclic data structures

In a real program, objects may form loops. Imagine something like this in Python:

```
d = {"a": 1, "b": 2}
d["c"] = d
print(d)  # <- ???
```

Here, `d` turns out to be self-referential. CPython's `print` actually detects the loop and shows `{'a': 1, 'b': 2, 'c': {...}}`, but a naive recursive printer would recurse forever, and serialization hits the same wall: `json.dumps`, for example, raises a circular-reference error. To solve this, you either have to error out on reaching a certain recursion depth, or implement a fairly complex algorithm that detects loops and deals with them by inserting special references.

Some Lisp implementations (notably Common Lisp) actually do the latter. If you print a data structure with cycles (with `*print-circle*` enabled), they handle it intelligently and insert user-friendly references using special reader syntax. This is how it looks:

```
#1=(1 2 3 . #1#)
```

This is a list whose tail (`#1#`) refers back to the list itself. A nice trick, but a pretty complex one, requiring tons of additional code, probably comparable to half the implementation of my language. And it's not limited to printing: the same problem occurs in other places, like doing a deep comparison of two objects.

## Different representations of objects

I also made a second design choice that in retrospect seems suboptimal. For normal runtime data structures there's a "mutable" implementation (so you can update values in arrays and dictionaries), but data structures that arrive from the network or from another VM use a special "frozen" set of types.
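To illustrate the kind of duplication this causes, here's a minimal Python sketch (hypothetical names, not the actual C++ code) of two parallel string representations and a generic comparison that has to branch on both:

```python
# Hypothetical sketch of the two-representation design: MutableString
# owns its characters, FrozenString is a view into a received blob.
class MutableString:
    def __init__(self, text):
        self.chars = list(text)           # independently owned, updatable

    def char_at(self, i):
        return self.chars[i]

class FrozenString:
    def __init__(self, blob, start, length):
        self.blob = blob                  # the whole contiguous network blob
        self.start = start
        self.length = length

    def char_at(self, i):
        return self.blob[self.start + i]  # keeps the entire blob alive

def string_equal(a, b):
    # Generic comparison: works on any mix of the two representations,
    # but only through the slow char-by-char interface.
    n = a.length if isinstance(a, FrozenString) else len(a.chars)
    m = b.length if isinstance(b, FrozenString) else len(b.chars)
    if n != m:
        return False
    return all(a.char_at(i) == b.char_at(i) for i in range(n))
```

Note how `FrozenString.char_at` keeps the entire received blob reachable: holding one short string pins the whole message in memory.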
These data types ended up creating lots of branches in the code (especially around garbage collection). Comparison between frozen and normal data structures required additional code and an overuse of the "visitor" pattern.

Garbage collection turned out to be especially tricky. Since a "frozen" data structure arrives in one contiguous blob, you can only "collect" it all at once. And if you hold a reference to anything inside it (say, to a string inside a frozen list), then you hold the whole data structure in memory.

Asking the user to care about the quirks of such "network" objects is probably fine if you're dealing with Protobuf, where you're expected to get rid of the original message as fast as possible and convert it to native objects. My implementation, however, was designed to make this process transparent, and thus suffered from weird, hard-to-debug memory bloat in some cases.

## Tagging quirks

In many cases, writing C++ code around Lisp objects became tricky. If you have a generic "String" object, it could be either a MutableString or a FrozenString. Because their implementations were different, you can't devirtualize access to individual characters and always have to go through a generic interface. That, of course, doesn't make it easy for the optimizer to generate concise machine code. Doing a virtual function call, passing parameters on the stack, and returning back, just to read one character at a time, is going to reduce speed by at least a factor of 10, and any potential vectorization opportunities are lost.

## Unstable iterators

My dictionaries were based on B-trees, so it was rather simple to add an iterator that performs an efficient in-order walk across the tree. But I started hitting problems in cases where a tree was mutated at the same time as an iterator was walking across it.
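Python's built-in dictionaries sidestep this problem by simply refusing to continue; a minimal demonstration:

```python
# CPython detects mutation during iteration and raises RuntimeError
# rather than risk a corrupted or inconsistent walk.
d = {"a": 1, "b": 2, "c": 3}
try:
    for key in d:
        if key == "b":
            del d[key]  # invalidates the live iterator
except RuntimeError as err:
    print("iteration aborted:", err)
```

Refusing to iterate is the easy way out: it protects the runtime, but pushes the burden of restarting the walk onto the user.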
Designing a tree that can withstand mutation and iteration at the same time turned out to require tricks that I wasn't ready to spend time on. Same story with arrays: removing an item from an array means the iterator potentially shifts forward by the number of deleted elements. And finally, at some point I wanted to introduce custom iterators that users would be able to write for their own data structures. Implementing an iterator that is resilient to modifications of the underlying data structure is a major pitfall that I'm sure 99% of users would just miss.

## Garbage collection

"Normal" garbage collection algorithms handle cycles just fine, but I did a bit of reading on how incremental collectors work, and object mutability is one of the big chunks of work an implementation has to tackle. If you want to run the garbage collector in parallel with the program, you need write barriers that catch mutations making already-scanned objects point back into the part of the heap still being collected. This very quickly blows up the codebase, and I'm pretty sure a one-man project can hardly afford it.

## Solution: immutability

So yeah, this is where I'm at now. I decided to change direction radically and not allow mutation of objects at all. As soon as you create a dictionary, the only way to insert a value into it is to create a new dictionary with that value in place. So, this essentially becomes:

```
dict = {"a": 1, "b": 2}
dict2 = dict.insert("c", 3)
```

Here, `dict.insert()` returns the changed version of the dictionary, while the value that `dict` holds remains intact. Even if you try to do this:

```
dict = {"a": 1, "b": 2}
dict2 = dict.insert("c", dict)
```

you won't end up with a cyclic data structure, because the value stored under `"c"` is the old version of `dict`, which doesn't contain itself. You may ask "how expensive is it to create new objects on every change?", and you'd be right to: under a naive implementation this is expensive.
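These semantics can be modeled in plain Python with a naive full-copy insert (the helper name `insert` is mine, not the language's actual API), which also shows where the cost goes: every update copies all existing entries.

```python
# Naive model of immutable insert: every update makes a full shallow
# copy, so a single insert costs O(n) in the size of the dictionary.
def insert(d, key, value):
    new_d = dict(d)   # copy all existing entries
    new_d[key] = value
    return new_d

d = {"a": 1, "b": 2}
d2 = insert(d, "c", 3)
assert d == {"a": 1, "b": 2}             # original version is untouched
assert d2 == {"a": 1, "b": 2, "c": 3}

# "Inserting a dictionary into itself" cannot create a cycle:
# the stored value is the old version, which doesn't contain "c".
d3 = insert(d, "c", d)
assert d3["c"] is d and "c" not in d3["c"]
```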
But fortunately, humanity has invented tree-based data structures in which you can reuse parts of the previous version (as long as they don't change). For dictionaries, you can use immutable red-black trees or immutable B-trees, and for arrays and strings you can use the "rope" data structure.

In the last 2 days I've hand-rolled another implementation in C++ which is very vaguely based on the previous one, but only contains a subset of the data structures and is way simpler. It already contains a "reader" and a "writer" (essentially, a parser and a pretty-printer). I'll publish the code in a week or so.
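The sharing idea can be sketched with a toy persistent binary search tree in Python (unbalanced for brevity; a real implementation would use a balanced B-tree or red-black tree as mentioned above). An insert copies only the nodes on the path from the root to the new key; every untouched subtree is shared between versions:

```python
# Minimal sketch of structural sharing: a persistent (unbalanced)
# binary search tree. Insert never mutates existing nodes; it builds
# new nodes along one root-to-leaf path and shares the rest.
class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # replace existing key

def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

t1 = None
for k in ["m", "d", "s", "b"]:
    t1 = insert(t1, k, k.upper())
t2 = insert(t1, "z", "Z")      # new version; t1 is still intact
assert lookup(t1, "z") is None and lookup(t2, "z") == "Z"
assert t2.left is t1.left      # whole left subtree is shared, not copied
```

Because nodes are never mutated, sharing subtrees between `t1` and `t2` is safe, and neither printing nor deep comparison can ever run into a cycle.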