diff --git a/content/posts/going_for_immutability/note.md b/content/posts/going_for_immutability/note.md new file mode 100644 index 0000000..edf8aa8 --- /dev/null +++ b/content/posts/going_for_immutability/note.md @@ -0,0 +1,122 @@ +X-Date: 2024-07-28T23:57:33Z +X-Note-Id: f61289e2-7914-4903-a19f-3ae7e59f3d29 +Subject: Why I'm going for immutability +X-Slug: going_for_immutability + +It's time to admit that I've failed with a first iteration of my programming language. +And the reason for that is mutability. I've set myself a goal to have independent +virtual machines that can exchange serialized objects, but that turned out to be more +complex than anticipated. + +## Cyclic data structures + +In a real program, objects may have loops. Imagine something like this in Python: + +``` +dict = {"a": 1, "b": 2} +dict["c"] = dict +print(dict) # <- ??? +``` + +Here, `dict` turns out to be self-referential. If you try to print it in Python, you'd get +a recursion error. Same thing happens if you try to serialize objects with loops. To solve +this you need to either error-out on reaching a certain recursion depth, or implement a +complex algorithm for detecting loops and dealing with them by inserting special references. + +Some Lisp implementations (notably, Common Lisp) actually do the latter. If you try to print +a data structure with cycles, they would intelligently handle that and insert user-friendly +references with a special form. This is how it looks like: + +``` +#1=(1 2 3 . #1#) +``` + +It shows a list that contains itself as the last element (`#1#`). + +This is a nice trick, but a pretty complex one and requires tons of additional code, probably +comparable with half of the implementation of my language. It is also not limited to just printing: +same problem occurs in other places, like doing a deep comparison of two objects. + + +## Different representations of objects + +I've also made a second design choice which in retrospect seems suboptimal. For normal runtime +data structures, there's a "mutable" implementation (so you can update values in arrays and dictionaries), +but for data structures that arrive from the network or another VM, there is a special "frozen" set of +types. + +These data types ended up creating lots of branches in the code (especially around garbage collection). +Comparison between frozen data structures and normal data structures would require additional code, +and overuse of the "visitor" pattern. + +Garbage collection turned out to be especially tricky. Since a "frozen" data structure arrives in one +contiguous blob, you can only "collect" it all at once. And if you hold a reference to anything inside +it (say, to a string inside a frozen list), then you hold the whole data structure in memory. + +Asking the user to care about the quirks of such "network" objects is probably fine if you're dealing +with Protobuf where you're expected to get rid of the original message as fast as possible and convert +it to native objects. My implementation, however, was designed to make this process transparent, and +thus sufferend from weird hard to debug memory bloat in some cases. + +## Tagging quirks + +In many cases, writing C++ code around Lisp objects became tricky. If you have a generic "String" object, +it could either be a MutableString or FrozenString. Because their implementations were different, you +can't devirtualize access to individual characters, and always have to go via a generic interface. Which, +of course, doesn't make it especially simple for the optimizer to generate concise machine code. + +Doing a virtual function call, passing parameters to the stack, and returning back, just for reading one +character at a time is definitely going to reduce speed by at least a factor of 10. Any potential +vectorization opportunities will be lost. + +## Unstable iterators + +My dictionaries were based on b-trees, so it was rather simple to add an iterator that performs an efficient +in-order walk across the tree. But I started hitting problems in cases where trees were mutated at the same +time as an interator was walking across one. Designing a tree that can withstand mutation and iteration at +the same time turned out to require tricks that I wasn't ready to spend time on. + +Same story with arrays. Removing an item from an array means that the iterator potentially shifts further +by the number of deleted elements. + +And finally, at some point in time I wanted to introduce custom iterators that users will be able to write +for their data structures. Implementing an iterator that is resilient to modifications of the underlying +data structures seems a major pitfall that I'm sure 99% of users will just miss. + +## Garbage collection + +Even though "normal" algorithms for garbage collection work well with cycles, I did a bit of a read-up on +how incremental collectors work. Object mutability is one of the big chunks of work that an implementation +should tackle. If you want to run your garbage collector in parallel with the program, you need to make sure +to prevent objects from pointing to the back at the active heap. It very quickly blows up the codebase and +I'm pretty sure the one-man project can hardly afford this. + +## Solution: immutability + +So yeah, this is where I'm at now. I decided to change the direction in a radical way, and to not allow +mutation of objects at all. As soon as you create a dictionary, the only way to insert a value there is +to create a new dictionary with this value in place. So, this essentially becomes: + +``` +dict = {"a": 1, "b": 2} +dict2 = dict.insert("c", 3) +``` + +Here, `dict.insert()` will return the changed version of the dictionary, while the value that `dict` holds will +remain intact. Even if you try to do this: + + +``` +dict = {"a": 1, "b": 2} +dict2 = dict.insert("c", dict) +``` + +You won't end up with a cyclic data structure, because `dict` references the old version. You may ask "how +expensive it is to create new objects all the time on any change?" and you'll be right. Under a naive implementation +this is expensive. But fortunately, humanity has invented tree-based data structures, where you can reuse part of +the previous data structure (given that it doesn't change). For dictionaries, you can use immutable red-black trees +or immutable B-trees. And for arrays and strings you can use the "rope" data structure. + +In the last 2 days I've hand-rolled another implementation in C++ which is very vaguely based on the previous one, +but only contains a subset of data structures, and is way simpler. It already contains a "reader" and a "writer" +(essentially - parser and pretty-printer). I'll publish the code in a week or so.