Add a post about immutability

2024-07-29 01:51:27 +01:00 · 2024-07-29 01:51:27 +01:00 · 8e08e2ec07
commit 8e08e2ec07
parent a4320cf1b3
1 changed files with 122 additions and 0 deletions
--- a/content/posts/going_for_immutability/note.md
+++ b/content/posts/going_for_immutability/note.md
@ -0,0 +1,122 @@
 X-Date: 2024-07-28T23:57:33Z
 X-Note-Id: f61289e2-7914-4903-a19f-3ae7e59f3d29
 Subject: Why I'm going for immutability
 X-Slug: going_for_immutability
 It's time to admit that I've failed with the first iteration of my programming language.
 And the reason for that is mutability. I've set myself a goal to have independent
 virtual machines that can exchange serialized objects, but that turned out to be more
 complex than anticipated.
 ## Cyclic data structures
 In a real program, objects may have loops. Imagine something like this in Python:
 ```
 dict = {"a": 1, "b": 2}
 dict["c"] = dict
 print(dict) # <- ???
 ```
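 Here, `dict` turns out to be self-referential. Python's built-in printer detects this particular case
 and prints `{...}` in place of the inner reference, but a naive recursive printer would keep descending
 until it hits a recursion error. Same thing happens if you try to serialize objects with loops
 (`json.dumps`, for example, refuses with a "circular reference" error). To solve this you need to either
 error out on reaching a certain recursion depth, or implement a complex algorithm for detecting loops
 and dealing with them by inserting special references.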
 Some Lisps (notably, Common Lisp) actually do the latter. If you try to print a data structure with
 cycles (with `*print-circle*` enabled), they handle it intelligently and insert user-friendly
 references using a special syntax. This is what it looks like:
 ```
 #1=(1 2 3 . #1#)
 ```
 It shows a list whose tail points back to the list itself (`#1#`), so walking it yields 1, 2, 3 over and over.
 This is a nice trick, but a pretty complex one that requires tons of additional code, probably
 comparable to half of the implementation of my language. It is also not limited to just printing:
 the same problem occurs in other places, like doing a deep comparison of two objects.
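A minimal sketch of the loop detection described above, in Python and assuming only dicts and lists. The `render` function and the `<cycle>` marker are made-up names for illustration; unlike the Lisp printer, this version only marks the back-reference and does not label the object it points to.

```python
def render(obj, _path=None):
    """Render nested dicts/lists, replacing back-references with a marker."""
    # _path holds the ids of the containers we are currently inside of.
    path = _path if _path is not None else set()
    if isinstance(obj, (dict, list)):
        if id(obj) in path:
            return "<cycle>"  # we are already printing this object
        path.add(id(obj))
        try:
            if isinstance(obj, dict):
                body = ", ".join(
                    f"{render(k, path)}: {render(v, path)}" for k, v in obj.items()
                )
                return "{" + body + "}"
            return "[" + ", ".join(render(v, path) for v in obj) + "]"
        finally:
            # Leaving the container: a second, non-cyclic reference is fine.
            path.discard(id(obj))
    return repr(obj)


d = {"a": 1, "b": 2}
d["c"] = d
print(render(d))  # {'a': 1, 'b': 2, 'c': <cycle>}
```

Even this toy version needs extra bookkeeping on every container visit, and a deep comparison or a serializer would need its own copy of the same logic.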
 ## Different representations of objects
 I've also made a second design choice which in retrospect seems suboptimal. For normal runtime
 data structures, there's a "mutable" implementation (so you can update values in arrays and dictionaries),
 but for data structures that arrive from the network or another VM, there is a special "frozen" set of
 types.
 These data types ended up creating lots of branches in the code (especially around garbage collection).
 Comparison between frozen data structures and normal data structures would require additional code,
 and overuse of the "visitor" pattern.
 Garbage collection turned out to be especially tricky. Since a "frozen" data structure arrives in one
 contiguous blob, you can only "collect" it all at once. And if you hold a reference to anything inside
 it (say, to a string inside a frozen list), then you hold the whole data structure in memory.
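The retention problem can be sketched in Python. `FrozenBlob` and `FrozenString` here are hypothetical stand-ins, not the actual types from the C++ implementation, and the example leans on CPython freeing objects as soon as the last reference goes away.

```python
import gc
import weakref


class FrozenBlob:
    """Stand-in for a message that arrived as one contiguous buffer."""

    def __init__(self, payload: bytes):
        self.payload = payload


class FrozenString:
    """A view into a blob: two offsets plus a reference to the owner."""

    def __init__(self, owner: FrozenBlob, start: int, end: int):
        self.owner = owner
        self.start = start
        self.end = end

    def __str__(self) -> str:
        return self.owner.payload[self.start:self.end].decode("utf-8")


blob = FrozenBlob(b"hello" + bytes(1_000_000))
alive = weakref.ref(blob)       # lets us observe when the blob is freed
name = FrozenString(blob, 0, 5)

del blob
gc.collect()
print(str(name), alive() is not None)  # hello True

del name
gc.collect()
print(alive() is None)  # True
```

Five bytes of useful data keep the whole megabyte alive until the last view is dropped.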
 Asking the user to care about the quirks of such "network" objects is probably fine if you're dealing
 with Protobuf where you're expected to get rid of the original message as fast as possible and convert
 it to native objects. My implementation, however, was designed to make this process transparent, and
 thus suffered from weird, hard-to-debug memory bloat in some cases.
 ## Tagging quirks
 In many cases, writing C++ code around Lisp objects became tricky. If you have a generic "String" object,
 it could be either a MutableString or a FrozenString. Because their implementations were different, you
 can't devirtualize access to individual characters, and always have to go through a generic interface,
 which, of course, makes it hard for the optimizer to generate concise machine code.
 Doing a virtual function call, passing parameters on the stack, and returning, just to read one
 character at a time, can easily cost an order of magnitude in speed. Any potential
 vectorization opportunities are lost as well.
 ## Unstable iterators
 My dictionaries were based on B-trees, so it was rather simple to add an iterator that performs an efficient
 in-order walk across the tree. But I started hitting problems in cases where a tree was mutated while
 an iterator was walking across it. Designing a tree that can withstand mutation and iteration at
 the same time turned out to require tricks that I wasn't ready to spend time on.
 Same story with arrays. Removing an item from an array shifts the remaining elements, so an iterator that
 keeps its old position silently skips ahead by the number of deleted elements.
 And finally, at some point I wanted to introduce custom iterators that users would be able to write
 for their own data structures. Implementing an iterator that is resilient to modifications of the underlying
 data structure seems like a major pitfall that I'm sure 99% of users would just miss.
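The array problem is easy to reproduce in Python, whose list iterator just keeps an index. This is only an illustration of the failure mode, not code from my language.

```python
items = [1, 2, 3, 4]
seen = []
for x in items:
    seen.append(x)
    if x == 2:
        items.remove(x)  # shifts 3 and 4 one slot to the left
print(seen)  # [1, 2, 4]

# Python's dict takes the other route and refuses to continue.
table = {"a": 1, "b": 2}
try:
    for key in table:
        table[key + "x"] = 0  # inserting while iterating
except RuntimeError as err:
    print("refused:", err)
```

The element `3` is never visited, and nothing warns you about it. The dictionary version at least fails loudly, but that check is one more thing every custom iterator would have to implement.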
 ## Garbage collection
 Even though "normal" garbage collection algorithms work well with cycles, I did a bit of a read-up on
 how incremental collectors work. Object mutability is one of the big chunks of work that an implementation
 has to tackle. If you want to run your garbage collector in parallel with the program, you need to make sure
 that the program can't make already-scanned objects point back into the part of the heap that is still being
 collected. It very quickly blows up the codebase, and I'm pretty sure a one-man project can hardly afford this.
 ## Solution: immutability
 So yeah, this is where I'm at now. I decided to change the direction in a radical way, and to not allow
 mutation of objects at all. As soon as you create a dictionary, the only way to insert a value there is
 to create a new dictionary with this value in place. So, this essentially becomes:
 ```
 dict = {"a": 1, "b": 2}
 dict2 = dict.insert("c", 3)
 ```
 Here, `dict.insert()` will return the changed version of the dictionary, while the value that `dict` holds will
 remain intact. Even if you try to do this:
 ```
 dict = {"a": 1, "b": 2}
 dict2 = dict.insert("c", dict)
 ```
 You won't end up with a cyclic data structure, because `dict` references the old version. You may ask "how
 expensive is it to create new objects all the time on any change?" and you'd be right to ask. Under a naive
 implementation this is expensive. But fortunately, humanity has invented tree-based data structures, where you
 can reuse the parts of the previous data structure that didn't change. For dictionaries, you can use immutable
 red-black trees or immutable B-trees. And for arrays and strings you can use the "rope" data structure.
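Here is a rough sketch of that kind of sharing in Python. To keep it short it assumes a plain unbalanced binary search tree instead of a red-black tree or a B-tree; `Node`, `insert` and `lookup` are illustrative names, not the API of my implementation.

```python
from typing import Any, NamedTuple, Optional


class Node(NamedTuple):
    key: Any
    value: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value):
    """Return a new tree with key set; the input tree is left untouched."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        # Copy only this node; the right subtree is shared as-is.
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)


def lookup(node, key, default=None):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return default


old = insert(insert(insert(None, "b", 2), "a", 1), "d", 4)
new = insert(old, "c", 3)
print(lookup(old, "c"), lookup(new, "c"))  # None 3
print(new.left is old.left)  # True

nested = insert(old, "self", old)
print(lookup(nested, "self") is old, lookup(old, "self"))  # True None
```

Only the nodes on the path from the root to the changed key are copied, and everything else is shared between the two versions. The last two lines show that storing a tree inside "itself" just stores the old version, so no cycle appears.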
 In the last 2 days I've hand-rolled another implementation in C++ which is very vaguely based on the previous one,
 but only contains a subset of data structures, and is way simpler. It already contains a "reader" and a "writer"
 (essentially, a parser and a pretty-printer). I'll publish the code in a week or so.