Add an article about the C++ rewrite

2024-04-01 23:33:18 +01:00 · 2024-04-01 23:33:18 +01:00 · a265fb6141
commit a265fb6141
parent c698499c51
1 changed files with 125 additions and 0 deletions
--- a/content/posts/cpp_rewrite_of_my_language/note.md
+++ b/content/posts/cpp_rewrite_of_my_language/note.md
@@ -0,0 +1,125 @@
+X-Date: 2024-04-01T22:00:00Z
+X-Note-Id: 72f1a46c-d19a-413d-84a8-46be1cfa575d
+Subject: I've ported my language from C to C++ (a story of error handling)
+X-Slug: cpp_rewrite_of_my_language
+
+I've been writing my programming language in pure C for quite some time, but recently
+I decided to port it to C++. The key problem that made me do so is error handling.
+While I was working on the bytecode virtual machine, it was all relatively simple. The
+virtual machine is just a large switch over the opcodes with relatively trivial
+functions for basic arithmetic operations, jumps and conditions.
+
+As I started to work on the parser and runtime data structures, the code quickly became
+hard to reason about. This is in part because I decided to gracefully handle memory allocation
+errors. To understand the issue, let's consider a simple function, `assoc_get`, which takes
+an indexable object and an index, and returns the value at that index:
+
+```
+Value obj = mk_array(10);
+Value index = mk_i64(5);
+Value val = mk_i64(42);
+
+// Writes "42" at array index 5
+assoc_set(obj, index, val);
+
+Value res = assoc_get(obj, index);
+```
+
+Now, there are 2 possible error cases here:
+
+- The index can be out of range
+- We couldn't allocate memory for a temporary value on the garbage-collected heap
+
+In both of these cases, what should be the value of `res`, and how would we know that an error
+has occurred? One option is to set an `errno`-style global and return some sort of
+"placeholder" that doesn't mean anything (e.g. `nil`). Another is using "out parameters" like this:
+
+```
+Value res = mk_nil();
+ErrorCode rc = assoc_get(&res, obj, index);
+if (rc) {
+   // clean up and return
+}
+```
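
For comparison, the `errno`-style option could be sketched roughly as follows. This is only an illustration: `last_error`, the `Value` layout, and the fixed `storage` array are hypothetical stand-ins, and `assoc_get` is simplified to take a plain index.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the interpreter's types (not the real ones).
enum ErrorCode { ERR_NONE = 0, ERR_OUT_OF_RANGE };

struct Value {
  bool is_nil;
  std::int64_t i64;
};

// Global error slot that plays the role of errno.
static ErrorCode last_error = ERR_NONE;

// Toy backing store, only here to keep the sketch self-contained.
static const std::int64_t storage[4] = {10, 20, 30, 40};

Value mk_nil() { return Value{true, 0}; }
Value mk_i64(std::int64_t v) { return Value{false, v}; }

// Simplified assoc_get: on failure it records the error globally and
// returns a nil placeholder that carries no information by itself.
Value assoc_get(std::int64_t index) {
  if (index < 0 || index >= 4) {
    last_error = ERR_OUT_OF_RANGE;
    return mk_nil();
  }
  last_error = ERR_NONE;
  return mk_i64(storage[index]);
}
```

The caller has to remember to inspect `last_error` after every call, and nothing in the function signature forces it to do so.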
+
+There are also more obscure techniques that some interpreters use, like calling `setjmp()` somewhere at
+the entry point of the virtual machine loop and then calling `longjmp()` if there's an error down the line.
+This works in some cases, but it easily leads to resource leaks, because `longjmp()` skips any cleanup
+code in the functions it jumps over.
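
A small sketch of how such a leak can happen, assuming a hypothetical `run_vm` entry point and a counter that stands in for real resources (memory, file handles and so on):

```cpp
#include <cassert>
#include <csetjmp>

static std::jmp_buf vm_entry;   // hypothetical recovery point
static int open_resources = 0;  // stands in for memory, files, etc.

static void acquire() { ++open_resources; }
static void release() { --open_resources; }

static void failing_op() {
  std::longjmp(vm_entry, 1);  // jump straight back into run_vm()
}

static void parse_something() {
  acquire();
  failing_op();  // never returns normally
  release();     // skipped, so the resource stays open
}

// Returns how many resources are still open once the VM has stopped.
int run_vm() {
  open_resources = 0;
  if (setjmp(vm_entry) != 0) {
    return open_resources;  // reached via longjmp after the error
  }
  parse_something();
  return open_resources;
}
```

Here `run_vm()` reports one resource still open, because the `release()` call in `parse_something` is never reached. In C++ the situation is worse still: jumping over objects with non-trivial destructors is undefined behavior.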
+
+What would be really awesome is if C had some sort of sum types, or the ability to return two values from
+a function: a result and an error (much like Zig's error unions or Go's multiple return values).
+
+Initially I tried to bolt on sum types by introducing separate structs like:
+
+```
+typedef struct {
+  Value result;
+  ErrorCode error;
+} ValueOrError;
+```
+
+Following this approach, I refactored the code so that every function that can return an error
+returns such a sum type, like this:
+
+```
+ValueOrError res = assoc_get(obj, index);
+if (res.error) {
+  // clean up and return
+}
+
+// do something with res.result
+```
+
+This worked, but it required too much ceremony and cluttered the code. For every distinct type
+returned from a fallible function, I had to create a separate "wrapper" struct that acts as its result type.
+
+Eventually it led to a state where working on the codebase was no longer fun. Instead of implementing the logic,
+I had to be very verbose all the time. Worst of all, refactoring the codebase became too taxing. Since
+error handling code needed to know the underlying structure of objects, every time I changed an interface, things
+started to break in too many places at once (and often at runtime).
+
+So finally, I gave up and decided to use C++ where you can implement a `Result` sum type. My reasoning was that I
+can still go pretty minimal and disable exceptions, RTTI, and probably even at some point get rid of the standard
+library. But what I would get in return is sane and clean error handling.
+
+Imagine something like this:
+
+```
+Result<Value> sum(Value array) {
+  size_t size = TRY(assoc_size(array));
+  int64_t res = 0;
+  for (size_t i = 0; i < size; ++i) {
+     Value val = TRY(assoc_get(array, mk_i64(i)));
+     res += TRY(val.get_i64());
+  }
+
+  return mk_i64(res);
+}
+```
+
+The interesting part here is the `TRY` macro. It would automatically try to unpack the `Result` object. If
+it contains an error - it would return the error from the current function. If not - the result of the expression
+would be the unpacked content of the `Result`.
+
+The implementation of the `TRY` macro is pretty straightforward:
+
+```
+#define TRY(m)                                       \
+  (({                                                \
+     auto ___res = (m);                              \
+     if (!___res.has_value()) return ___res.error(); \
+     std::move(___res);                              \
+   }).release_value())
+```
+
+The most interesting part here is `({ ... })`. This is a so-called "statement expression", a
+GCC and Clang extension that lets a block of statements be used as a single expression. The
+value of the last statement is treated as the result of the whole expression. Because the body is made
+of ordinary statements, you can `return` from within the expression, which is otherwise not possible
+(since `return` is a statement).
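
To make the pieces concrete, here is a minimal, self-contained sketch of a `Result` type that such a macro could work with. It is not the actual implementation: the `Error` enum, the requirement that `T` is default-constructible, and the helper functions `get_at` and `sum_first` are all assumptions made for illustration. It needs GCC or Clang because of the statement expression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical error type; the real one is not shown in the article.
enum class Error { OutOfRange, OutOfMemory };

// Minimal Result<T>: holds either a value or an error.
// Assumes T is default-constructible to keep the sketch short.
template <typename T>
class Result {
 public:
  Result(T value) : value_(std::move(value)), ok_(true) {}
  Result(Error error) : error_(error), ok_(false) {}

  bool has_value() const { return ok_; }
  Error error() const { return error_; }
  T release_value() { return std::move(value_); }

 private:
  T value_{};
  Error error_{};
  bool ok_;
};

// Same shape as the macro above; the temporary is renamed because names
// containing a double underscore are reserved for the implementation.
#define TRY(m)                                         \
  (({                                                  \
     auto try_res = (m);                               \
     if (!try_res.has_value()) return try_res.error(); \
     std::move(try_res);                               \
   }).release_value())

// Hypothetical stand-in for assoc_get: bounds-checked vector access.
Result<std::int64_t> get_at(const std::vector<std::int64_t>& xs, std::size_t i) {
  if (i >= xs.size()) return Error::OutOfRange;
  return xs[i];
}

// Sums the first n elements; a failure in get_at propagates through TRY.
Result<std::int64_t> sum_first(const std::vector<std::int64_t>& xs, std::size_t n) {
  std::int64_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += TRY(get_at(xs, i));
  }
  return total;
}
```

Calling `sum_first` on a three-element vector with `n = 3` yields a value, while `n = 5` yields `Error::OutOfRange` from the first out-of-range access. The early `return` inside `TRY` works because `Result<std::int64_t>` is implicitly constructible from `Error`.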
+
+If you use this macro, the code becomes easy to read. You immediately see which functions can fail, and
+can bubble up errors concisely to the place that knows how to deal with them. It is almost as easy to use
+as exceptions, with the added benefit of being explicit.
+
+The main reason I want to avoid exceptions is that I would like to make my language embeddable, and
+exceptions don't play well when mixed with other language runtimes, which generally cannot handle a
+C++ exception unwinding through their stack frames.
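
As a rough illustration of why explicit results fit embedding, here is a sketch of a C-compatible entry point. Every name in it (`lang_div`, `DivResult`, `div_or_error`) is hypothetical and not part of the actual project. Errors cross the boundary as plain integers, so the host needs no knowledge of C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical internal error and result types (not from the article).
enum class Error : int { None = 0, DivisionByZero = 1 };

struct DivResult {
  std::int64_t value;
  Error error;
};

// Internal C++ code reports failure through its return value.
static DivResult div_or_error(std::int64_t a, std::int64_t b) {
  if (b == 0) return {0, Error::DivisionByZero};
  return {a / b, Error::None};
}

// Hypothetical embedding API: C linkage, int status code, out parameter.
// On failure *out is left untouched and a non-zero code is returned.
extern "C" int lang_div(std::int64_t a, std::int64_t b, std::int64_t* out) noexcept {
  DivResult r = div_or_error(a, b);
  if (r.error != Error::None) return static_cast<int>(r.error);
  *out = r.value;
  return 0;
}
```

A host written in C, or any language with a C FFI, can call `lang_div` and check the status code without ever dealing with stack unwinding.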