From a265fb61413813eab0cac72e93ac6ced890890eb Mon Sep 17 00:00:00 2001 From: Konstantin Nazarov Date: Mon, 1 Apr 2024 23:33:18 +0100 Subject: [PATCH] Add an article about the C++ rewrite --- .../posts/cpp_rewrite_of_my_language/note.md | 125 ++++++++++++++++++ 1 file changed, 125 insertions(+) create mode 100644 content/posts/cpp_rewrite_of_my_language/note.md diff --git a/content/posts/cpp_rewrite_of_my_language/note.md b/content/posts/cpp_rewrite_of_my_language/note.md new file mode 100644 index 0000000..29ff555 --- /dev/null +++ b/content/posts/cpp_rewrite_of_my_language/note.md @@ -0,0 +1,125 @@ +X-Date: 2024-04-01T22:00:00Z +X-Note-Id: 72f1a46c-d19a-413d-84a8-46be1cfa575d +Subject: I've ported my language from C to C++ (a story of error handling) +X-Slug: cpp_rewrite_of_my_language + +I've been writing my programming language in pure C for quite some time, but recently +I decided to port it to C++. The key problem that made me do so is error handling. +While I was working on the bytecode virtual machine, it was all relatively simple. The +virtual machine is just a large switch over the opcodes with relatively trivial +functions for basic arithmetic operations, jumps and conditions. + +As I started to work on the parser and runtime data structures, the code quickly became +hard to reason about. This is in part because I decided to gracefully handle memory allocation +errors. 
To understand the issue, let's consider a simple function, `assoc_get`, which takes +an indexable object and returns the value at a given index: + +``` +Value obj = mk_array(10); +Value index = mk_i64(5); +Value val = mk_i64(42); + +// Writes "42" at array index 5 +assoc_set(obj, index, val); + +Value res = assoc_get(obj, index); +``` + +Now, there are 2 possible error cases here: + +- The index can be out of range +- We couldn't allocate memory for a temporary value on the garbage-collected heap + +In both of these cases, what should be the value of `res` and how would we know that an error +has occurred? One of the options to deal with this is setting an `errno` and returning some sort of +"placeholder" that doesn't mean anything (e.g. `nil`). Another is using "out parameters" like this: + +``` +Value res = mk_nil(); +ErrorCode rc = assoc_get(&res, obj, index); +if (rc) { + // clean up and return +} +``` + +There are also more obscure ways that some interpreters use, like doing `setjmp()` somewhere at +the entry point of the virtual machine loop, and then `longjmp()` if there's an error down the line. +This works in some cases, but it easily leads to resource leaks. + +What would be really awesome is if C had some sort of sum types, or the ability to return two values from +a function - a result and an error (pretty much like Zig or Go both do). + +Initially I tried to bolt on the sum types by introducing separate structs like: + +``` +struct ValueOrError { + Value result; + ErrorCode error; +}; +``` + +Following this approach, I refactored the code so that all functions that can return an error would +return such a sum type. Like this: + +``` +ValueOrError res = assoc_get(obj, index); +if (res.error) { + // clean up and return +} + +// do something with res.result +``` + +This worked, but it required too much ceremony and cluttered the code. 
Now for every separate type that would +be returned from a function, I had to create a "wrapper" type that essentially implements a respective result type. + +Eventually it led to a state where working on the codebase was no longer fun. Instead of implementing the logic, +I had to be very verbose all the time. Worst of all, refactoring the codebase became too taxing. Since +error handling code needed to know the underlying structure of objects, every time I changed an interface, things +started to break in too many places at once (and often at runtime). + +So finally, I gave up and decided to use C++ where you can implement a `Result` sum type. My reasoning was that I +can still go pretty minimal and disable exceptions, RTTI, and probably even at some point get rid of the standard +library. But what I would get in return is sane and clean error handling. + +Imagine something like this: + +``` +Result sum(Value array) { + size_t size = TRY(assoc_size(array)); + int res = 0; + for (size_t i = 0; i < size; ++i) { + Value val = TRY(assoc_get(array, mk_i64(i))); + res += TRY(val.get_i64()); + } + + return mk_i64(res); +} +``` + +The interesting part here is the `TRY` macro. It would automatically try to unpack the `Result` object. If +it contains an error - it would return the error from the current function. If not - the result of the expression +would be the unpacked content of the `Result`. + +The implementation of the `TRY` macro is pretty straightforward: + +``` +#define TRY(m) \ + (({ \ + auto ___res = (m); \ + if (!___res.has_value()) return ___res.error(); \ + std::move(___res); \ + }).release_value()) +``` + +The most interesting part here is `({ ... })`. This is a so-called "compound statement expression". It's a +GCC and Clang extension that allows you to have one expression that consists of multiple statements. The +value of the last one is what would be treated as a result of the expression. 
This is what allows you to +use `return` from within the expression, which is otherwise not possible (since `return` is a statement). + +If you use this macro, the code becomes easy to read. You immediately see which functions can fail, and +can bubble up errors concisely to the place that knows how to deal with them. It is almost as easy to use +as exceptions, with the added benefit of being explicit. + +The reason I want to avoid exceptions is mainly that I would like to make my language embeddable, and +exceptions don't play well when you mix them with other language runtimes.
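
To make the `TRY` pattern from this post concrete, here is a minimal self-contained sketch. It is an illustration under assumptions, not the actual code of the language: `Result<T>`, `ErrorCode`, `get_at` and `sum_first` are hypothetical names made up for the example, and the macro's temporary is renamed to `try_res_` only to avoid a reserved identifier. It needs GCC or Clang, since it relies on the statement expression extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical error codes; the real set is not shown in the article.
enum class ErrorCode { OutOfRange, OutOfMemory };

// Minimal Result<T>: holds either a value or an error code.
// Assumes T is default-constructible to keep the sketch short.
template <typename T>
class Result {
public:
    Result(T value) : value_(std::move(value)), has_value_(true) {}
    Result(ErrorCode error) : error_(error), has_value_(false) {}

    bool has_value() const { return has_value_; }
    ErrorCode error() const { return error_; }

    // Moves the stored value out; only valid when has_value() is true.
    T release_value() {
        assert(has_value_);
        return std::move(value_);
    }

private:
    T value_{};
    ErrorCode error_{};
    bool has_value_;
};

// Same shape as the macro in the article: on error, return it from the
// enclosing function; otherwise the expression yields the unpacked value.
#define TRY(m)                                            \
    (({                                                   \
        auto try_res_ = (m);                              \
        if (!try_res_.has_value()) return try_res_.error(); \
        std::move(try_res_);                              \
    }).release_value())

// Hypothetical fallible accessor: fails when the index is out of range.
Result<std::int64_t> get_at(const std::vector<std::int64_t>& items,
                            std::size_t index) {
    if (index >= items.size()) return ErrorCode::OutOfRange;
    return items[index];
}

// Sums the first `count` items; any error from get_at bubbles up via TRY.
Result<std::int64_t> sum_first(const std::vector<std::int64_t>& items,
                               std::size_t count) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += TRY(get_at(items, i));
    }
    return total;
}
```

Calling `sum_first` on a three-element vector with `count` set to 3 yields a `Result` holding the sum, while a `count` of 4 yields `ErrorCode::OutOfRange` without any explicit error check inside the loop.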