Add an article about the C++ rewrite

Konstantin Nazarov 2024-04-01 23:33:18 +01:00
parent c698499c51
commit a265fb6141
Signed by: knazarov
GPG key ID: 4CFE0A42FA409C22

X-Date: 2024-04-01T22:00:00Z
X-Note-Id: 72f1a46c-d19a-413d-84a8-46be1cfa575d
Subject: I've ported my language from C to C++ (a story of error handling)
X-Slug: cpp_rewrite_of_my_language
I've been writing my programming language in pure C for quite some time, but recently
I decided to port it to C++. The key problem that made me do so is error handling.
While I was working on the bytecode virtual machine, it was all relatively simple. The
virtual machine is just a large switch over the opcodes with relatively trivial
functions for basic arithmetic operations, jumps and conditions.
As I started to work on the parser and runtime data structures, the code quickly became
hard to reason about. This is in part because I decided to gracefully handle memory allocation
errors. To understand the issue, let's consider a simple function, `assoc_get`, which takes
an indexable object and returns the value at a given index:
```
Value obj = mk_array(10);
Value index = mk_i64(5);
Value val = mk_i64(42);
// Writes "42" at array index 5
assoc_set(obj, index, val);
Value res = assoc_get(obj, index);
```
Now, there are 2 possible error cases here:
- The index can be out of range
- We couldn't allocate memory for a temporary value on the garbage-collected heap
In both of these cases, what should the value of `res` be, and how would we know that an error
has occurred? One option is to set an `errno` and return some sort of
"placeholder" that doesn't mean anything (e.g. `nil`). Another is to use "out parameters" like this:
```
Value res = mk_nil();
ErrorCode rc = assoc_get(&res, obj, index);
if (rc) {
    // clean up and return
}
```
There are also more obscure approaches that some interpreters use, like calling `setjmp()` somewhere at
the entry point of the virtual machine loop, and then `longjmp()` if there's an error down the line.
This works in some cases, but it easily leads to resource leaks, because the jump skips any cleanup code
in the functions between the two points.
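To make the leak concrete, here is a minimal sketch of the `setjmp()`/`longjmp()` pattern. All names (`vm_entry`, `run_op`, `failing_op`, `vm_run`, `live_buffers`) are made up for illustration and are not taken from my interpreter; the counter only exists to make the leak observable.

```cpp
#include <csetjmp>
#include <cstdlib>

// Hypothetical names throughout; this is an illustration, not interpreter code.
static std::jmp_buf vm_entry;   // recovery point, set at the VM entry
static int live_buffers = 0;    // number of allocations not yet freed

// "Raises" an error by jumping straight back to the recovery point.
static void failing_op() {
    std::longjmp(vm_entry, 1);
}

static void run_op() {
    void* scratch = std::malloc(64);
    if (scratch == nullptr) return;
    ++live_buffers;

    failing_op();               // never returns to this function

    // The cleanup below is skipped, so the buffer is leaked.
    std::free(scratch);
    --live_buffers;
}

// Returns 0 on success, 1 if an error unwound the stack via longjmp.
int vm_run() {
    if (setjmp(vm_entry) != 0) {
        return 1;               // we arrive here after longjmp
    }
    run_op();
    return 0;
}
```

After `vm_run()` reports the error, `live_buffers` is still 1: nothing between the `longjmp()` call and the `setjmp()` point got a chance to release what it owned.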
What would be really awesome is if C had some sort of sum types, or the ability to return two values from
a function: a result and an error (as Zig does with error unions and Go with multiple return values).
Initially I tried to bolt on sum types by introducing separate structs like:
```
typedef struct {
    Value result;
    ErrorCode error;
} ValueOrError;
```
Following this approach, I refactored the code so that all functions that can fail would
return such a type. Like this:
```
ValueOrError res = assoc_get(obj, index);
if (res.error) {
    // clean up and return
}
// do something with res.result
```
This worked, but it required too much ceremony and cluttered the code. For every distinct type that
a function could return, I had to create a "wrapper" type that essentially implements the corresponding result type.
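To give a feel for the ceremony, here is a rough sketch of how this pattern grows. The types and functions (`SizeOrError`, `I64OrError`, `I64Array`, `array_size`, `array_get`, `array_sum`) are hypothetical stand-ins, not the real ones from the codebase, and the code is written in a C-like subset of C++.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical error codes and helpers, for illustration only.
enum ErrorCode { ERR_OK = 0, ERR_OUT_OF_RANGE = 1 };

// One hand-written wrapper per return type.
struct SizeOrError { size_t result; ErrorCode error; };
struct I64OrError { int64_t result; ErrorCode error; };

struct I64Array { const int64_t* data; size_t len; };

SizeOrError array_size(I64Array a) {
    return SizeOrError{a.len, ERR_OK};
}

I64OrError array_get(I64Array a, size_t i) {
    if (i >= a.len) return I64OrError{0, ERR_OUT_OF_RANGE};
    return I64OrError{a.data[i], ERR_OK};
}

// Sums all elements. Every call needs its own check, and an error coming
// from one wrapper type has to be re-wrapped by hand into another.
I64OrError array_sum(I64Array a) {
    SizeOrError size = array_size(a);
    if (size.error) return I64OrError{0, size.error};

    int64_t total = 0;
    for (size_t i = 0; i < size.result; ++i) {
        I64OrError item = array_get(a, i);
        if (item.error) return item;
        total += item.result;
    }
    return I64OrError{total, ERR_OK};
}
```

Even in this tiny example, most of `array_sum` is error plumbing rather than logic, and each new return type means another wrapper struct.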
Eventually it led to a state where working on the codebase was no longer fun. Instead of implementing the logic,
I had to be very verbose all the time. Worst of all, refactoring the codebase became too taxing. Since
error handling code needed to know the underlying structure of objects, every time I changed interfaces, things
started to break in too many places at once (and often at runtime).
So finally, I gave up and decided to use C++, where you can implement a `Result` sum type. My reasoning was that I
can still go pretty minimal and disable exceptions, RTTI, and probably even at some point get rid of the standard
library. But what I would get in return is sane and clean error handling.
Imagine something like this:
```
Result<Value> sum(Value array) {
    size_t size = TRY(assoc_size(array));
    int64_t res = 0;
    for (size_t i = 0; i < size; ++i) {
        Value val = TRY(assoc_get(array, mk_i64(i)));
        res += TRY(val.get_i64());
    }
    return mk_i64(res);
}
```
The interesting part here is the `TRY` macro. It would automatically try to unpack the `Result` object. If
it contains an error - it would return the error from the current function. If not - the result of the expression
would be the unpacked content of the `Result`.
The implementation of the `TRY` macro is pretty straightforward:
```
#define TRY(m) \
(({ \
auto ___res = (m); \
if (!___res.has_value()) return ___res.error(); \
std::move(___res); \
}).release_value())
```
The most interesting part here is `({ ... })`. This is a so-called "statement expression": a compound
statement used as an expression. It's a GCC and clang extension that allows you to have one expression that
consists of multiple statements. The value of the last one is treated as the result of the expression.
This is what allows you to call `return` from within the expression, which is otherwise not possible
(since `return` is a statement).
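For illustration, here is a minimal sketch of a `Result` type that such a macro could work with. It is not the actual implementation from my codebase: the `ErrorCode` values and the `get_at`/`sum_first` helpers are assumptions, the sketch requires `T` to be default-constructible, and the macro's temporary is named `try_res_` here only to stay clear of identifiers reserved in C++. It relies on statement expressions, so it needs GCC or clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical error codes; the real set is an assumption.
enum class ErrorCode { Ok, OutOfRange, OutOfMemory };

// Minimal Result<T>: holds either a value or an error code.
// Assumes T is default-constructible to keep the sketch short.
template <typename T>
class Result {
public:
    Result(T value) : value_(std::move(value)), error_(ErrorCode::Ok) {}
    Result(ErrorCode error) : value_(), error_(error) {}

    bool has_value() const { return error_ == ErrorCode::Ok; }
    ErrorCode error() const { return error_; }
    T release_value() { return std::move(value_); }

private:
    T value_;
    ErrorCode error_;
};

// Same shape as the macro above (GCC/clang statement expression).
#define TRY(m)                                                   \
    (({                                                          \
        auto try_res_ = (m);                                     \
        if (!try_res_.has_value()) return try_res_.error();      \
        std::move(try_res_);                                     \
    }).release_value())

// Hypothetical helper: bounds-checked element access.
Result<int64_t> get_at(const std::vector<int64_t>& xs, size_t i) {
    if (i >= xs.size()) return ErrorCode::OutOfRange;
    return xs[i];
}

// Sums the first n elements, bubbling up the first error it meets.
Result<int64_t> sum_first(const std::vector<int64_t>& xs, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += TRY(get_at(xs, i));
    }
    return total;
}
```

Because `Result<T>` is implicitly constructible from both `T` and `ErrorCode`, the `return try_res_.error();` inside the macro converts to whatever `Result` the enclosing function returns. With this sketch, `sum_first({1, 2, 3}, 3)` holds the value 6, while `sum_first({1, 2, 3}, 5)` holds `ErrorCode::OutOfRange`.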
If you use this macro, the code becomes easy to read. You immediately see which functions can fail, and
can bubble up errors concisely to the place that knows how to deal with them. It is almost as easy to use
as exceptions, with the added benefit of being explicit.
The main reason I want to avoid exceptions is that I would like to make my language embeddable, and
exceptions don't play well when you mix them with different language runtimes.