Sunday, November 11, 2012

A simple Mark & Sweep garbage collector

Last Friday, I woke up wondering "how hard would it really be to write my own garbage collector"?
Now, I have an answer.

Required Knowledge

  • In order to understand this post, you will need moderate programming knowledge.
  • You should understand how to code in Java (or a similar language) and C or very low-level C++.
  • Basic assembly concepts will be needed as well (e.g. you should understand what a processor register is).
  • Understanding the x86 architecture will also help, but it's not strictly required.
  • Understanding how the JVM or .NET runtimes work will also help, but it's not required either.
  • Basic compiler theory knowledge will help, although this prototype is not technically a compiler.

What is a Garbage Collector?

Pretty much all applications need to store data somewhere. Most programming languages have the concept of a variable - or, at the very least, the concept of a function argument - and those have to be stored somewhere. The two major options are CPU registers (which can only hold small amounts of data at any given time) and computer memory.
In most cases, there are three major memory regions where data can be stored:
  • For global data, it might be stored in dedicated memory sections (when talking about this data, you may see the names "data section" and "bss section", but I won't cover these details here).
  • For local data, needed only for the duration of a function (e.g. function arguments and local function variables), the preferred choice is to store data in the stack. However, stack variables shouldn't be too numerous nor too big, or else bad things (like a stack overflow) will happen.
  • For everything else: data that isn't in a global variable, but can't just be tossed away when the function returns, we store it elsewhere in the heap.
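The three regions can be seen side by side in a short snippet (the variable names here are mine, purely for illustration):

```cpp
#include <cstdlib>

int global_counter = 0;              // global data: lives in a data section

int sum_regions() {
    int local = 42;                  // local data: lives on the stack
    // heap data: must be explicitly allocated and freed by the programmer
    int* heap_value = (int*) std::malloc(sizeof(int));
    *heap_value = local + global_counter;
    int result = *heap_value;
    std::free(heap_value);
    return result;
}
```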
Unfortunately, heap memory requires special care, unlike global and stack memory which don't usually need much attention from the programmer.
Heap memory isn't "just there to be used" like the other two storage options. Instead, we must tell the operating system when we need more memory (allocate memory) and when we no longer need it (deallocate or free memory).

In the old days, heap memory was manually allocated and deallocated by the programmer as needed.

#include <stdlib.h>

struct foo {
    int x, y;
};

int main() {
    //First we allocate memory in the heap
    struct foo* bar = malloc(sizeof(struct foo));

    //Then we use it
    bar->x = 2;
    bar->y = 3;

    //Then we deallocate it
    free(bar);

    return 0;
}

So far, so good. But it turns out that things aren't always this simple.
Often, heap memory will be around for a long time, be used in many portions of the program, and then nobody knows who is supposed to deallocate it.
Sometimes, the program might try to deallocate memory multiple times, or it might try to use memory after it has been deallocated, resulting in serious problems ranging from weird, unpredictable behavior to program crashes.
On the other hand, some variable might no longer be needed and be forgotten by the program. But it's still there - the program merely forgot to deallocate it. If this happens with big variables or too often, these no-longer-needed chunks of memory gradually accumulate, and we end up with small programs consuming massive amounts of memory while only needing a small fraction of it. We call that a memory leak - an annoying and problematic type of bug.
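A minimal sketch of such a leak: every loop iteration overwrites the only pointer to the previous block, so nothing can ever free it again. The alive_blocks counter is an illustrative device of mine, not a real leak detector.

```cpp
#include <cstdlib>

// Tracks how many allocations are still alive (never freed).
static int alive_blocks = 0;

void* tracked_malloc(std::size_t n) { ++alive_blocks; return std::malloc(n); }
void  tracked_free(void* p)         { --alive_blocks; std::free(p); }

void leaky_loop() {
    int* data = nullptr;
    for (int i = 0; i < 1000; ++i) {
        data = (int*) tracked_malloc(1024); // previous block is now unreachable
        data[0] = i;
    }
    tracked_free(data); // only the very last block can still be freed
}
```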
Nowadays, there are many ways to ensure proper heap memory usage. Programming languages such as C++ offer multiple mechanisms to deal with heap memory safely (such as std::unique_ptr).
Other programming languages, such as Java, C# and JavaScript, use an entirely different approach: they take the responsibility for heap deallocation away from the programmer entirely.
Instead of relying on the programmer to deallocate memory, these languages allow (or, in fact, force) the programmer to let the computer automatically take care of that. In that case, the computer automatically detects unreachable memory - memory that the computer knows won't ever be used again - and frees it without asking for the programmer's permission. We call the subsystems that deal with that garbage collectors.
These subsystems often add overhead to the program - making it slower, consuming memory of their own, or both - so a lot of research has been done to minimize their negative effects.

In truth, some types of garbage collectors can be combined with some sort of manual memory management, but that's outside of the scope of this post.

Previous Work - Reference Counters

I already had some limited experience with garbage collectors, since I had written at least one reference counter before.
A reference counter is a primitive type of garbage collector. With these systems, allocated memory comes with an integer variable that's the counter.
Every time a new part of the program needs the variable, the counter is incremented.

#include <stdlib.h>

struct foo {
    int x, y;
    unsigned counter;
};

int main() {
    //First we allocate memory in the heap
    struct foo* bar = malloc(sizeof(struct foo));
    
    //We are currently referencing it with variable bar, so
    //our counter must be 1
    bar->counter = 1;

    //Then we use it
    bar->x = 2;
    bar->y = 3;

    //Then we deallocate it
    --bar->counter;
    if (bar->counter == 0) {
        free(bar);
    }

    return 0;
}

C++ makes reference counters easy to implement and use safely:

struct foo {
    int x, y;
    int counter;
    foo() : counter(0) {}
};

class foo_ref {
    foo* ref;
public:
    foo_ref() : ref(new foo()) { ref->counter = 1; }
    foo_ref(const foo_ref& other) : ref(other.ref) {
        if (ref) {
            ++ref->counter;
        }
    } 
    foo_ref& operator =(const foo_ref& other) {
        if (other.ref) {
            ++other.ref->counter;
        }
        if (ref) {
            --ref->counter;
            if (ref->counter == 0) {
                delete ref;
            }
        }
        ref = other.ref;
        return *this;
    }
    ~foo_ref() {
        if (ref) {
            --ref->counter;
            if (ref->counter == 0) {
                delete ref;
            }
        }
    }

    foo& operator *() { return *ref; }
    foo* operator ->() { return ref; }
};

int main() {
    //First we allocate the memory
    foo_ref bar;

    //Then we use it
    bar->x = 2;
    bar->y = 3;

    //Then C++ releases it for us
    return 0;
}

However, as you can see, the basic concept is the same - it's just easier to use.

Every time some part of the program no longer needs the variable, the counter is decremented. When this happens, if the counter has reached zero, that means nobody needs the variable anymore and its memory is deallocated.
Reference counters are very simple, well understood and can be made thread-safe. Because of this, many programmers - even those who dislike garbage collectors in general - often use them. After all, being simple and well understood usually means having few bugs.
They are so popular that even C++'s standard libraries include them nowadays.

However, reference counters have a few problems.
First of all, all allocated heap memory needs a counter. Counters need memory, so reference counters add some memory overhead.
Second, incrementing and decrementing that counter takes CPU time. If this happens too often, the program becomes slower.
Lastly - and perhaps most importantly - reference counters are occasionally unable to prevent memory leaks. This third point needs some extra explanation.

Often, variables reference other variables. For instance, consider the following Java program:

public class Foo {
    public Foo next;
    public int value;
}

In the program above, a variable of type Foo might reference another Foo, which may in turn reference another Foo. References form a directed graph.
This is all fine until we introduce cycles to the graph. Consider this case:

Foo a = new Foo();
Foo b = new Foo();
Foo c = new Foo();
a.next = b;
b.next = c;
c.next = a;

This is where our problems begin. Because a needs b, we increment the counter for b. Because b needs c, we increment the counter for c. And because c needs a, we increment the counter for a.
Because they each need each other, even when they go out of scope, they still reference each other. Because they are referencing each other, their counters never reach zero. And because their counters never reach zero, they are never deallocated. So, once again, we have a memory leak.
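The same leak can be demonstrated with C++'s standard reference counter, std::shared_ptr - a minimal sketch, where a weak_ptr observer shows that the cycle outlives every external reference:

```cpp
#include <memory>

struct Foo {
    std::shared_ptr<Foo> next;
};

std::weak_ptr<Foo> make_cycle() {
    auto a = std::make_shared<Foo>();
    auto b = std::make_shared<Foo>();
    auto c = std::make_shared<Foo>();
    a->next = b;
    b->next = c;
    c->next = a;  // the cycle: the counters can never drop to zero
    return a;     // a weak_ptr does not keep the object alive by itself
}
```

After make_cycle returns, a, b and c have all gone out of scope, yet the three objects are still alive because each one's counter is held by another.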
Other types of garbage collectors fix this problem.

There are tons of garbage collector algorithms to choose from. Each has different advantages and different problems. An example of a simple garbage collector that does not have this problem is the mark and sweep.

Mark and Sweep

Mark and sweep is a very simple - and supposedly not very efficient - garbage collector. Unlike the reference counter, mark and sweep handles cyclic references without any problems.
The algorithm is something like this:
  • Mark which regions of memory are still needed
  • Sweep away those that are not.
So, how exactly do we find out which regions of memory are still needed?
First of all, we'll define the concept of roots. A root is an object that is currently and directly being referenced. For instance, a variable in a currently-being-executed function is a root. Another type of root is a static class field.
Roots can never be garbage collected, since they might still be needed by the program.

An object might still be needed if it is reachable - that is, if it is a root or if it is referenced by another reachable object.
So, if S is the set of all objects, R is the set of all roots and T is the set of all objects that are currently still needed, the mark phase is equivalent to this (although implemented differently):
  1. Let T = R
  2. Add to T all objects in S that are currently being referenced by an object in T
  3. Repeat 2 until no object in T references an object not in T.
In practice, this can be implemented by something like this:

void mark(Object o) {
    if (!o.marked) {
        o.marked = true;
        for (Object o2 : o.getReferencedObjects()) {
            mark(o2);
        }
    }
}

void mark() {
    for (Object o : objects) {
        o.marked = false; //Initially, no object is marked
    }
    for (Object o : roots) {
        mark(o); //Mark all roots.
    }
}

Then, once that is done, we proceed to the sweep algorithm:

void sweep() {
    for (Object o : objects) {
        if (!o.marked) {
            free(o);
        }
    }
}

This might look simple, but there are a few problems:
  • We need to be able to identify which objects are roots.
  • We need to be able to identify which objects an object references.
The solution to these problems is an implementation detail.
That's good news for those who just need to understand the big picture, but it's not much of a consolation for those who need to implement one: when we are implementing something, pretty much by definition, we do care about implementation details. Heck, we make the implementation details.

My Implementation

My code, available at https://github.com/luiscubal/gctest, is written in low-level C++ (with a little bit of x86-64 assembly). That means it is painfully aware of details such as POD types. Class inheritance is manually implemented using lots of somewhat ugly pointer operations.

This project attempts to simulate a programming language environment similar to - but probably simpler than - Java. It has objects, primitives and arrays. Classes can extend other classes, but multiple inheritance is not supported.
This hypothetical language does support static fields, but that feature wasn't implemented in this prototype.

This prototype also does not attempt to specify how virtual functions are called. It doesn't offer any special support for that but it does allow finding out what the class type of an object is, so virtual methods are certainly possible, albeit somewhat inefficient.

It also does not enforce access modifiers itself. Instead, it relies on a hypothetical code generator subsystem to do that.

The prototype does not run garbage collection on classes themselves. For instance, if the language offered some sort of ClassLoader-like API to load new classes, those classes wouldn't be collected even if they went out of scope. However, I believe that adding support for this wouldn't be too difficult.

The prototype does not offer support for any sort of C#-style structs, but I believe those should be relatively easy to add. As for generics, those could probably be implemented on the code generator, without needing any special treatment from the GC. As far as the GC subsystem would be concerned, Foo<A> and Foo<B> would be entirely different unrelated types.

Finally, the runtime does not offer any support for constructors, destructors nor exceptions. The garbage collector/allocator was developed under the assumption that constructors and exceptions are the sole responsibility of the code generator - e.g. a LLVM-based compiler.
Destructors could potentially be implemented in a later phase, but those would require some extra work - probably an additional step between mark and sweep (mark, destroy and sweep?). Destructors are problematic because they might themselves cause the object to become reachable:

public class Foo {
    public static Foo x;

    ~Foo() { Foo.x = this; }
}


My implementation is written in a class named gc_context. I've tried to avoid global variables as much as possible because I wanted it to be possible to run many isolated garbage collected environments in a single process. However, my testing program does indeed create a global gc_context object.

Because no complex pointer arithmetic is applied to gc_contexts, all the C++ bells and whistles can be used. gc_context is not a POD type.

The code was written to allow the following program execution design:
  1. The environment - e.g. class type information - is constructed, possibly generated from an external source (e.g. the equivalent of a JAR file). In this prototype, this step was hard-coded in the main function.
  2. The "Context" then generates required runtime information - e.g. the size of each class and the memory offsets of each class field. This step was fully implemented in this prototype.
  3. The code itself - possibly generated with some JIT compilation technology - is then executed. In this prototype, the code was written in a C++ subset designed to mimic the generated JIT code, and compiled using an ordinary C++ compiler (Windows 64-bit GCC/MinGW).
The first problem I attempted to solve was "how to identify the roots". In this prototype, the roots are all in the stack (since there are no static fields yet), so all I needed to do was figure out how to read the stack.
On the x86 architecture, the current stack position is stored in the %esp register (%rsp in 64-bit mode), and the stack grows downwards.
Since the GC subsystem runs in the same thread as the rest of the program, they share the same stack.
Therefore, the GC can find out what the current stack position is and, with it, what portion of the stack is "active".

Of course, it still doesn't know which parts of the stack are pointers and which are integers that just happen to look like pointers. However, the code doesn't care and sees everything in the stack as a potential root. This does mean there's a risk of false positives occurring during the marking phase, but it simplifies the garbage collector and it's very fast. Also, the probability that some random integer exactly matches a valid pointer is very low.
The Mono project's SGen pages contain some interesting information, and that's where I took the idea of accepting false positives in the stack from, although Mono's SGen uses an algorithm that's different from the one described here.

In the future, I might have to add the general purpose registers to the list of roots, but for now this seems to work.
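The conservative treatment of the stack can be sketched as a simple range scan: every word whose value happens to fall inside the heap is treated as a potential root. The function and names below are mine, not the prototype's - just a minimal illustration of the idea.

```cpp
#include <cstdint>
#include <vector>

// Treat every word in [begin, end) whose value lands inside
// [heap_begin, heap_end) as a potential root. False positives are accepted.
std::vector<uintptr_t> find_candidate_roots(const uintptr_t* begin,
                                            const uintptr_t* end,
                                            uintptr_t heap_begin,
                                            uintptr_t heap_end) {
    std::vector<uintptr_t> candidates;
    for (const uintptr_t* p = begin; p != end; ++p) {
        if (*p >= heap_begin && *p < heap_end) {
            candidates.push_back(*p); // might be an integer that looks like a pointer
        }
    }
    return candidates;
}
```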

Without further delay, here is a portion of the test code:

gc_context* ctx;

int main() {
    //Build environment
    ctx = new gc_context(get_stack_pointer());

    class_type core_Link;
    core_Link.full_name = "core.Link";
    core_Link.base_type = 0;

    field core_Link_prev;
    core_Link_prev.type = "Lcore.Link;";
    core_Link_prev.flags.is_static = 0;
    core_Link.fields.push_back(core_Link_prev);

    field core_Link_next;
    core_Link_next.type = "Lcore.Link;";
    core_Link_next.flags.is_static = 0;
    core_Link.fields.push_back(core_Link_next);

    field core_Link_val;
    core_Link_val.type = "I";
    core_Link_val.flags.is_static = 0;
    core_Link.fields.push_back(core_Link_val);

    ctx->push_class_type(&core_Link);

    //Compute sizes and offsets
    ctx->compute_sizes();
    ctx->log_headers();

    //Execute code
    test_array();
    cout << "Now list" << endl;
    test_linked_list();
    cout << "Array again" << endl;
    test_array();

    //Some debugging ending code
    cout << ctx->count_heaps() << endl;

    cout.flush();
    cerr.flush();
}

In the example above, test_array and test_linked_list are two functions designed to stress the garbage collector in different ways.

More details

Each program object instance follows the same convention: a common header named core_representation that - among other things - specifies its type, followed by the instance's content.
The header is specified as a POD C-style struct, and lots of horrible pointer arithmetic operations are done with it.
The size of each instance equals the size of the header plus the size of its fields (plus some extra padding for alignment purposes).

struct core_representation {
    const char* type;
    mark_id_t last_mark;
};

void do_something(core_representation* instance) {
    *((uint32_t*) ((char*) instance + some_uint32_field_offset)) = 10;
}

The garbage collector is triggered any time a memory allocation fails.
The GC context owns a number of gc_heap objects which are regions of heap memory. Every time those gc_heap objects get full, it first attempts to clean them and, if that's not enough, it allocates more memory.

gc_heap objects have a default size of 4 KiB. However, if an object is too big to fit in that default size, a bigger gc_heap object is created.

Each gc_heap object is divided into small units. For each of those units, there are two bits - one specifying if it is being used and another one specifying if it is the beginning of an object/array.
The used bit is stored in a variable named heap_bitset. The start-of-an-object bit is stored in a variable named heap_starts. These variables can be represented with the STL type vector<bool>, since we do not need gc_heap objects to be POD types.

Every time memory allocation is requested, the gc_context goes through every gc_heap object it owns and asks for free contiguous space until it finds enough.

Once the memory is allocated, the bits are set to indicate that the memory is being used and the heap_starts flag is activated for the bit where the object starts. Note that the heap_starts is only for the unit where the object starts.
So if an object 3 units long is stored at position 2 and another, a single unit long, is stored at position 6, here is what the two bitsets look like:

Index       0 1 2 3 4 5 6 7
heap_bitset 0 0 1 1 1 0 1 0
heap_starts 0 0 1 0 0 0 1 0
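Using the bitset layout above, the search for free contiguous units can be sketched as a first-fit scan over heap_bitset. This is a simplified stand-in of mine for the prototype's allocator, not its actual code:

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first run of `units` contiguous free units,
// or -1 if the heap has no such run.
ptrdiff_t find_free_units(const std::vector<bool>& heap_bitset, size_t units) {
    size_t run = 0;
    for (size_t i = 0; i < heap_bitset.size(); ++i) {
        run = heap_bitset[i] ? 0 : run + 1; // a used unit resets the run
        if (run == units) return ptrdiff_t(i - units + 1);
    }
    return -1;
}
```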

Because flagging every object as "not-marked" every time the GC algorithm runs is annoying, I decided to give each mark execution an "ID".
Every time the mark phase runs, it sets the "last mark id" of all active objects as the ID of the current execution.
Once sweep kicks in, it checks for objects with old IDs. When an old ID is found, then it knows that the object is unreachable and ready to be deallocated.
When an object is deallocated, the corresponding bitsets are once again set to 0.
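The mark-id trick boils down to comparing each object's stored id against a global counter that is bumped once per collection. A minimal sketch, with names of my own choosing:

```cpp
#include <cstdint>

using mark_id_t = std::uint64_t;

struct object_header {
    mark_id_t last_mark = 0;
};

mark_id_t current_mark = 0;

// Bumping the id implicitly "unmarks" every object at once -
// no per-object clearing pass is needed.
void begin_mark_phase() { ++current_mark; }
void mark(object_header& o) { o.last_mark = current_mark; }

// Sweep: anything whose id is stale was not reached this cycle.
bool is_garbage(const object_header& o) { return o.last_mark != current_mark; }
```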

Since the "last mark id" is an actual part of the object layout, marking an invalid object pointer means changing an incorrect part of the memory.
All values on the stack are candidates for marking, but many are invalid random integers. So, before marking an object, the GC checks that the object does belong to a heap and that it is indeed a valid pointer. It does so using the following algorithm:

const gc_heap* gc_context::is_heap_object(void* obj, bool is_gc_object) const {
    if (obj < first_heap || obj >= last_heap) {
        return nullptr;
    }

    for (const gc_heap& heap : heaps) {
        if (heap.contains(obj, is_gc_object)) {
            return &heap;
        }
    }

    return nullptr;
}

Where heap.contains checks that the object is within the allocated space for that heap, is aligned with the unit size (which all valid allocations must be) and has its heap_starts bit set.
first_heap and last_heap are a fast check to exclude all pointers that are known to be outside all heaps.

Also during the mark phase, there's the chance that an object might be referenced multiple times. The algorithm must detect that an object was already marked during that execution and skip it to prevent infinite loops.

In hindsight, the mark phase was way harder to implement than the sweep phase.

Arrays

Arrays are special types. They have a "content type" which may or may not be another array, and a length.
Arrays have the same memory layout prologue as classes, but they have a few extra fields. Notably, they include a pointer to their content.
In an array of length 10 with content-type "32-bit integer", the content points to a 40-byte chunk of memory that contains the 10 integers.
In an array of length 10 with content-type "object of some type" or "array of some type", the content points to a chunk of memory that contains the references to the objects or arrays.

struct array_representation {
    core_representation core;
    size_t array_length;
    void* content;
};

In other words, arrays have two parts: The "header" and the "content", which are in two separate memory locations.
I initially considered merging the two but so far I have decided not to. I might revert that decision later.

Problems

Since this is low-level C++ programming, problems are sure to arise. I will list some of the most noteworthy here.

At some point, for long reference chains, such as large linked lists, the program would crash with a segmentation fault. I managed to figure out that this was actually caused by a stack overflow in the mark phase.
I had already suspected that the deeply recursive marking algorithm, if implemented in a language like C++, would cause trouble, so when I got this error I already had a pretty good idea of what could be causing it.
Consider a linked list A->B->C. The mark phase would mark A, then call a function to mark B, which would in turn call a function to mark C. If the linked list had thousands of elements, the call stack would grow very large.
This was solved by replacing the recursion with a queue of objects that must be marked. Without deep recursion, the stack overflow issue is avoided.
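The queue-based replacement looks roughly like this (a sketch of mine, not the prototype's code): the worklist keeps the native call stack shallow no matter how long the reference chain is.

```cpp
#include <deque>
#include <vector>

struct object {
    bool marked = false;
    std::vector<object*> refs;
};

void mark_all(const std::vector<object*>& roots) {
    std::deque<object*> worklist(roots.begin(), roots.end());
    while (!worklist.empty()) {
        object* o = worklist.front();
        worklist.pop_front();
        if (o->marked) continue; // already visited: also handles cycles
        o->marked = true;
        for (object* child : o->refs) worklist.push_back(child);
    }
}
```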

Another weird issue happened when I attempted to combine C++ and x86-64 Assembly. Doing so would cause the program to immediately crash:

extern "C" {
    extern __cdecl void* get_stack_pointer();
}
.text

.global get_stack_pointer

get_stack_pointer:
 movq %rsp, %rax
 ret

This was solved by adding some SEH-related lines.

.text

.global get_stack_pointer
.seh_proc get_stack_pointer

get_stack_pointer:
 .seh_endprologue
 movq %rsp, %rax
 ret
 .seh_endproc

I'm not entirely sure what the .seh lines do, but they make it work, so I'll leave them alone. I probably wouldn't have figured it out by myself without the help of gcc -S.

Another problem I had was an apparent memory leak. When testing my program with many large arrays, I noticed that the memory would grow a lot - sometimes to nearly 4 GiB.
This happened because the GC algorithm would only collect when the gc_heap objects were getting full. However, because the arrays themselves - the "headers" - were very small, the algorithm decided collecting was not worth the effort. At that point, the array contents were allocated separately using a normal malloc call, completely ignoring the gc_heap objects.
Once the code was changed to have the array contents be allocated on the gc_heap objects, the problem was gone. It is worth noting that while array contents do set the heap_bitset, they have no effect on the heap_starts, which remains unset for the entire array contents.

Profiling and Optimization

Even when the garbage collector was working, it was somewhat slow. In order to make it faster, some optimization was needed.
Since I had very little clue on what was making it slow, I decided this was a good opportunity to improve my profiling skills.
The first thing I had to do was figuring out which profiler to use. I decided to try gprof, which comes with MinGW.
It turns out that using gprof with eclipse is not that hard. The following steps are required:
  1. Enable gprof support in the project. To do so, go to Project Settings > C/C++ Build > Settings and then, in GCC C/C++ Compiler > Debugging, activate -p and -pg.
  2. Run the program. A file "gmon.out" will be generated on exit.
  3. Run gprof executable_name gmon.out > gmon.txt on the command line (I used the PowerShell).
A "gmon.txt" file will be produced. Open it and you'll find something like this:


  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 47.71     79.01    79.01                             _mcount_private
 24.86    120.17    41.16                             __fentry__
  5.40    129.12     8.95 263627666     0.00     0.00  std::_Bit_iterator_base::_Bit_iterator_base(unsigned long*, unsigned int)
  2.45    133.17     4.05 3020571660     0.00     0.00  std::_Bit_const_iterator::_Bit_const_iterator(std::_Bit_iterator const&)
  1.64    135.88     2.71        2     1.36     1.37  test_array()
  1.46    138.29     2.41 1537955729     0.00     0.00  std::vector<bool>::operator[](unsigned long long)
  1.36    140.55     2.26 1510285830     0.00     0.00  std::operator-(std::_Bit_iterator_base const&, std::_Bit_iterator_base const&)
...

This table indicates how long the program spent in each function, how many times each function was called and how much time was spent per call.
Stack Overflow has a long list of reasons explaining why gprof sucks, but it's still better than having no profiler at all.

The first two lines (_mcount_private and __fentry__) are apparently overhead from gprof itself and therefore not very interesting to study.

However, on earlier versions of this prototype, std::vector<bool> iterator-related functions were taking up a significant amount of time.
Because the size of the vector is not known at compile time, I could not switch to a std::bitset<N>. So I developed my own fast_bitset class.
Besides the usual get, set and unset operations, fast_bitset offers functions to quickly find the next set bit or the next unset bit, and to set or unset multiple bits at once. Instead of checking bit by bit, it often checks 32 bits at once, which (in theory, in a best-case situation) can make it 32 times faster.
And, indeed, once I replaced even just a few of the std::vector<bool> usages with it, times improved.
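The word-at-a-time idea behind fast_bitset can be sketched as follows. next_unset_bit is an illustrative stand-in of mine, not the actual fast_bitset code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Finds the index of the first clear bit at or after `from`,
// skipping whole 32-bit words that are all-ones. Returns -1 if none.
ptrdiff_t next_unset_bit(const std::vector<uint32_t>& words, size_t from) {
    for (size_t w = from / 32; w < words.size(); ++w) {
        if (words[w] == 0xFFFFFFFFu) continue; // all bits set: skip 32 at once
        for (unsigned b = (w == from / 32 ? from % 32 : 0); b < 32; ++b) {
            if (!(words[w] & (1u << b))) return ptrdiff_t(w * 32 + b);
        }
    }
    return -1;
}
```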

Multi-threading

At this point, the prototype does not use multi-threading at all. However, it should be possible to implement Web Workers-style multi-threading.
Web Workers do not actually share memory; they communicate with message passing. Unlike full threads, this type of thread should actually work with the current system.
This GC was designed to allow multiple instances of gc_context to coexist in the same program. As long as the gc_context is thread-local and that all memory allocations for a thread happen in that thread, it should work.
For full Java-like threads, the GC would not work as-is. The collector would most likely need to stop-the-world and some way to find out the contents of the stacks (and registers) of the remaining threads.

However, even for a single-threaded program, the collector itself could be modified to split work across multiple threads. Both the mark and the sweep phases should be easy to parallelize. The mark phase is based on a queue - which could probably be made thread-safe. The sweep phase could be divided so that each gc_heap sweep is run from a different thread. For N gc_heap instances, up to N threads could be used. This is not possible in the current prototype, but it should be trivial to add (whether it would perform well is an entirely different problem).

Potential future problems

  • I currently do not treat registers as GC roots.
  • I always assume that all pointers in the stack are perfectly aligned.
  • Once a gc_heap object is allocated, it is never deallocated until the context itself is deallocated.

Future Work

There are multiple portions of this prototype that could use some improvements.
Here's my wish-list:
  • Static fields
  • First-class support for virtual methods
  • Support for C#-style structs and delegates.
  • Support for destructors and weak references.
  • Proper multithreading
  • Support for other processor architectures (currently, this is x86-64 only)
And, lastly:
  • Turning this into a full language runtime with the help of LLVM.

Conclusion

Mark and sweep is a simple algorithm that does not require too much voodoo to work properly.
Knowing the exact object layout helps the algorithm by reducing false positives from random integers that look like pointers, but it might not be worth the effort.
Writing a simple garbage collector is not that hard - seriously, it took me about as much time as writing this post. Of course, it won't necessarily be very efficient nor very feature complete.
That's all.

Further Reading

Source code (hosted at github)
See all the dirty details here.

MinGW-builds
Includes 64-bit MinGW.

Garbage collection (computer science) - Wikipedia, the free encyclopedia
As usual, Wikipedia remains a great source to learn about a programming topic.

Boehm-Demers-Weiser - A garbage collector for C and C++
A conservative garbage collector that works in C and C++ programs. It usually doesn't require changes to the source code. It has the problem that it also scans class fields conservatively (and not just the stack), but its high compatibility with pretty much anything makes it a good option for C++ programmers who want garbage collection. It has also been used to detect memory leaks in programs.

std::shared_ptr - cppreference.com
A reference counter from C++'s own standard library.

Generational GC - Mono
The Mono Project has a few projects detailing how their brand new garbage collector works. Previously they used the "Boehm GC".

SuspendThread function (Windows)
A Windows API function that pauses a thread. Can be combined with ResumeThread to stop the world. Unfortunately, a quick Google search suggests that this is The Source Of All Evil™ and special care is required to use it. I'm also unable to find an equivalent for POSIX platforms.

GetThreadContext function (Windows)
Once the world has been stopped, this function can be used to obtain the registers of a thread.

The Java™ Virtual Machine Specification
If you're really really bored.