Sunday, April 21, 2013

Asynchronous OpenGL texture loading with C# and OpenTK

Games often have a large quantity of resources to load. When streaming is not an option (or not a good option), loading screens are one common solution.
Loading screens should be as short as possible (to reduce player frustration) and should include some sort of animation to let the player know that the game isn't frozen.
This post focuses on two questions:
  • How can we take advantage of multi-core processors to reduce loading times?
  • How can we create/load OpenGL textures (and other OpenGL resources) and render something to the screen at the same time?
We'll be exploring the following methods to implement loading screens:
  • Synchronous texture loading.
  • Single-threaded alternating texture loading.
  • Helper loading thread.
  • Multi-threaded loading with Parallel.ForEach.
  • Task-based loading with TaskCompletionSource or a custom TaskScheduler.

Required Knowledge

  • The code is written in C#, so basic knowledge of C# is necessary.
  • Basic OpenGL knowledge would help.  We'll be using OpenTK in this post - a .NET library that allows C# programs to use OpenGL (and a few other APIs as well).

Synchronous Texture Loading

First, we'll explore how a synchronous texture loading system could work, its advantages and its shortcomings.
To create a texture in OpenGL, we need to follow these steps:
  1. Read the contents of the texture file
  2. Decode them
  3. Create a new OpenGL texture name (see glGenTextures).
  4. Send the texture data to OpenGL with glTexImage2D.
  5. Set the image parameters.
  6. Release the data of the texture file we loaded.
To load and decode textures, we'll use the System.Drawing.Bitmap class.
using (var bitmap = new Bitmap(filename)) {
    width = bitmap.Width;
    height = bitmap.Height;
    textureId = GL.GenTexture();
    GL.BindTexture(TextureTarget.Texture2D, textureId);
    var bitmapData = bitmap.LockBits(
        new Rectangle(0, 0, bitmap.Width, bitmap.Height),
        System.Drawing.Imaging.ImageLockMode.ReadOnly,
        System.Drawing.Imaging.PixelFormat.Format32bppArgb);
    try
    {
        GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba,
            bitmap.Width, bitmap.Height, 0, PixelFormat.Bgra, PixelType.UnsignedByte,
            bitmapData.Scan0);
    }
    finally
    {
        bitmap.UnlockBits(bitmapData);
    }
    GL.TexParameter(TextureTarget.Texture2D,
        TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Nearest);
    GL.TexParameter(TextureTarget.Texture2D,
        TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Nearest);
}
This approach is simple. And simplicity is a feature.
The main problem with this approach is that it is blocking. While the textures are being loaded, nothing else can be executed. That includes rendering.
Until all textures are loaded, the game freezes. This is undesirable.
I've tested this method to load 1500 textures and it took around 2200ms. That's two seconds waiting for something to happen with no visual feedback.

Loading Screen

So now the big question is how to render and load at the same time.
One possible answer is to load a few textures per frame.
For instance, if we have 1500 textures to load, we could load 50 each frame and render normally.
After 30 frames, the loading would be over.

This approach works because we never load too many textures in each frame (and each texture individually takes only a short time to load) so the loading screen is somewhat smooth.
Since it's just a loading screen, it probably won't have anything too heavy anyway and frame rate drops are acceptable.
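As a rough sketch (assuming a pendingTextures queue of filenames, a loadingDone flag and a LoadTexture helper built from the synchronous code above - all hypothetical names), the per-frame version could look like this:
//Load at most a fixed number of textures per frame, then render normally.
const int TexturesPerFrame = 50;

protected override void OnUpdateFrame(FrameEventArgs e)
{
    base.OnUpdateFrame(e);

    for (int i = 0; i < TexturesPerFrame && pendingTextures.Count > 0; ++i)
    {
        LoadTexture(pendingTextures.Dequeue());
    }

    loadingDone = pendingTextures.Count == 0;
}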

I haven't measured this method, but I'd expect it to take a bit longer than the blocking one.

But we still have one remaining question: How can we use multi-core processors to speed this up?

Concurrency

There are many ways to implement concurrency.
The best known method is probably multi-threading.
Essentially, if you want to execute three tasks at once you create three threads. Each thread then executes a task.
All running programs have at least one thread.
The thing about threads is that aside from sharing memory, they are essentially independent of each other. This is, of course, a feature. But thinking about multiple pieces of code executing at the same time tends to be harder than thinking about the equivalent single-threaded code, since threads introduce several brand new types of errors.
As a result, it is currently harder to write correct multi-threaded programs than to write correct single-threaded programs.
So why bother?
While the usefulness of threads is somewhat limited (but still existent) in single-core processors, it's on multi-core (or at least hyper-threaded) processors that threads really shine.
Typically, the operating system will assign threads to processor cores. Often, we'll have different threads running on different cores. It's when this happens that we get true parallelism (as opposed to preemptive multitasking). In theory, parallelism can speed up programs up to N times (where N is the number of cores). EDIT: In some rare cases, it can be higher than N - See super-linear speedup.
In practice, however, not all tasks can be parallelized and multithreading adds some overhead - so the speed up is typically smaller than that. How much smaller? It depends. In some cases, the multi-threaded program might actually be slower than the single-threaded one.

So how do we use threads to improve our loading screen?
The first option is to create a helper thread that does the loading, while our main thread is rendering.
This approach essentially offers very little advantage over our load-a-few-textures-per-frame approach:
  • It can only use up to two cores - so if we have more cores than that, tough luck.
  • Only one core does the loading, so we can expect one core to be significantly busier than the other. Our program won't be anywhere near twice as fast as the first one. I haven't really tested this approach, but I expect it to take around 2200ms as well (the time the blocking method takes) - perhaps a bit longer, due to the overhead of multi-threading.
The first issue we'd encounter if we tried this method is that OpenGL would refuse to load our textures from the loader thread.
To understand why, we need to understand what OpenGL contexts are.

OpenGL contexts

A context represents the state of OpenGL and all sorts of info that OpenGL needs to know in order to do anything.
That means no OpenGL command can be executed without a context.
Many libraries hide the creation of a context from the programmer so that OpenGL can be used in blissful ignorance.
All that bliss comes crashing down when threads are introduced.

Each thread can only have one context (called the current context) and each context can only be current to one thread. A context doesn't necessarily have to be the current context of any thread, but a context that's never made current is worthless.
Changing the current context of a thread is an expensive operation and, as such, should be avoided as much as possible.

Contexts can be independent of each other or they can share resources such as textures. By default, context sharing is enabled in OpenTK.

So for our loader thread proposal to work, we'd have to create a new OpenGL context and make it current on the loader thread.
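A minimal sketch of that idea (assuming the main window already exists, so the new context shares its resources, and reusing a LoadTexture helper based on the synchronous code) could look like this:
//Helper thread with its own OpenGL context. The extra GameWindow exists only
//to give this thread a context that shares textures with the main one.
var loaderThread = new Thread(() =>
{
    using (var loaderWindow = new GameWindow(1, 1))
    {
        loaderWindow.MakeCurrent();

        foreach (var textureToLoad in texturesToLoad)
        {
            LoadTexture(textureToLoad);
        }
    }
});
loaderThread.IsBackground = true;
loaderThread.Start();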
But at this point, we still don't take full advantage of multi-threading.

Parallel.ForEach

Threads incur overhead. That much is a fact. This overhead can be mitigated but never eliminated.
This overhead is acceptable when the gains are enough to exceed it. But the gains of multi-threading are limited.
Let's simplify the problem and ignore the whole mess of I/O bound vs CPU bound work and hyper-threading. Then we can say that if there are more threads than processor cores, the threads can't all run in parallel. We still get some parallelism, but the operating system has to time-slice the threads to ensure they all get to run.
At that point we keep paying the cost of the extra threads, but performance starts to degrade.
So let's say using 10 threads gives us the best result, but we have 1000 textures to load. How do we split work across the 10 threads?
The obvious answer is "let each thread load 100 textures".

But not all textures necessarily take the same time to load. A small texture will typically load faster than a large one. What if one thread gets 100 fast textures and another gets 100 slow ones? Clearly, we wouldn't be taking proper advantage of the threads.
And how do we even know 10 threads are the best?

These are all hard problems.
In .NET, this problem was solved with the introduction of the Task Parallel Library. There's no need to reinvent the wheel in this case.

Here's what the code would look like:
//Sequential code
foreach (var textureToLoad in texturesToLoad) {
    LoadTexture(textureToLoad);
}

//Parallel code
Parallel.ForEach(texturesToLoad, textureToLoad => {
    LoadTexture(textureToLoad);
});
This example assumes, of course, that LoadTexture is thread-safe. In my own code, I decided to store all loaded textures in a Dictionary. However, Dictionary is not thread-safe, so I had to replace it with a ConcurrentDictionary. Always be careful about that sort of thing when writing multithreaded code.
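For illustration, the thread-safe store could be as simple as this (CreateTextureFromFile being a hypothetical wrapper around the synchronous loading code from before):
//Hypothetical sketch: a thread-safe map from filename to OpenGL texture id.
ConcurrentDictionary<string, int> loadedTextures = new ConcurrentDictionary<string, int>();

void LoadTexture(string filename)
{
    int textureId = CreateTextureFromFile(filename); //the synchronous code from before
    loadedTextures[filename] = textureId; //safe to call from multiple threads
}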
Although we could rewrite this example using PLINQ Select, I've decided to stick to Parallel.ForEach because we're not done yet. We're going to use a different Parallel.ForEach overload - and I'm unaware of any PLINQ equivalent.

The first problem is, once again, OpenGL contexts. Since Parallel.ForEach may use multiple threads, we need to set a new context for each thread. And, of course, destroy the context when we're done.

The second problem is that Parallel.ForEach is blocking. Even though it runs in parallel, it still makes our main thread wait for all the worker threads to finish their work. We do not want this.
Here's the fixed code:
//We'll be covering what the Task class is and what it does soon.
Task.Factory.StartNew(() =>
{
    Parallel.ForEach(texturesToLoad,
    () =>
    {
        //This is the code to initialize each new thread
        var window = new GameWindow(width, height, graphicsMode, "", gameWindowFlags);
        window.MakeCurrent();

        return window;
    },
    (textureToLoad, state, window) =>
    {
        //This is the code to load each texture
        LoadTexture(textureToLoad);

        return window;
    },
    (window) =>
    {
        //This is the code to finalize each thread
        window.Dispose();
    });
});
This solves our problem and takes advantage of parallelism to achieve faster results.
In my own tests, loading 1500 textures (the same textures we've loaded before) takes almost 2000ms. This is faster than our previous code but still somewhat disappointing.
We can do better than this.

Tasks and continuations

We've used Task.Factory before, but now it's time to really cover what it is.

Essentially, a Task is a unit of work. The best part about using tasks is that it opens up the way to use C# 5 syntax sugar that makes asynchronous programming less painful to use.
Though the prospect of using syntax sugar in the future is appealing, we will not actually be doing that today.
Today, we're going low-level.

So, how do we create a task anyway?
We've created a task in our example above using Task.Factory.StartNew. This is the simplest way to do it.
Task.Factory.StartNew(() => {
    Console.WriteLine("Hello from task");
});
The example above "just works". Once that code is executed, it'll create a task that's (potentially) executed in parallel.
Notice that the task is started as soon as it's created. This behavior is not strictly required. It's possible to construct a Task that's not immediately started with "new Task(action)", but by convention all functions that return tasks start them unless there's a really good reason not to.
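For instance, this creates a "cold" task that only runs once we explicitly start it:
var task = new Task(() => Console.WriteLine("Hello from task"));
//Nothing happens yet - the task only runs after we start it.
task.Start();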

At this point, Tasks don't do much we couldn't already do with Parallel.ForEach, with the added inconvenience that we can't set the OpenGL context of each thread.

For tasks to become more useful, we first need to introduce the concept of task continuations.
The purpose of task continuations is to allow performing some action only after a task has been completed.
Task.Factory.StartNew(() => 3)
    .ContinueWith(task => Console.WriteLine(task.Result));
In the example above, we first launch a task that returns 3 and indicate that, when that task is completed, we want to print the result on the screen.
With this, we can create tasks that depend on other tasks.
Task.Factory.StartNew(() => 3)
    .ContinueWith(task => {
        Task.Factory.StartNew(() => {
            Console.WriteLine(task.Result);
        });
    });
This example launches a second task after the first one has been completed. We'll soon see why this matters.

Loading textures in two steps

We'll now be trying a different approach to load our textures.
Up until now, we've been trying to parallelize the entire loading process. Now, we'll instead parallelize only a portion of the work.
Because we're parallelizing less work, we could expect performance to degrade but the opposite seems to happen.

Let's go back to the steps required to load a texture:
  1. Read the contents of the texture file
  2. Decode them
  3. Create a new OpenGL texture name.
  4. Send the texture data to OpenGL.
  5. Set the image parameters.
  6. Release the data of the texture file we loaded.
The most expensive steps are the first two. What if we parallelized only these and kept the remaining ones single threaded?
foreach (var textureToLoad in texturesToLoad)
{
    var baseTask = new Task<Bitmap>(() =>
    {
        var bitmap = new Bitmap(textureToLoad.AsFilename());
        return bitmap;
    });

    baseTask.ContinueWith(task =>
    {
        ScheduleToBeLoadedInMainThread(() => {
            var bitmap = task.Result;

            LoadTextureFromBitmap(bitmap);

            bitmap.Dispose();
        });
    });

    baseTask.Start();
}
For each texture to load, we launch a new task. Once that task is completed, we run a second operation, which schedules the bitmap to be loaded later by the main thread.
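LoadTextureFromBitmap isn't shown above, but it could look like the synchronous code from before, minus the file reading - roughly along these lines:
//Sketch of LoadTextureFromBitmap: upload an already-decoded Bitmap to OpenGL.
//This must run on the thread that owns the OpenGL context.
int LoadTextureFromBitmap(Bitmap bitmap)
{
    int textureId = GL.GenTexture();
    GL.BindTexture(TextureTarget.Texture2D, textureId);

    var bitmapData = bitmap.LockBits(
        new Rectangle(0, 0, bitmap.Width, bitmap.Height),
        System.Drawing.Imaging.ImageLockMode.ReadOnly,
        System.Drawing.Imaging.PixelFormat.Format32bppArgb);
    try
    {
        GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba,
            bitmap.Width, bitmap.Height, 0, PixelFormat.Bgra, PixelType.UnsignedByte,
            bitmapData.Scan0);
    }
    finally
    {
        bitmap.UnlockBits(bitmapData);
    }

    GL.TexParameter(TextureTarget.Texture2D,
        TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Nearest);
    GL.TexParameter(TextureTarget.Texture2D,
        TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Nearest);

    return textureId;
}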

Now, we combine this approach with our alternating loading approach.
We go to the UpdateFrame method and execute up to N actions (i.e. load up to N textures). The exact value of N depends on how much work you're willing to let each individual frame do. In my tests, I've used N=50.
We'll soon see how we can schedule work to be run in main thread.

Right now, we'll focus on a different problem: how to tell when we've finished loading.
If we're in a loading screen, we want to know we've finished loading so we can move on to the next screen.
For this, I've used a counter that indicates how many operations are left. When this counter reaches 0, we know we're done. Because the counter is read and modified by multiple threads, we must be careful and use atomic operations as necessary.
int counter = 0;
var baseTasks = new List<Task>();
foreach (var textureToLoad in texturesToLoad)
{
    var baseTask = new Task<Bitmap>(() =>
    {
        var bitmap = new Bitmap(textureToLoad.AsFilename());
        return bitmap;
    });

    baseTask.ContinueWith(task =>
    {
        ScheduleToBeLoadedInMainThread(() => {
            var bitmap = task.Result;

            LoadTextureFromBitmap(bitmap);

            bitmap.Dispose();

            if (Interlocked.Decrement(ref counter) == 0)
            {
                //We have no further textures to load
                ready = true;
            }
        });
    });

    baseTasks.Add(baseTask);
}

counter = baseTasks.Count;

foreach (var baseTask in baseTasks)
{
    baseTask.Start();
}
This is good and all, but the ready variable doesn't really let the user of this code take full advantage of tasks. For instance, what if we want to add a continuation that runs after we've finished loading our textures?

TaskCompletionSource

TaskCompletionSource (TCS, for short) enables a whole new way to create tasks.

Up until now, we would start a new task by giving .NET a new action to run. The code would execute and then the task continuations would be executed.

With TCS, we tell C# "Hey, we're doing some stuff. We'll let you know when we're done and what the result was."
So while the tasks we've used up until now were somewhat active - they decided when they were done - the TCS paradigm is more passive - we decide when they are done.

TCS allows us to run code using a completely custom mechanism, and notify C# when we're done to allow continuations to be executed as they normally would.
public static Task WaitOneSecond() {
    var tcs = new TaskCompletionSource<object>();

    new Thread(() => {
        Thread.Sleep(1000);
        tcs.SetResult(null);
    }).Start();

    return tcs.Task;
}
Let's study the example above.
This function creates a new pending task. Then, we create a new thread that sleeps for one second and lets C# know when the waiting time is over.
We use TaskCompletionSource<object> because C# doesn't offer a non-generic TaskCompletionSource. TaskCompletionSource<object>.Task gives us an instance of Task<object> which, fortunately, derives from Task, so we can return it as a plain Task.
Note that we don't really "start" the task because it's up to us to decide how and when the task is executed.

We could use this function like this:
WaitOneSecond().ContinueWith(task => { Console.WriteLine("We're done!"); });
This code snippet waits one second and then prints a message on the screen.

In order to use TCS in our texture loader code, we only need to make very small adjustments.
int counter = 0;

var tcs = new TaskCompletionSource<object>();

var baseTasks = new List<Task>();
foreach (var textureToLoad in texturesToLoad)
{
    var baseTask = new Task<Bitmap>(() =>
    {
        var bitmap = new Bitmap(textureToLoad.AsFilename());
        return bitmap;
    });

    baseTask.ContinueWith(task =>
    {
        ScheduleToBeLoadedInMainThread(() => {
            var bitmap = task.Result;

            LoadTextureFromBitmap(bitmap);

            bitmap.Dispose();

            if (Interlocked.Decrement(ref counter) == 0)
            {
                //We have no further textures to load
                tcs.SetResult(null); //Tell TCS we're done!
            }
        });
    });

    baseTasks.Add(baseTask);
}

counter = baseTasks.Count;

foreach (var baseTask in baseTasks)
{
    baseTask.Start();
}

return tcs.Task;
And we're done.

There's one last missing piece. The ScheduleToBeLoadedInMainThread function.
This function can be implemented in many ways. We'll see one way that's easily reusable by other related classes (that also need to run code in the OpenGL thread) and integrates nicely with other tasks.

TaskFactory and TaskScheduler

First of all, how would we implement ScheduleToBeLoadedInMainThread with TaskCompletionSource? We'd keep a list of actions to execute, then provide a method to add a new action to this list and return a TaskCompletionSource.
Our UpdateFrame method would take elements from this list, execute the action and then notify the TCS.
Now we need to keep track of actions and their respective completion sources.
This works and it's simple. However, we'll cover a different approach here.
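For reference, a sketch of that simpler approach (with UpdateFrame draining pendingActions a few at a time) might look like this:
//Sketch of the TCS-based approach. UpdateFrame would dequeue and run a few of
//these actions per frame.
ConcurrentQueue<Action> pendingActions = new ConcurrentQueue<Action>();

Task ScheduleToBeLoadedInMainThread(Action action)
{
    var tcs = new TaskCompletionSource<object>();
    pendingActions.Enqueue(() =>
    {
        action();
        tcs.SetResult(null); //let continuations run once the main thread is done
    });
    return tcs.Task;
}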

So far, we've created new tasks with the Task constructor or Task.Factory.StartNew(action).
Let's look at factories in more detail.

A TaskFactory is exactly what the name tells us it is: A factory that creates tasks.
This is useful when we want to create many tasks that share a bunch of special properties:
var taskFactory = new TaskFactory();
taskFactory.StartNew(() => { Console.WriteLine("Do something"); });
This example is not particularly impressive. We're not doing anything we couldn't do before.
The thing is, now we can change the properties of the factory and let all created tasks share those same properties.
The one we're going to change is TaskScheduler.

A scheduler is a class that queues tasks and controls when and how they are executed. We'll be subclassing TaskScheduler.

We want our TaskScheduler to store the pending tasks and then allow its owner to decide when to execute a task.
class UpdateTaskScheduler : TaskScheduler
{
    ConcurrentQueue<Task> tasks = new ConcurrentQueue<Task>();

    protected override void QueueTask(Task task)
    {
        tasks.Enqueue(task);
    }

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return false;
    }

    protected override bool TryDequeue(Task task)
    {
        return false;
    }

    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return new List<Task>(tasks).AsReadOnly();
    }

    public override int MaximumConcurrencyLevel
    {
        get
        {
            return 1;
        }
    }

    bool ExecuteNextTask()
    {
        Task task;
        if (tasks.TryDequeue(out task))
        {
            //Run the dequeued task on the calling (main) thread.
            TryExecuteTask(task);
            return true;
        }
        return false;
    }

    public static TaskScheduler CreateNew(out Func<bool> executionFunction)
    {
        var scheduler = new UpdateTaskScheduler();
        executionFunction = scheduler.ExecuteNextTask;

        return scheduler;
    }
}
Let's see what this does.
First, we offer a mechanism to queue tasks. We disable task dequeuing so once a task is queued, it must run.
We also disable task inlining. Check the Further Reading section for more details.
The GetScheduledTasks() method is used for Visual Studio debugging. We just return our list of pending tasks.
MaximumConcurrencyLevel indicates that only one single task from this scheduler is ever running at any given point in time.
Our ExecuteNextTask function executes a task. It returns true if it did manage to execute a task, false if there were no more tasks to execute.
We offer a CreateNew static method to create new schedulers. The out delegate allows the caller to use the ExecuteNextTask method of its scheduler.

To call the tasks in our main thread, we first create the scheduler and corresponding factory:
Func<bool> executeTask; //stored so that UpdateFrame can call it later
UITaskScheduler = UpdateTaskScheduler.CreateNew(out executeTask);
UITaskFactory = new TaskFactory(new CancellationToken(),
    TaskCreationOptions.HideScheduler,
    TaskContinuationOptions.HideScheduler, UITaskScheduler);
We add the following code to our UpdateFrame method to execute the tasks.
for (int i = 0; i < 50 && executeTask(); ++i)
{
}
Now, we're ready to complete our texture loader.
var uiFactory = Application.CurrentApplication.UITaskFactory;

int counter = 0;

var tcs = new TaskCompletionSource<object>();

var baseTasks = new List<Task>();
foreach (var textureToLoad in texturesToLoad)
{
    var baseTask = new Task<Bitmap>(() =>
    {
        var bitmap = new Bitmap(textureToLoad.AsFilename());
        return bitmap;
    });

    baseTask.ContinueWith(task =>
    {
        uiFactory.StartNew(() => {
            var bitmap = task.Result;

            LoadTextureFromBitmap(bitmap);

            bitmap.Dispose();

            if (Interlocked.Decrement(ref counter) == 0)
            {
                //We have no further textures to load
                tcs.SetResult(null); //Tell TCS we're done!
            }
        });
    });

    baseTasks.Add(baseTask);
}

counter = baseTasks.Count;

foreach (var baseTask in baseTasks)
{
    baseTask.Start();
}

return tcs.Task;
And that's it.

Final Adjustments

We have one final problem.
Running the code above may cause an OutOfMemoryException.
The problem is that we're loading lots of Bitmaps into memory, but we're not uploading them to OpenGL and disposing of them fast enough.
We could try to set MaxDegreeOfParallelism, but that wouldn't solve the problem. MaxDegreeOfParallelism merely limits how many tasks can run at the same time, not how many tasks can be left pending in our UI scheduler.
One possible solution uses semaphores. We create a new semaphore (count = 100) when we start the texture loading method, call semaphore.WaitOne() before allocating a new Bitmap, and call semaphore.Release() after bitmap.Dispose().
That way, there are never more than 100 bitmaps in memory waiting to be uploaded to OpenGL. (EDIT: I'm not sure if this is a good idea. Semaphores are blocking, and tasks should generally avoid blocking calls.)
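As a rough sketch (reusing uiFactory, LoadTextureFromBitmap and AsFilename from the code above, and leaving out the counter/TCS bookkeeping for brevity), the throttled loop could look like this:
//Sketch of the semaphore idea. As noted above, WaitOne blocks the worker thread.
var semaphore = new Semaphore(100, 100); //at most 100 decoded bitmaps alive at once

foreach (var textureToLoad in texturesToLoad)
{
    var baseTask = new Task<Bitmap>(() =>
    {
        semaphore.WaitOne(); //wait for a free slot before decoding
        return new Bitmap(textureToLoad.AsFilename());
    });

    baseTask.ContinueWith(task =>
    {
        uiFactory.StartNew(() => {
            LoadTextureFromBitmap(task.Result);
            task.Result.Dispose();
            semaphore.Release(); //free the slot once the bitmap is gone
        });
    });

    baseTask.Start();
}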
And that's it.

Conclusion

In my own tests, this final proposal takes 1100ms. That's twice as fast as the first approach.
While one second is not that long, this approach could be further tweaked to support slower use cases.
Also, we now have a framework to allow easily mixing worker threads and UI calls with C#'s TPL. Since we've used C# Tasks, our code is now compatible with the new C# 5 syntax sugar, so we could await our texture loading.
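For example (hypothetical usage, assuming the loading code above is wrapped in a LoadTexturesAsync method that returns tcs.Task):
//Hypothetical usage with C# 5 async/await.
async Task ShowLoadingScreenAsync()
{
    await LoadTexturesAsync(texturesToLoad); //completes when the TCS is set
    SwitchToNextScreen(); //hypothetical helper for what comes after loading
}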
That's all.

Further Reading

No complete source code repository this time, sorry.
Still, the relevant code is all in the post, so you should still be able to use it. The code in this post was written by me (Copyright (C) Luís Reis, 2013) and it's MIT licensed.

Amdahl's law - Wikipedia, the free encyclopedia
The reason parallelism does not linearly improve the speed of most applications.

The Open Toolkit Library | OpenTK
The library the code in this post uses for OpenGL calls.

c# - What is the difference between task and thread? - Stack Overflow
If you're having trouble figuring out the difference.

Parallel.ForEach Method (System.Threading.Tasks)
The many ways to run a foreach loop in parallel in .NET.

Task Parallelism (Task Parallel Library)
Explains multiple task-related concepts, including some concepts not covered here such as task cancellation, child tasks, nested tasks, WhenAll and WhenAny.

Asynchronous Programming with Async and Await (C# and Visual Basic)
Shows how the new syntax sugar helps with task-based programming

Task.Wait and "Inlining" - .NET Parallel Programming - Site Home - MSDN Blogs
An overview of what task inlining is and how it works.