Saturday, October 26, 2013

Introducing NWarpAsync - Emulating yield return using await

When C# 5.0 arrived, it came with a brand new language feature - async functions.
Although asynchronous programming was already possible in previous versions of C#, it was cumbersome. Previous async patterns included the usage of methods such as BeginInvoke or Task.ContinueWith. These patterns were both based on callbacks and they were cumbersome to use.

With C# 5.0, those methods are (mostly) a thing of the past, and async/await is "The Next Big Thing".

I have designed a small library I called NWarpAsync that uses async/await for purposes that are not what it was primarily designed for. It's called "NWarpAsync" because it twists async. It starts with a "N" because, really, what open-source project for .NET doesn't?

Required Knowledge

In order to understand this post, you'll need to understand C#, including yield return and the new async/await features.

Implementing Yield with Await

The first (and so far only) feature of NWarpAsync is emulation of the yield return feature (available since C# 2.0) using await.
I got the idea from watching slides about using yield in node.js to emulate await and I wondered how easy it would be to implement the reverse.

This might not seem very compelling but this is mostly a proof-of-concept anyway and it does nicely support a few cases that C# yield does not.
Here is how it is used:
using System.Linq;
using NWarpAsync.Yield;

class Program
{
    static void Main()
    {
        foreach (int value in new Yielder(async yieldSink => {
            await yieldSink.Yield(1);
            await yieldSink.YieldAll(Enumerable.Range(2, 3));
            await yieldSink.Yield(5);
        }))
        {
            //Prints 1 2 3 4 5 (each number in a different line)
            Console.WriteLine(value);
        }
    }
}
As you see, using await as yield requires the use of a few helper classes/helper methods, but I believe the final result is still reasonably easy to understand.
YieldAll is equivalent to yielding all elements in an enumerable (or doing nothing if the enumerable is null). F# supports this (with the yield! operator), but C# has no short equivalent. Instead, we're stuck with if (x != null) foreach (var value in x) yield return value;

How it is works

The most important class of NWarpAsync.Yield is the Yielder<T> class. A Yielder<T> takes an asynchronous function (the "generator" or "builder" function) and makes it available as a plain old IEnumerable<T>.
The function that Yielder<T> takes in turn receives a parameter of type YieldSink<T>, which exposes the yielding methods. Most of the magic is done in Yielder and YieldSink.

Yielder.GetEnumerator starts the whole process. It creates an enumerator that executes a step in the function Yielder was constructed with every time MoveNext is called. Once the function stops executing (either because it has really returned or because it is awaiting the result of the Yield method), MoveNext checks if the function has yielded any values. If so, then that is value of IEnumerator.Current. If not, then the enumerator has finished, and MoveNext returns false.

YieldSink.GetAwaiter().OnCompleted registers the action that the next MoveNext will execute.

If the enumerator is disposed (e.g. with LINQ Take) before the function has finished executing, then an exception is thrown in the current execution point of the generator function to allow IDisposable resources to be cleaned up properly and finally clauses to execute.

Main Problems and Limitations

It took some effort to make exceptions work properly.
Initially, I made Yielder<T> receive an Action<YieldSink<T>> (as opposed to a function returning a Task). It turns out that unhandled exceptions are nasty for void asynchronous methods.
Switching to Task made this easier.

Unfortunately, the default behavior of Tasks means that the consumer of Yielder will never see "normal" exceptions. Instead, it'll always receive an AggregateException. This could probably be fixed, but I'm not sure it matters. I'll fix it when I'm convinced it's important.

The second problem occurs when the developer writes sink.Yield(value), but forgets to await it. The result is an ugly exception on the next Yield call which simply could not happen with the native yield. In other words, it's easier to get this approach wrong.

Advantages

If await was implemented back in C# 2.0 and there was a new proposal to add yield, I think this library would pretty much kill the arguments in favor of it. Yield can purely be a library feature if await is a language feature. Of course, it's the other way around and yield is here to stay, so that point does not hold.
That said, if language designers want to make a new language that includes await and are wondering if yield matters, I hope this can be helpful.

As I mentioned above, this library supports a F# yield!-like feature that C# does not have.

However, the biggest advantage is that, unlike yield, async/await can be used in lambdas. This means that, for the first time in C#, it is possible to make an anonymous generator function! This is what I did in the example I showed above.
You can even yield enumerations of enumerations using LINQ SelectMany.

Conclusion

Async/Await is a powerful language feature that allows a method to "pause" and then "resume" later.
Features like yield do the same. As such, it's not too hard to implement yield on top of async/await with the help of a proper library.

I'm curious to find out what other "missing" language features can be implemented using async/await. Perhaps some sort of asynchronous collection?

Further Reading

luiscubal/NWarpAsync
The official github repository.

Coroutine - Wikipedia, the free encyclopedia
Coroutines are special functions that can be suspended and resumed. Yield and Async/Await are two coroutine mechanisms.

yield (C# Reference)
The official documentation for yield in C#

await (C# Reference)
The official documentation for await in C#

await anything; - .NET Parallel Programming - Site Home - MSDN Blogs
A blog post explaining how to use await on objects that aren't Tasks.

How yield will transform Node.js
A blog post showing await on top of async in JavaScript.

Wednesday, October 16, 2013

Null Value Analysis

One of the features I implemented in NRefactory this summer of code was null value analysis. I'll describe here what it is, what it is useful for and what its limitations are.

Required Knowledge

  1. You'll need to understand basic programming concepts, such as how if statements work;
  2. It will help if you understand what directed graphs and control flow graphs (CFGs) are;

The Problem Null Analysis Tries to Solve

Null Value Analysis is a tool (meant to be used by code issues) that tries to predict if a given value is null or not.
Knowing if a value is null is necessary to detect some redundancies, null reference exceptions and unreachable code.

object o = null;
if (o == null) {
   //The if statement is redundant, this is always executed
} else {
   //Unreachable code. This never happens.
}
string s = o.ToString(); //NullReferenceException here
//Since the previous line throws an exception, the following
// code is unreachable.
Console.WriteLine("Unreachable");
Code issues can use this information to warn the user about potential bugs and redundancies.
There are currently two code issues that use null analysis: ConstantNullCoalescingConditionIssue (that detects code like always_null ?? something) and UseOfMemberOfNullReference (that detects always_null.SomeMember). There are other issues that could benefit from this, but that feature was not yet implemented.

Overview

Null analysis is based on the concept of Control Flow Graph (CFG). The CFG for the above method is:
What the null value analyser does is traverse this graph and figure out how each statement changes the local variables.
Let's see what happens:
  1. This method has no parameters, so it doesn't need to setup any parameter variables;
  2. It enters the method;
  3. It declares a new variable called "o" which is "definitely null";
  4. It checks the if (o == null) condition. Because both "o" and "null" are known to be "definitely null", the analyser concludes that the result of the expression is "known to be true";
  5. Because the expression is known to be true, it visits the if block, but not the else block;
  6. It leaves the if block (since it's empty);
  7. It checks the expression o.ToString(). Since ToString is not an extension method nor System.Nullable.HasValue, we know that for this statement to be successful, o must not be null. And yet, o is known to be null. So the null analysis considers this to throw an exception (and indeed it does - a NullReferenceException);
  8. The node is not in a try block, so we end the analysis.
Notice how nodes #8, #9 and #10 are not even visited.

Now, let's see how null analysis handles loops.

object o = null;
while (o == null) {
   Console.WriteLine(o == null ? "Foo" : "Bar");
}
//The end of the loop is never reached 
It can't just keep visiting the loop over and over again. It's an infinite loop, so the analysis would be infinite as well. Instead, it just revisits the loop if it hasn't already visited the loop once for the current set of variable states. But let's see how it behaves in this particular case:
  1. object o is defined as being "definitely null";
  2. Since o is definitely null (because it was declared as such in the previous line), and so is null (by definition), the loop condition is true. It hasn't visited the loop yet, so it'll do so for [o=DefinitelyNull].
  3. It visits Console.WriteLine. Because o is known to be null, the expression will always return "Foo".
  4. The loop ends. Because all variables (just o, in this case) are just the way they were when the loop began [o=DefinitelyNull], there's no need to visit it again.
  5. Note that the code after the loop is never visited, because the condition is always true.
Now consider the following:
object o = null;
object p = null;
while (o == null) {
   Console.WriteLine(p == null ? "Foo" : "Bar");
   p = 1;
}
//The end of the loop is never reached
In this case, each node in the loop is visited twice. The first time for "p is definitely null" and the second time for "p is definitely not null".

The analyser keeps a dictionary (map) containing the local variable states. For instance, in the example above, the contents of the loop are visited once with the state [o=DefinitelyNull p=DefinitelyNull] and a second time with the state [o=DefinitelyNull p=DefinitelyNotNull].
When the node is visited, the analysis tries to figure out what the state of the variables will be after it.

For instance, when the analyser finds the expresssion p=1, it modifies the state dictionary to have [p=DefinitelyNotNull] (while keeping the remaining variables unchanged).

Expressions

Initially, the plan was to have a pure null value analysis. That is, everything was treated being null, not-null or perhaps-null-not-really-sure.
For variables, this turned out to be reasonable in many cases. For expressions, it wasn't.
The problem are expressions such as (x == null ? x : 1). What does that do? Well, if x is null, it returns null. If x is not null, then it returns a value that isn't null. More to the point, the behavior depends on whether "x == null" is true or false.
The analyser can only handle an expression like this if it keeps track of what's true and what's false. As such, I eventually decided that even if only the null status of variables should be stored in the state, keeping track of the boolean value of expressions was essential.
Not only that, the analyser detects that expressions such as "x == null" can only be true if x is null. So even if the value of x is not known at all, it knows that the code in if(x == null) will only be executed if x is null. So the code embedded in the if statement will have [x=DefinitelyNull] even if x was not previously known.

Another problematic case are lambdas. In C#, lambdas are capable of capturing variables.
Consider this code:

object o = null;
Action action = () => { o = 1; };
Foo (action);
//Some more code goes here
So what will be the value of o later in this function? We don't know. It might be null if action is executed. But when will action be executed? We don't know. Maybe imediately, maybe after another method is caled, maybe randomly, by another thread. Again, we don't know.

In these cases, the analyser pretty much gives up and classifies o as captured unknown. Not only it doesn't know the value, it won't ever know the value. Even if we set the variable later, it'll still remain captured.
The only way for a captured variable to become uncaptured is to go out of scope, and then declared again. Also, C# 5 foreach loops uncapture the variable every time the loop starts.
Note that although variables can be captured unknown, expressions can not. The result of an expression with a captured variable is just unknown. If we have a variable named "x" that is captured and run "y = x;", then the value of y is unknown, but it isn't captured. We can always do "y = null;" and it'll become DefinitelyNull, unlike captured variables.

When the algorithm finds ||, &&, ?: or ?? conditions, it tries to exclude operands that are known to never be executed (e.g. y in x ?? y, when x is DefinitelyNotNull). When that is not possible, it tries to "merge" the results of the expression. For instance, (x == null ? null : (x = null)) gives us a result of null and [x=DefinitelyNull] because that's what happens in both outcomes.

Query expressions are more complicated, since we need to consider the possibility of "from x in emptyEnumerable".

Semantics and Annotations

Unfortunately, null value analysis works poorly across functions.

Consider the following code:
int? Foo() {
   return 1;
}
void Bar() {
   //Assume this is the function being analysed
   var x = Foo();
   var y = x ?? 1;
}
In this case, x is never null, so the null coallescing statement is useless.
However, null value analysis doesn't know that. It'll treat "x" as an unknown value.

Fortunately, there's a way around that. Although null value analysis can't figure out what Foo returns, we can help it with annotations.

Null Value Analysis supports a subset of JetBrains.Annotations. These annotations are C# attributes that can be used to provide hints about the possible return values of methods and delegates. Additionally, they can also be applied to fields, properties, delegates and parameters (including out parameters).
[JetBrains.Annotations.NotNull]
int? Foo() {
   return 1;
}
void Bar() {
   //Assume this is the function being analysed
   var x = Foo();
   var y = x ?? 1;
}
In this modified version, the analyser will trigger the warning.

You can use JetBrains' assembly or you can write your own attribute classes. All the analyser cares about is the full class name of the attribute. If you get that right, then it doesn't matter where it comes from.

Annotations can also be used to identify a method as an assertion. Assertions are used to indicate that the program can only continue operating correctly if certain conditions are true -- it is very common to use assertion methods in unit tests.

using JetBrains.Annotations;

class TestClass
{
   [AssertionMethod]
   void Assert([AssertionCondition(AssertionConditionType.IS_TRUE)] bool condition) {
   }
   void TestMethod(string x)
   {
      Assert(x != null);
   }
}
In this case,the analyser will assume that x is not null after the assertion. Normally, the assertion method will throw an exception or abort the program when the condition fails.

Main Limitations

Throughout this post I've mentioned some of the problems of null value analysis. Still, this is important enough to deserve a dedicated section.

First of all, as I mentioned above, null value analysis does not keep track of the value of variables aside from null/not-null.
In particular, this means the analyser is clueless when it comes to recognizing patterns such as this:
bool x = true;
object o = null;
if (x) {
   o = "";
}
In this example, o will always end with the value "", but the analyser doesn't know that. It doesn't even recognize that o can't end up null. Because it doesn't keep track of the value of x, so it must assume that there's a chance the if statement won't be entered.
This limitation only applies to variables. The analyser does detect when specific expressions always return true.
I am convinced that boolean values could be added to the analyser. But the algorithm would be too slow for integers and even ocasionally get stuck in infinite loops for strings.
The reason this would happen is because every time a previously visited node is reached more than once (this happens e.g. in loops) the analyser must decide whether to re-visit the node. For that, it checks what values the variables in scope had the last times that node was visited. If the variables are the same, nothing changed, so there's no point in visiting the node again. If the variables changed, then revisiting the node is necessary for correctness.
This works for null/not-null because there are only two possible values. Even considering states such as "Maybe Null" or "Unknown, Captured by Lambda", the total number of states remains low. Moreover, many states "include" others. That is, if a node was visited for [x=MaybeNull], it won't be revisited for [x=DefinitelyNull], because [DefinitelyNull] is included in the [MaybeNull] scenario.
This means that there is an upper bound to the number of times a node can be visited (depending on the number of variables in scope and the possible values a variable can have). Most functions have few variables changing values all the time.
For booleans, everything is still okay. But for integers, the number of possible values a variable can have goes up and, with it, the number of times a node could be revisited. A lot.

int i = int.MinValue;
while (i != int.MaxValue) {
   ++i;
}

This small function triggers a sore spot in an hypothetical analyser that tracks the value of integers. The null analyser isn't too bothered by this loop. It only enters it once.
But if the analyser kept track of the value of i, then it would detect that the value of the loop has changed every iteration. Because of that, it'd revisit the loop. But that would change the value again. So the analyser would visit the loop again. And again. And again. Around 2^32 times.

Strings have infinite-for-all-practical-purposes possible combinations, and as such would trigger infinite loops in the "smarter" analyser.

For integers and strings, a completely different algorithm would be required.

The second major limitation is the way expressions are handled when it comes to control flow.
The Null Value Analyser is based on NRefactory's built-in Control Flow Graph classes. Unfortunately, these classes are very statement-centric and do not handle expression-level control flow.
In other words, the CFGs know how to treat if statements, but not ? : expressions.

My solution was to patch a limited form of expression control flow detection on top of the statement CFG.
Whenever an if statement is encountered, the analyser handles the condition, then schedules two nodes to be visited -- the "true" case node, and the "false" case node. Two visits.
Whenever || is encountered, the analyser handles the left side, then handles the right side, then combines the result of both sides and continues. When the expression statement is over, the next node is scheduled to be visited based on the combined result. One visit.
This means the analyser is faster but less accurate for expressions than for statements.
if (x == null || y == null) {
   if (x != null) {
      // y is always null here, but the analyser doesn't know that,
      // because the top if statement only saw the combined result
      // of the || expression.
   }
}

The third major limitation is related to missing method annotations. Specifically, Null Value Analysis does not implement the complex ContractAnnotationAttribute. This attribute would allow programmers to give the analyser highly specialized hints, such as "this function never returns true if the parameter is null".
Because Null Value Analysis doesn't support any equivalent annotation, methods such as string.IsNullOrEmpty can't be supported properly.

The final major limitation is the lack of external annotations.
The analyser currently only recognizes annotations in the source code itself. For third-party code, the only way to add annotations is to modify the source code. This is undesirable (not to mention highly impractical) and external annotations are definitely on the wishlist, but nothing is currently available.
In short, most libraries (including standard .NET libraries) are currently hostile to the analyser.

Final Words

I hope null value analysis ends up being useful. It can certainly work reasonably well in theory and in a few test projects (notably NRefactory, MonoDevelop and xsp) but the world of code is a lot vaster than that, so it's still unknown if it'll survive under the harsh light of real-world usage. Time will tell, I guess.

This year's GSoC is over and I passed the final evaluation. Overall, I think null analysis was the hardest thing I wrote.
I enjoyed working in MonoDevelop and NRefactory and I'll certainly try to contribute again, if I have some time to spare. I actually learned a few new C# features and I got exposed to some patterns and paradigms I hadn't given much though before.

And that's it for today.

Further Reading

Official NRefactory github repository (icsharpcode/NRefactory)
This is where the upstream version lives.

Summer of Code 2013 NRefactory github repository (mono-soc-2013/NRefactory)
This is where the students push their code to before trying to get it upstream.

Control flow graph - Wikipedia, the free encyclopedia
Wikipedia explanation of what a CFG is.

Halting problem - Wikipedia, the free encyclopedia
Undecidable problem (as in, problem for which an exact solution is known to be inexistent) on computers. A perfect null value analyser would be capable of solving the halting problem and, as such, is impossible to make.