Monday, December 23, 2013

Comparison of two worlds - Await and Yield in C# and JavaScript

A few months ago, I published the source code of NWarpAsync.Yield, a C# library that emulated "yield" (a C# language feature) on top of "await" (another C# language feature).
This project is possible because await and yield are essentially the same feature. This is something some members of the JavaScript developer community noticed before I did and have exploited it to their benefit: JavaScript (as of ECMAScript version 6, which is still not finalized) already enables C#-style simple asynchronous programming.

I'll cover the main differences between the two approaches here, as well as why the JavaScript approach wouldn't work as nicely in C#, even though something similar would be possible. I do have a working prototype showing how it can be done (check the first link in the "Further Reading" section).

Required Knowledge

In order to understand this post, you should understand what generator functions are. Knowing C# and JavaScript will help greatly.

Yield in C#

I'd like to start this post by pointing out that yield in C# is not identical to yield in JavaScript. Unfortunately, as we'll see, yield in C# is less flexible than yield in JavaScript.

In C#, generator functions are those that return IEnumerable, IEnumerable<T>, IEnumerator or IEnumerator<T>. For the purposes of this post, we'll focus on IEnumerable<T>.
For those who are not familiar with C# terminology, an IEnumerable<T> is an object that can be iterated over.

Here's what a typical C# generator function looks like:
public IEnumerable<int> GetValues(int count)
{
   for (int i = 0; i < count; ++i)
   {
      yield return i;
   }
}
This method can now be used like any other method.

Enumerables in C# are classes that declare a method named "GetEnumerator" that in turn returns an IEnumerator<T>. "Enumerator" is the C# name for "iterator".

It is possible to iterate multiple times on the same instance returned by our GetValues method. Every time GetEnumerator is called, the function starts from scratch.

IEnumerator itself is an interface. The most important members of this interface are MoveNext and Current.
When MoveNext is called, the function is executed up to the point of the next "yield return" (or the end of the method, if there aren't any more "yield return" statements).
MoveNext returns true if the generator reached a "yield return" statement, false if it reached the end.

The value returned by the "yield return" statement can be read by the "Current" property.

This has two important implications:
  1. No code in the function is executed until "MoveNext" is first called;
  2. The code is lazily evaluated, meaning that if, after a yield return statement is executed, MoveNext is never called again, then no more code is executed.

The second implication means we can get away with generators that return infinite collections without having the application hang, provided that the caller stops calling "MoveNext" (which will always return true in this case) at some point.
public IEnumerable<int> GetAllIntegers()
{
   // This method is valid.
   // Just don't use it in a foreach loop without a break statement.
   int i = int.MinValue;
   for (;;)
   {
      // Wraps around back to negative numbers on overflow.
      yield return unchecked(i++);
   }
}
So let's review the contract associated with yield in C#:
  1. The caller calls a method that returns an IEnumerable
  2. The caller calls enumerable.GetEnumerator()
  3. The caller calls enumerator.MoveNext() followed by enumerator.Current as many times as it wants to.
To handle finally blocks, the IEnumerator<T> interface also defines a Dispose method. When the Dispose method is called, the code in the applicable "finally" blocks is executed and the enumerator is considered terminated.

Yield in JavaScript

Yield in JavaScript is recent. So recent, in fact, that the standard it is described in is still just a draft.

Let's take a look at what yield in JavaScript looks like:
function * generator(count) {
   for (var i = 0; i < count; ++i) {
      yield i;
   }
}
Dynamic typing aside, the main differences we can see so far between JavaScript and C# are:
  1. Generator functions in JavaScript must be explicitly marked as such with the "*" after the "function" keyword (in C#, it is inferred from the function body);
  2. "yield" is used instead of "yield return".

However, the two languages do not have identical semantics.
In C#, generator functions can either return an enumerable or an enumerator.
In JavaScript, the generator function always returns the iterator (enumerator) directly.

The iterator is an object with a method called "next", which replaces "MoveNext" and "Current". The next() function returns an object that indicates the yielded value and whether the function is over.
function * generator() {
   yield 1;
   return 2;
}
var g = generator();
g.next(); //returns { value: 1, done: false }
g.next(); //returns { value: 2, done: true }
g.next(); //throws an exception in the draft; the finished spec instead returns { value: undefined, done: true }
Note that "return" is a valid statement in generator functions, as opposed to C#, where mixing the two styles in one function results in a compile error.
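Another consequence of returning the iterator directly is that, unlike a C# IEnumerable, a JavaScript generator object cannot be restarted; to iterate again, you call the generator function again. A quick sketch:

```javascript
function * counter() {
   yield 1;
   yield 2;
}
var a = counter();
var b = counter(); // an independent iterator
a.next(); // { value: 1, done: false }
a.next(); // { value: 2, done: false }
b.next(); // { value: 1, done: false } -- b starts from the beginning
```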

Another big difference is that yield in C# is a statement, but yield in JavaScript can be used in an expression. This means the following is valid:
function * generator() {
   return yield 1;
}
var g = generator();
g.next(); //returns { value: 1, done: false }
g.next(); //returns { value: undefined, done: true }
So now, what does the "yield" expression return? In this case, it returned undefined, but clearly it wouldn't be very useful if that were always the case.
Indeed, the yield expression returns the argument of the "next" function. Since we did not specify any arguments, undefined was returned, but that can change:
function * generator() {
   return yield 1;
}
var g = generator();
g.next(); //returns { value: 1, done: false }
g.next(42); //returns { value: 42, done: true }
This isn't the only feature iterators have. Generator objects have another method called "throw". When "throw" is called, execution of the generator function resumes. However, instead of returning a value, yield throws an exception.
function * generator() {
   try {
      yield 1;
      return "success";
   } catch (e) {
      return e;
   }
}
var g = generator();
g.next(); // { value: 1, done: false }
g.next(); // { value: 'success', done: true }
g = generator();
g.next(); // { value: 1, done: false }
g.throw('my exception'); // { value: 'my exception', done: true }
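As an aside, generator objects in the finished ES2015 specification (this may not match the exact draft discussed here) also have a "return" method, which is roughly JavaScript's counterpart to calling Dispose on a C# enumerator: it behaves as if a return statement executed at the paused yield, so pending finally blocks get to run.

```javascript
function * resources() {
   try {
      yield 1;
      yield 2;
   } finally {
      console.log("cleanup");
   }
}
var g = resources();
g.next();      // { value: 1, done: false }
g.return(99);  // prints "cleanup", then returns { value: 99, done: true }
```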
As we'll see, these two differences -- yield being an expression and JavaScript iterators having a "throw" function -- are what make asynchronous programming easy with JavaScript yield but cumbersome with C# yield, which ultimately explains why C# needed something new.

Asynchronous programming

In many ways, asynchronous functions are a lot like generators. They perform operations, but then are paused waiting for other events to happen before they can be resumed.
page := Wait for Document to be retrieved from the Internet
words := page.Words
return words.Count
In languages such as Java, the easiest way to do this is to start a new thread and run the operations in that thread, blocking whenever a slow operation is found.
We'd not be using threads to take advantage of the multi-core age or anything like that. We'd be using threads to avoid blocking the main program thread, because the alternative - manually setting up events and listeners - is boring, ugly and error-prone. Of course, threads have their own overhead.

If we can "pause" and "resume" functions, we no longer need to use threads for basic asynchronous programming.
In C#, the chosen syntax for this was:
public async Task<int> GetWordCount(string url) {
   var page = await DownloadDocumentAsync(url);
   var words = page.Words;
   return words.Count();
}
Note that asynchronous methods are marked with the "async" keyword and return a Task instance.
In JavaScript, there are libraries (e.g. Q) that implement async/await using yield.
var getWordCount = Q.async(function*(url) {
   var page = yield downloadDocument(url);
   var words = page.words;
   return words.count();
});
The end-result is similar. But let's see what Q does to transform yield into await.

Q is a library that implements promises in JavaScript. In Q, promises are objects with a few special functions. One of the most important is "then". The then function takes two callbacks as arguments, which it calls when the operation terminates successfully (first callback) or terminates with an exception (second callback).

Each time Q calls the iterator's next() function, the generator function can yield a promise, return a final value or throw an exception.
If it throws an exception, then Q rejects the overall promise, so the error callback is called.
If it returns a value, then Q resolves the overall promise with that value.
If it yields a promise, then Q calls the then function on that promise to register two callbacks. The success callback of that promise goes back to the beginning of these steps to call next() again (with the argument of next being the result of the promise), and so on. The error callback of that promise uses the iterator's throw function to notify the generator function that an error occurred. As such, errors in the promise can be caught by a normal try...catch statement inside the generator.
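To make those steps concrete, here is a minimal sketch of such a driving loop, assuming ES6 generators and a Promise implementation. The name "run" is made up for this example; it is not Q's actual API (Q's real entry point is Q.async).

```javascript
// A minimal sketch of the driving loop described above.
function run(generatorFunction) {
   return function () {
      var generator = generatorFunction.apply(this, arguments);
      return new Promise(function (resolve, reject) {
         function step(method, arg) {
            var result;
            try {
               result = generator[method](arg); // next() or throw()
            } catch (e) {
               return reject(e); // the generator function itself threw
            }
            if (result.done) {
               return resolve(result.value); // a plain return value
            }
            // A yielded promise: resume with its result,
            // or inject its error via iterator.throw.
            Promise.resolve(result.value).then(
               function (value) { step("next", value); },
               function (error) { step("throw", error); }
            );
         }
         step("next", undefined);
      });
   };
}

// Usage:
var getAnswer = run(function * () {
   var value = yield Promise.resolve(41);
   return value + 1;
});
getAnswer().then(function (answer) {
   console.log(answer); // 42
});
```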

Emulating Await in C#

In C#, yield wouldn't work as nicely as a replacement for await.
For starters, yield is more restrictive than await: yield cannot be used in anonymous delegates/lambdas. But that's not the main reason.

"value = await promise;" is very useful, and it can only compile because "await promise" is an expression - "value = yield return promise;" would not compile, since yield return is a statement. Additionally, unlike await and JavaScript yield, C# yield offers no way to notify the generator function when an exception is supposed to happen.

Still, it can be done. It's not pretty, but it can be done. I've updated my NWarpAsync project to include a library (EmulateAwait) that does exactly this. This is what it looks like:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using NWarpAsync.EmulateAwait;

class Program
{
    static IEnumerable<TaskIteration<int>> AsyncMethod(AsyncContext<int> context)
    {
        int arg1 = context.Argument<int>(0);
        string arg2 = context.Argument<string>(1);

        yield return context.Await(Task.FromResult(arg1 + arg2.Length));
        int result = context.GrabLastValue();

        yield return context.Return(result * 2);
    }

    public static void Main()
    {
        Task<int> task = AsyncBuilder.FromGenerator<int>(AsyncMethod, 10, "Hello");
        Console.WriteLine(task.Result); //Prints 30
    }
}
And that's it. I have reused the TPL here because it existed even before await was added to C#. The goal of this library is to emulate await using yield, so whatever the library uses internally is fair game.

Whereas NWarpAsync.Yield had a few practical uses, this one does not. It is merely an exercise in "Could X be done to do Y?"
Well, the answer is: Yes, it can. But it shouldn't.

If you're a C# user, stick to await.
But if you're someone developing a new programming language and deciding which features to include, I hope this post was helpful.

Merry Christmas.

Further Reading

luiscubal/NWarpAsync
The official github repository for NWarpAsync (including EmulateAwait).

Coroutine - Wikipedia, the free encyclopedia
Coroutines are special functions that can be suspended and resumed. Yield and Async/Await are two coroutine mechanisms.

luiscubal's code sketchbook: Introducing NWarpAsync - Emulating yield return using await
A previous blog post describing the reverse (implementing yield on top of await)

How yield will transform Node.js
A blog post describing how yield and promises can be used to simplify node.js development.

Saturday, October 26, 2013

Introducing NWarpAsync - Emulating yield return using await

When C# 5.0 arrived, it came with a brand new language feature - async functions.
Although asynchronous programming was already possible in previous versions of C#, it was cumbersome: the available patterns, based on methods such as BeginInvoke or Task.ContinueWith, relied on callbacks and were awkward to use.

With C# 5.0, those methods are (mostly) a thing of the past, and async/await is "The Next Big Thing".

I have designed a small library I called NWarpAsync that uses async/await for purposes that are not what it was primarily designed for. It's called "NWarpAsync" because it twists async. It starts with an "N" because, really, what open-source project for .NET doesn't?

Required Knowledge

In order to understand this post, you'll need to understand C#, including yield return and the new async/await features.

Implementing Yield with Await

The first (and so far only) feature of NWarpAsync is emulation of the yield return feature (available since C# 2.0) using await.
I got the idea from watching slides about using yield in node.js to emulate await and I wondered how easy it would be to implement the reverse.

This might not seem very compelling but this is mostly a proof-of-concept anyway and it does nicely support a few cases that C# yield does not.
Here is how it is used:
using System;
using System.Linq;
using NWarpAsync.Yield;

class Program
{
    static void Main()
    {
        foreach (int value in new Yielder(async yieldSink => {
            await yieldSink.Yield(1);
            await yieldSink.YieldAll(Enumerable.Range(2, 3));
            await yieldSink.Yield(5);
        }))
        {
            //Prints 1 2 3 4 5 (each number in a different line)
            Console.WriteLine(value);
        }
    }
}
As you see, using await as yield requires the use of a few helper classes/helper methods, but I believe the final result is still reasonably easy to understand.
YieldAll is equivalent to yielding all elements in an enumerable (or doing nothing if the enumerable is null). F# supports this (with the yield! operator), but C# has no short equivalent. Instead, we're stuck with if (x != null) foreach (var value in x) yield return value;
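Interestingly, ES6 JavaScript does have a short form for this: the yield* operator delegates to another iterable, much like F#'s yield! (and like YieldAll here). A small example:

```javascript
function * inner() {
   yield 2;
   yield 3;
}
function * outer() {
   yield 1;
   yield* inner(); // delegates: yields 2, then 3
   yield 4;
}
var values = [];
for (var v of outer()) {
   values.push(v);
}
// values is now [1, 2, 3, 4]
```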

How it works

The most important class of NWarpAsync.Yield is the Yielder<T> class. A Yielder<T> takes an asynchronous function (the "generator" or "builder" function) and makes it available as a plain old IEnumerable<T>.
The function that Yielder<T> takes in turn receives a parameter of type YieldSink<T>, which exposes the yielding methods. Most of the magic is done in Yielder and YieldSink.

Yielder.GetEnumerator starts the whole process. It creates an enumerator that executes a step in the function Yielder was constructed with every time MoveNext is called. Once the function stops executing (either because it has really returned or because it is awaiting the result of the Yield method), MoveNext checks if the function has yielded any values. If so, then that is the value of IEnumerator.Current. If not, then the enumerator has finished, and MoveNext returns false.

YieldSink.GetAwaiter().OnCompleted registers the action that the next MoveNext will execute.

If the enumerator is disposed (e.g. with LINQ Take) before the function has finished executing, then an exception is thrown in the current execution point of the generator function to allow IDisposable resources to be cleaned up properly and finally clauses to execute.

Main Problems and Limitations

It took some effort to make exceptions work properly.
Initially, I made Yielder<T> receive an Action<YieldSink<T>> (as opposed to a function returning a Task). It turns out that unhandled exceptions are nasty for void asynchronous methods.
Switching to Task made this easier.

Unfortunately, the default behavior of Tasks means that the consumer of Yielder will never see "normal" exceptions. Instead, it'll always receive an AggregateException. This could probably be fixed, but I'm not sure it matters. I'll fix it when I'm convinced it's important.

The second problem occurs when the developer writes sink.Yield(value), but forgets to await it. The result is an ugly exception on the next Yield call which simply could not happen with the native yield. In other words, it's easier to get this approach wrong.

Advantages

If await had been implemented back in C# 2.0 and there were a new proposal to add yield, I think this library would pretty much kill the arguments in favor of it. Yield can be purely a library feature if await is a language feature. Of course, it happened the other way around and yield is here to stay, so that point does not hold.
That said, if language designers want to make a new language that includes await and are wondering if yield matters, I hope this can be helpful.

As I mentioned above, this library supports an F# yield!-like feature that C# does not have.

However, the biggest advantage is that, unlike yield, async/await can be used in lambdas. This means that, for the first time in C#, it is possible to make an anonymous generator function! This is what I did in the example I showed above.
You can even yield enumerations of enumerations using LINQ SelectMany.

Conclusion

Async/Await is a powerful language feature that allows a method to "pause" and then "resume" later.
Features like yield do the same. As such, it's not too hard to implement yield on top of async/await with the help of a proper library.

I'm curious to find out what other "missing" language features can be implemented using async/await. Perhaps some sort of asynchronous collection?

Further Reading

luiscubal/NWarpAsync
The official github repository.

Coroutine - Wikipedia, the free encyclopedia
Coroutines are special functions that can be suspended and resumed. Yield and Async/Await are two coroutine mechanisms.

yield (C# Reference)
The official documentation for yield in C#

await (C# Reference)
The official documentation for await in C#

await anything; - .NET Parallel Programming - Site Home - MSDN Blogs
A blog post explaining how to use await on objects that aren't Tasks.

How yield will transform Node.js
A blog post showing await on top of async in JavaScript.

Wednesday, October 16, 2013

Null Value Analysis

One of the features I implemented in NRefactory this summer of code was null value analysis. I'll describe here what it is, what it is useful for and what its limitations are.

Required Knowledge

  1. You'll need to understand basic programming concepts, such as how if statements work;
  2. It will help if you understand what directed graphs and control flow graphs (CFGs) are;

The Problem Null Analysis Tries to Solve

Null Value Analysis is a tool (meant to be used by code issues) that tries to predict if a given value is null or not.
Knowing if a value is null is necessary to detect some redundancies, null reference exceptions and unreachable code.

object o = null;
if (o == null) {
   //The if statement is redundant, this is always executed
} else {
   //Unreachable code. This never happens.
}
string s = o.ToString(); //NullReferenceException here
//Since the previous line throws an exception, the following
// code is unreachable.
Console.WriteLine("Unreachable");
Code issues can use this information to warn the user about potential bugs and redundancies.
There are currently two code issues that use null analysis: ConstantNullCoalescingConditionIssue (that detects code like always_null ?? something) and UseOfMemberOfNullReference (that detects always_null.SomeMember). There are other issues that could benefit from this, but that feature was not yet implemented.

Overview

Null analysis is based on the concept of Control Flow Graph (CFG). [Figure: the CFG of the method above.]
What the null value analyser does is traverse this graph and figure out how each statement changes the local variables.
Let's see what happens:
  1. This method has no parameters, so it doesn't need to setup any parameter variables;
  2. It enters the method;
  3. It declares a new variable called "o" which is "definitely null";
  4. It checks the if (o == null) condition. Because both "o" and "null" are known to be "definitely null", the analyser concludes that the result of the expression is "known to be true";
  5. Because the expression is known to be true, it visits the if block, but not the else block;
  6. It leaves the if block (since it's empty);
  7. It checks the expression o.ToString(). Since ToString is neither an extension method nor System.Nullable.HasValue, we know that for this statement to succeed, o must not be null. And yet, o is known to be null. So the null analysis considers this to throw an exception (and indeed it does - a NullReferenceException);
  8. The node is not in a try block, so we end the analysis.
Notice how nodes #8, #9 and #10 are not even visited.

Now, let's see how null analysis handles loops.

object o = null;
while (o == null) {
   Console.WriteLine(o == null ? "Foo" : "Bar");
}
//The end of the loop is never reached 
It can't just keep visiting the loop over and over again. It's an infinite loop, so the analysis would be infinite as well. Instead, it just revisits the loop if it hasn't already visited the loop once for the current set of variable states. But let's see how it behaves in this particular case:
  1. object o is defined as being "definitely null";
  2. Since o is definitely null (because it was declared as such in the previous line) and null is, by definition, null, the loop condition is known to be true. It hasn't visited the loop yet, so it'll do so for [o=DefinitelyNull].
  3. It visits Console.WriteLine. Because o is known to be null, the expression will always return "Foo".
  4. The loop ends. Because all variables (just o, in this case) are just the way they were when the loop began [o=DefinitelyNull], there's no need to visit it again.
  5. Note that the code after the loop is never visited, because the condition is always true.
Now consider the following:
object o = null;
object p = null;
while (o == null) {
   Console.WriteLine(p == null ? "Foo" : "Bar");
   p = 1;
}
//The end of the loop is never reached
In this case, each node in the loop is visited twice. The first time for "p is definitely null" and the second time for "p is definitely not null".

The analyser keeps a dictionary (map) containing the local variable states. For instance, in the example above, the contents of the loop are visited once with the state [o=DefinitelyNull p=DefinitelyNull] and a second time with the state [o=DefinitelyNull p=DefinitelyNotNull].
When the node is visited, the analysis tries to figure out what the state of the variables will be after it.

For instance, when the analyser finds the expression p=1, it modifies the state dictionary to have [p=DefinitelyNotNull] (while keeping the remaining variables unchanged).
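The visit-only-for-new-states rule can be sketched as a toy worklist algorithm. This is just an illustration of the idea in JavaScript, not NRefactory's actual implementation; all names here are made up:

```javascript
var NULL = "DefinitelyNull", NOT_NULL = "DefinitelyNotNull";

// Each node has a transfer function (how its statement changes the
// variable states) and a list of successor node ids. A node is
// revisited only when reached with a state it has never seen before.
function analyse(nodes, entryId, initialState) {
   var seen = {};   // node id -> { serialized state -> true }
   var worklist = [{ id: entryId, state: initialState }];
   var visits = 0;
   while (worklist.length > 0) {
      var item = worklist.pop();
      var key = JSON.stringify(item.state);
      seen[item.id] = seen[item.id] || {};
      if (seen[item.id][key]) {
         continue; // already visited this node with this exact state
      }
      seen[item.id][key] = true;
      ++visits;
      var outState = nodes[item.id].transfer(item.state);
      nodes[item.id].next.forEach(function (successor) {
         worklist.push({ id: successor, state: outState });
      });
   }
   return visits;
}

// The second loop example above: o stays null, p becomes not-null
// after the first iteration, so the loop body is visited exactly twice.
var visits = analyse({
   loopBody: {
      transfer: function (state) { return { o: NULL, p: NOT_NULL }; },
      next: ["loopBody"] // the back edge of the loop
   }
}, "loopBody", { o: NULL, p: NULL });
console.log(visits); // 2
```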

Expressions

Initially, the plan was to have a pure null value analysis. That is, everything was treated as being null, not-null or perhaps-null-not-really-sure.
For variables, this turned out to be reasonable in many cases. For expressions, it wasn't.
The problem is expressions such as (x == null ? x : 1). What does that do? Well, if x is null, it returns null. If x is not null, then it returns a value that isn't null. More to the point, the behavior depends on whether "x == null" is true or false.
The analyser can only handle an expression like this if it keeps track of what's true and what's false. As such, I eventually decided that even if only the null status of variables should be stored in the state, keeping track of the boolean value of expressions was essential.
Not only that, the analyser detects that expressions such as "x == null" can only be true if x is null. So even if the value of x is not known at all, it knows that the code in if(x == null) will only be executed if x is null. So the code embedded in the if statement will have [x=DefinitelyNull] even if x was not previously known.

Another problematic case is lambdas. In C#, lambdas are capable of capturing variables.
Consider this code:

object o = null;
Action action = () => { o = 1; };
Foo (action);
//Some more code goes here
So what will be the value of o later in this function? We don't know. It might no longer be null if action has been executed. But when will action be executed? We don't know. Maybe immediately, maybe after another method is called, maybe randomly, by another thread. Again, we don't know.

In these cases, the analyser pretty much gives up and classifies o as captured unknown. Not only does it not know the value, it won't ever know the value. Even if we set the variable later, it'll still remain captured.
The only way for a captured variable to become uncaptured is to go out of scope and then be declared again. Also, as of C# 5, foreach loops give the iteration variable a fresh copy on each iteration, uncapturing it.
Note that although variables can be captured unknown, expressions can not. The result of an expression with a captured variable is just unknown. If we have a variable named "x" that is captured and run "y = x;", then the value of y is unknown, but it isn't captured. We can always do "y = null;" and it'll become DefinitelyNull, unlike captured variables.

When the algorithm finds ||, &&, ?: or ?? conditions, it tries to exclude operands that are known to never be executed (e.g. y in x ?? y, when x is DefinitelyNotNull). When that is not possible, it tries to "merge" the results of the expression. For instance, (x == null ? null : (x = null)) gives us a result of null and [x=DefinitelyNull] because that's what happens in both outcomes.

Query expressions are more complicated, since we need to consider the possibility of "from x in emptyEnumerable".

Semantics and Annotations

Unfortunately, null value analysis works poorly across functions.

Consider the following code:
int? Foo() {
   return 1;
}
void Bar() {
   //Assume this is the function being analysed
   var x = Foo();
   var y = x ?? 1;
}
In this case, x is never null, so the null coalescing expression is useless.
However, null value analysis doesn't know that. It'll treat "x" as an unknown value.

Fortunately, there's a way around that. Although null value analysis can't figure out what Foo returns, we can help it with annotations.

Null Value Analysis supports a subset of JetBrains.Annotations. These annotations are C# attributes that can be used to provide hints about the possible return values of methods and delegates. Additionally, they can also be applied to fields, properties, delegates and parameters (including out parameters).
[JetBrains.Annotations.NotNull]
int? Foo() {
   return 1;
}
void Bar() {
   //Assume this is the function being analysed
   var x = Foo();
   var y = x ?? 1;
}
In this modified version, the analyser will trigger the warning.

You can use JetBrains' assembly or you can write your own attribute classes. All the analyser cares about is the full class name of the attribute. If you get that right, then it doesn't matter where it comes from.

Annotations can also be used to identify a method as an assertion. Assertions are used to indicate that the program can only continue operating correctly if certain conditions are true -- it is very common to use assertion methods in unit tests.

using JetBrains.Annotations;

class TestClass
{
   [AssertionMethod]
   void Assert([AssertionCondition(AssertionConditionType.IS_TRUE)] bool condition) {
   }
   void TestMethod(string x)
   {
      Assert(x != null);
   }
}
In this case, the analyser will assume that x is not null after the assertion. Normally, the assertion method will throw an exception or abort the program when the condition fails.

Main Limitations

Throughout this post I've mentioned some of the problems of null value analysis. Still, this is important enough to deserve a dedicated section.

First of all, as I mentioned above, null value analysis does not keep track of the value of variables aside from null/not-null.
In particular, this means the analyser is clueless when it comes to recognizing patterns such as this:
bool x = true;
object o = null;
if (x) {
   o = "";
}
In this example, o will always end up with the value "", but the analyser doesn't know that. It doesn't even recognize that o can't end up null. Because it doesn't keep track of the value of x, it must assume that there's a chance the if statement won't be entered.
This limitation only applies to variables. The analyser does detect when specific expressions always return true.
I am convinced that boolean values could be added to the analyser. But the algorithm would be too slow for integers and could even occasionally get stuck in infinite loops for strings.
The reason this would happen is because every time a previously visited node is reached more than once (this happens e.g. in loops) the analyser must decide whether to re-visit the node. For that, it checks what values the variables in scope had the last times that node was visited. If the variables are the same, nothing changed, so there's no point in visiting the node again. If the variables changed, then revisiting the node is necessary for correctness.
This works for null/not-null because there are only two possible values. Even considering states such as "Maybe Null" or "Unknown, Captured by Lambda", the total number of states remains low. Moreover, many states "include" others. That is, if a node was visited for [x=MaybeNull], it won't be revisited for [x=DefinitelyNull], because [DefinitelyNull] is included in the [MaybeNull] scenario.
This means that there is an upper bound to the number of times a node can be visited (depending on the number of variables in scope and the possible values a variable can have). Most functions have few variables changing values all the time.
For booleans, everything is still okay. But for integers, the number of possible values a variable can have goes up and, with it, the number of times a node could be revisited. A lot.

int i = int.MinValue;
while (i != int.MaxValue) {
   ++i;
}

This small function triggers a sore spot in a hypothetical analyser that tracks the value of integers. The null analyser isn't too bothered by this loop. It only enters it once.
But if the analyser kept track of the value of i, then it would detect that the value of i changes every iteration. Because of that, it'd revisit the loop. But that would change the value again. So the analyser would visit the loop again. And again. And again. Around 2^32 times.

Strings have infinite-for-all-practical-purposes possible combinations, and as such would trigger infinite loops in the "smarter" analyser.

For integers and strings, a completely different algorithm would be required.

The second major limitation is the way expressions are handled when it comes to control flow.
The Null Value Analyser is based on NRefactory's built-in Control Flow Graph classes. Unfortunately, these classes are very statement-centric and do not handle expression-level control flow.
In other words, the CFGs know how to treat if statements, but not ? : expressions.

My solution was to patch a limited form of expression control flow detection on top of the statement CFG.
Whenever an if statement is encountered, the analyser handles the condition, then schedules two nodes to be visited -- the "true" case node, and the "false" case node. Two visits.
Whenever || is encountered, the analyser handles the left side, then handles the right side, then combines the result of both sides and continues. When the expression statement is over, the next node is scheduled to be visited based on the combined result. One visit.
This means the analyser is faster but less accurate for expressions than for statements.
if (x == null || y == null) {
   if (x != null) {
      // y is always null here, but the analyser doesn't know that,
      // because the top if statement only saw the combined result
      // of the || expression.
   }
}

The third major limitation is related to missing method annotations. Specifically, Null Value Analysis does not implement the complex ContractAnnotationAttribute. This attribute would allow programmers to give the analyser highly specialized hints, such as "this function never returns true if the parameter is null".
Because Null Value Analysis doesn't support any equivalent annotation, methods such as string.IsNullOrEmpty can't be supported properly.
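To make the limitation concrete, here is the kind of code that suffers from it (GetStringOrNull is a hypothetical method, and the hint in the comment is only an example of the sort of information ContractAnnotationAttribute could carry):

```csharp
string s = GetStringOrNull(); // hypothetical method that may return null
if (!string.IsNullOrEmpty(s))
{
    // A human knows s cannot be null here. But to the analyser,
    // IsNullOrEmpty is just an opaque bool-returning method, so without
    // a hint like "s:null => true" it may still warn about this access.
    Console.WriteLine(s.Length);
}
```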

The final major limitation is the lack of external annotations.
The analyser currently only recognizes annotations in the source code itself. For third-party code, the only way to add annotations is to modify the source code. This is undesirable (not to mention highly impractical) and external annotations are definitely on the wishlist, but nothing is currently available.
In short, most libraries (including standard .NET libraries) are currently hostile to the analyser.

Final Words

I hope null value analysis ends up being useful. It can certainly work reasonably well in theory and in a few test projects (notably NRefactory, MonoDevelop and xsp) but the world of code is a lot vaster than that, so it's still unknown if it'll survive under the harsh light of real-world usage. Time will tell, I guess.

This year's GSoC is over and I passed the final evaluation. Overall, I think null analysis was the hardest thing I wrote.
I enjoyed working on MonoDevelop and NRefactory and I'll certainly try to contribute again, if I have some time to spare. I actually learned a few new C# features and I got exposed to some patterns and paradigms I hadn't given much thought before.

And that's it for today.

Further Reading

Official NRefactory github repository (icsharpcode/NRefactory)
This is where the upstream version lives.

Summer of Code 2013 NRefactory github repository (mono-soc-2013/NRefactory)
This is where the students push their code to before trying to get it upstream.

Control flow graph - Wikipedia, the free encyclopedia
Wikipedia explanation of what a CFG is.

Halting problem - Wikipedia, the free encyclopedia
Undecidable problem (as in, a problem for which an exact solution is known not to exist) on computers. A perfect null value analyser would be capable of solving the halting problem and, as such, is impossible to make.

Thursday, August 8, 2013

Introducing C# Split Project

There are many good reasons to have many projects in one solution.
  1. Having more projects means having looser coupling.
  2. If each project has an assigned purpose, this tends to ensure better cohesion.
  3. Finally, if the projects are libraries, and the users of those libraries only need a few of the provided features, only the few projects with those features need to be referenced, ensuring less code is imported (which might mean smaller application packages).
Unfortunately, we don't always have many projects in a solution. Sometimes we just have a few... or just one - and we'd like to take portions of those few projects and create new ones.
Doing so currently involves manually creating the projects, adding the proper references, moving the files and praying that the coupling of our files was sufficiently loose to allow us to get away with it.

Introducing Split Project, one of the features I've implemented as part of my Google Summer of Code project.
Let's start with a basic project with only the following files:

//Program.cs
using System;

namespace HelloWorld
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            PrintService.PrintMessage ("Hello, World!");
            PrintService.Printer = new SpecialConsolePrinter ();
            PrintService.PrintMessage ("Hello again");
        }
    }
}
//IPrinter.cs
using System;

namespace HelloWorld
{
    public interface IPrinter
    {
        void PrintMessage(string message);
    }
}
//DefaultConsolePrinter.cs
using System;

namespace HelloWorld
{
    public class DefaultConsolePrinter : IPrinter
    {
        public void PrintMessage(string message)
        {
            Console.WriteLine(message);
        }
    }
}
//SpecialConsolePrinter.cs
using System;

namespace HelloWorld
{
    public class SpecialConsolePrinter : IPrinter
    {
        public void PrintMessage(string message)
        {
            Console.WriteLine("[SPECIAL] {0}", message);
        }
    }
}
//PrintService.cs
using System;

namespace HelloWorld
{
    public static class PrintService
    {
        static IPrinter printer;
        public static IPrinter Printer
        {
            get {
                if (printer == null) {
                    printer = new DefaultConsolePrinter ();
                }
                return printer;
            }
            set {
                if (value == null) {
                    throw new ArgumentNullException ("value");
                }
                printer = value;
            }
        }

        public static void PrintMessage(string message)
        {
            Printer.PrintMessage(message);
        }
    }
}
And now, we want to see if we can move some portions of this project to a new one.
We can do this manually, of course. PrintService depends on IPrinter and DefaultConsolePrinter and Program depends on PrintService and SpecialConsolePrinter. But doing this manually is no fun and particularly error-prone, especially when we start considering bigger projects.
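The core of what the dialog automates is a transitive closure over file dependencies. A minimal sketch (hypothetical helper, not the actual MonoDevelop code):

```csharp
using System.Collections.Generic;

// Given each file's direct dependencies, compute everything that must
// move along with the files the user selected.
static HashSet<string> FilesToMove(
    Dictionary<string, List<string>> dependencies,
    IEnumerable<string> selectedFiles)
{
    var result = new HashSet<string>();
    var pending = new Queue<string>(selectedFiles);
    while (pending.Count > 0) {
        string file = pending.Dequeue();
        if (!result.Add(file))
            continue; // already handled
        List<string> deps;
        if (dependencies.TryGetValue(file, out deps))
            foreach (string dep in deps)
                pending.Enqueue(dep);
    }
    return result;
}
```

For the example project, selecting only PrintService.cs would yield { PrintService.cs, IPrinter.cs, DefaultConsolePrinter.cs }.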

In this example, we chose to move PrintService to the new project. Because PrintService depends on IPrinter and DefaultConsolePrinter, both have been automatically selected.
Note how "PrintService.cs" appears in "Dependants". This means the file in question is required because PrintService.cs depends on it. There is no way to move PrintService without moving IPrinter, and attempting to uncheck IPrinter will also uncheck PrintService.
On the other hand, Program.cs, AssemblyInfo.cs and SpecialConsolePrinter.cs aren't selected because PrintService does not depend on those.
The "Ok" button appears disabled because we have not yet chosen a name for the project.

This time, we have also selected Program.cs. Program.cs depends on everything except AssemblyInfo. Note that some files (such as IPrinter) show both Program.cs and PrintService.cs in "Dependants". This is because both files directly selected by the user depend on them.

Split Projects supports an additional feature. Instead of selecting files one by one, you can select whole folders to be moved.

In this case, we have selected the "Backend" folder to be moved. All files within have been marked as directly chosen to be moved by the user.

Let's see what happens when we click the "Ok" button:


And that's it. The project compiled and ran without any errors.

Future Work

There are a few additional features and cases I'd like to try and add to the dialog in the future:
  1. Sometimes, the dialog does not work in the presence of certain useless "using Some.Namespace;" declarations in a moved file. If no class in Some.Namespace is moved, then the useless "using" declarations will become compile errors. This can be easily fixed by removing the erroneous declarations, but it can be a nuisance anyway.
  2. If a file has many type declarations, then it would be nice having the option to move only certain type declarations to the new project.
I hope this improves the lives of programmers trying to organize their solutions better.

Monday, July 8, 2013

Google Summer of Code Week 3

Last week I got less work done than in the previous ones. That said, I still have some things to show.

First of all, I finally managed to complete Extract Enum, though I ended up moving it back to MonoDevelop.

The main problem with DoGlobalOperationOn is the lack of NRefactory unit testing infrastructure. Whereas most actions and issues are easily tested and, therefore, require only minimal testing in the IDE, DoGlobalOperationOn must be manually tested all the way, which is massively inconvenient.

Also, not all extract enum features are fully available yet. It is not yet possible to choose which fields to include in the new enum due to a bug in Xwt.Gtk. Once that's fixed, I believe everything will work as expected.

I also investigated ILSpy to see how easy it would be to reuse code from it to implement a new code action: Convert LINQ fluent syntax to LINQ query. This would essentially be the opposite action of what I made last week.

The portion of ILSpy I studied was the decompiler. The decompiler takes .NET code (e.g. an executable or class library) and generates C# code back from it. It includes a variety of features including the ability to convert generated LINQ code back to LINQ queries.
The decompiler of LINQ queries seems to work in two phases: First, it generates simple queries, and then combines those queries:
list.Where(x => x > 0).Select(y => y * 2);
//After phase 1
from y in from x in list
          where x > 0
          select x
select y * 2;
//After phase 2
from x in list
where x > 0
select x into y
select y * 2;

The code for the first phase (introduce queries) seems pretty solid, but the code for the second phase is far less so. It depends heavily on the particular output of the decompiler and, as such, is unsuitable for use in user code.

This week, I'll try to implement global operation unit testing and some other actions.

Monday, July 1, 2013

Code reuse and Global Operations - Google Summer of Code Week 2

This is the second week of my Google Summer of Code project. Last week, I implemented only two small and simple code actions. This week, I got a bit more ambitious.
Technically, some of this work was finished after midnight on Sunday; I'll leave it to the reader to decide if that's "cheating".

Null coalescing and pattern matching

The first action I'll be talking about is ConvertIfToNullCoalescing action.
//This action takes this:
object x = Foo();
if (x == null) x = Bar();

//And replaces it by this:
object x = Foo() ?? Bar();
This is a very simple action - similar to the stuff I wrote last week.
What changes this time is the heavy use of pattern matching.

With pattern matching, it's possible to define a certain pattern (e.g. if (someExpression) doSomething();) and check if a node matches that pattern. For instance, the pattern if (<Any>) { <Any> } recognizes if (true) { return 0; }, but not while (true) {} nor if (true) return 0;.

It's also possible to have named nodes. With named nodes, it's possible to know what matched certain parts of the pattern. For instance, using pattern if (<condition> = <Any>) {} to match if (true) {}, named nodes tell us that the condition is true.

Of course, this is a very simple example. Pattern matching gets more useful in more complex cases.
Also, note that patterns are created as normal C# objects - the string representation I used above was merely intended as an example.
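For the curious, here's roughly what building such a pattern looks like with NRefactory's pattern classes. I'm writing the constructors from memory, so treat this as a sketch rather than exact API usage:

```csharp
using System.Linq;
using ICSharpCode.NRefactory.CSharp;
using ICSharpCode.NRefactory.PatternMatching;

// Pattern: if (<condition> = <Any>) { <Any> }
// Pattern nodes convert implicitly to the AST node types they stand for.
var pattern = new IfElseStatement(
    new AnyNode("condition"),
    new AnyNode());

Match match = pattern.Match(someIfStatement);
if (match.Success)
{
    // Retrieve whatever matched the named "condition" slot.
    Expression condition = match.Get<Expression>("condition").Single();
}
```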

Dispose and IDisposable

Another simple check I wrote searches for types that have a void Dispose() method but don't implement IDisposable. This issue detects and fixes the problem for classes.
public class Test
{
    public void Dispose() {}
}
//Is converted to
public class Test : System.IDisposable
{
    public void Dispose() {}
}
The issue is disabled for interfaces. I initially implemented that feature but ultimately decided to remove it due to a problematic edge case with explicit implementation:
interface IA {
    void Dispose();
}
class B : IA {
    void IA.Dispose() {}
}
//Let's say the action converted this to:
interface IA : System.IDisposable
{
}
class B : IA {
    void IA.Dispose() {} //ERROR here
}
The only way for this action to be effective for explicit interfaces would be to use a global operation, and I only figured out how to do this effectively on Sunday. More on global operations later.

LINQ Query and LINQ Fluent

One of the best features of .NET and C# is LINQ.
LINQ brings a declarative/functional taste to C# and greatly improves the readability of list operations - like SQL but better.

There are many ways to use LINQ in C#.
  1. One could use the methods of System.Linq.Enumerable, such as Enumerable.Where(myEnumerable, SomeFunction). This could be acceptable if there were no better way of doing things -- which there is;
  2. Use those same methods from Enumerable but in a different way. Those are extension methods and, therefore, can be used as myEnumerable.Where(SomeFunction). This is much better than the first option, especially because it's a Fluent Interface and therefore method calls can be chained -- myEnumerable.Where(Foo).Select(Bar)
  3. The last option is the query syntax. Instead of using the method calls, the programmer can use special syntax designed specifically for this purpose. More on this later
Query syntax is translated by the C# compiler to fluent syntax.
var seq = from item in myEnumerable
          where Foo(item)
          select Bar(item);

//Is converted to:
var seq = myEnumerable.Where(item => Foo(item)).Select(item => Bar(item));
Because query syntax is blindly converted to method calls, it can be used for a lot more than just Enumerables. Anything with the correct methods (e.g. Where and Select) can be used with query syntax; it doesn't need to be an extension method - a normal instance method works just fine.

And, in fact, the argument doesn't even need to be of type System.Func.

LINQ is not just used for enumerables. Some of the best uses of LINQ are related to databases.
int idToFind = 10;
var userWithId = (from user in database.Users
                  where user.Id == idToFind
                  select new { user.Name, user.Score }).Single();
Console.WriteLine("User {0} has {1} point(s).", userWithId.Name, userWithId.Score);
//The query is translated to
int idToFind = 10;
var userWithId = database.Users.Where(user => user.Id == idToFind)
                     .Select(user => new { user.Name, user.Score }).Single();
Console.WriteLine("User {0} has {1} point(s).", userWithId.Name, userWithId.Score);
In this case, database.Users is not an enumerable. It's an instance of IQueryable.

Instances of IQueryable actually read the contents of the passed lambdas -- they use Expression Trees. And it works just fine.
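The distinction between delegates and expression trees fits in a few lines (a sketch; nothing here is specific to NRefactory):

```csharp
using System;
using System.Linq.Expressions;

// A Func is compiled code; an Expression<Func> is a data structure
// describing the lambda, which a query provider can inspect and translate.
Func<int, bool> compiled = x => x > 0;
Expression<Func<int, bool>> tree = x => x > 0;

Console.WriteLine(compiled(5));   // simply executes the code
Console.WriteLine(tree.Body);     // prints the inspectable structure
// An IQueryable provider walks tree.Body and turns it into, say, SQL.
```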

So which is best? Query syntax or fluent syntax? It depends.
In some cases, fluent syntax is the simplest and most compact solution.
In other cases, query syntax is the most readable option - since joins and other complex queries do not involve complex method calls with anonymous types.

The action I implemented converts query syntax to fluent syntax. It should be capable of handling any query. So, whenever a query would be best written with fluent syntax, there's no need to manually convert it (and risk introducing new bugs!) - just let the code action do it.

This code action turned out to require less effort than I expected, since NRefactory already had a class to do what I needed - QueryExpressionExpander. Sadly, I didn't know about that class, so I ended up reading the relevant portion of the C# specification and writing my own code. Later, when I found out about QueryExpressionExpander, I rewrote a significant portion of my own work to use it - fixing a bug in NRefactory and adding a feature to QueryExpressionExpander along the way.

As it turns out, the C# specification is surprisingly readable. I was expecting an impenetrable wall of technical text but instead I was greeted by a text with lots of examples that is simple to understand.

Convert to Enum - Global Operations

The last action I wrote this week is again a tale of finding the right class to use.

Anyway, the story of this action started with an old bug report (and by old I mean half a year old). A user with lots of Java code automatically converted to C# had lots of fields like public const int SOME_PREFIX_SOME_NAME and wanted to convert all those fields to enumerations. Code actions to the rescue!
//We have this
public const int SOME_PREFIX_A = 1;
public const int SOME_PREFIX_B = 2;
//We want this
public enum SOME_PREFIX : int
{
    A = 1,
    B = 2
}
I think it's not an overstatement to say this was the hardest action to implement so far. And I'm not even sure I'm done yet.

The main problem with this action is that references to SOME_PREFIX_A become references to SOME_PREFIX.A. This means changing files besides the one with the field declaration.
It's not a simple find-and-replace either. We don't want to change strings nor comments. And we don't want to replace different fields with the same name.
As if things weren't bad enough, enumerations aren't implicitly converted to integers. The "perfect" version of this action would also replace method parameter types. This would soon become tremendously complex.
I opted to develop a more limited version of this conversion. Instead of replacing types everywhere, this just adds a type cast to the enum underlying type. So instead of SOME_PREFIX.A, we'd have ((int) SOME_PREFIX.A).
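To illustrate the limited conversion at a call site (Foo is a made-up method, not part of the action):

```csharp
// Before the conversion:
public const int SOME_PREFIX_A = 1;
// ...
Foo(SOME_PREFIX_A); // Foo takes an int

// After the limited conversion - the inserted cast keeps the call site
// compiling without having to change Foo's parameter type:
Foo((int) SOME_PREFIX.A);
```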

So, back to global operations. A global operation is an operation that changes files other than the current one. NRefactory has a method just for that: script.DoGlobalOperationOn.

Now, it would have been just fine if I had simply used that method and been done with it. Unfortunately, I only discovered it (and what it really did) very late. Before finding it, I had to go on a tour through the MonoDevelop source code. While the information I learned helped me understand MonoDevelop better, I went full circle and ended up where I started - in NRefactory.

This code action is still not fully tested, so it might not work perfectly in some edge cases yet.

Conclusion

This week's work was harder than last week's. Still, I managed to get most of my planned work done.
Additionally, I'm gradually learning about new APIs of NRefactory, so I expect things to get easier as I get more experienced.

The main problem this week was spending time figuring out how to do things, only to find out there was an API just for that. Fortunately, the number of APIs in NRefactory is finite, so I'm hoping this situation won't last forever.

See Also

Google Summer of Code - Week 1
Last week's post

mono-soc-2013/NRefactory
As usual, my work is available in the mono-soc-2013 NRefactory github repository. Check the branches with names starting with "luiscubal-". The exception is code that's been merged back into the master branch of the official repository. Those branches have been deleted.

Download C# Language Specification 5.0 from Official Microsoft Download Center
The spec also comes bundled with Visual Studio (except Express). So if you already have it installed, just go to the VC#\Specifications\1033 folder. The exact location of this folder will depend on where you decided to install Visual Studio, but Program Files is a good place to start the search.

Sunday, June 23, 2013

Google Summer of Code - Week 1

First of all, the good news: I have been accepted into Google Summer of Code 2013. I'll be working with MonoDevelop and NRefactory to add new source analysis features.
I'll start by explaining what these projects are, then I'll show my work this week and finish with a commentary of what I saw and learned this week.

An Introduction

MonoDevelop is a cross-platform IDE that supports several programming languages. Among those languages, the most important one is C#.

NRefactory is a C# library that allows applications to parse and manipulate C# source code (it also supports a few other languages, but my project won't cover those).
With NRefactory, it is possible to analyze source code for patterns and programmatically modify the code. This can be extremely helpful when refactoring.

NRefactory includes code actions and code issues.

Actions are simple ways to modify code to something slightly different. For instance, there is a code action to turn a variable declaration with type inference into a variable declaration with the explicit type.

//This:
var x = new List<int>();
//Becomes:
List<int> x = new List<int>();

Issues detect anti-patterns: Code that is likely hard to maintain and a potential source of bugs, or just poor taste - often, issues can be automatically fixed.
For instance, there is an issue that detects when a variable is assigned to itself.

class SomeType
{
    int x;

    public SomeType(int x) {
        x = x; // <-- This is useless, the programmer probably meant this.x = x
    }
}

Actions and issues are not the only features of NRefactory. It can be used to implement code completion, format code and pretty much anything that requires code manipulation.

MonoDevelop integrates NRefactory to provide a first-class programming experience. Right-clicking the code shows the applicable issues and actions. Issues are harder to ignore than actions, since the IDE will show those more prominently - bad code usually appears with a curly green underline.

Before proceeding, it might be a good idea to read Using NRefactory for analysing C# code on CodeProject. This post will be easier to understand if you do.

My Project - Week 1

This summer, I will be adding new source analysis features to NRefactory and MonoDevelop. This is officially my first week of work. Because I've been busy with exams, I haven't been able to fully dedicate myself to this project. But still, I did manage to get some decent work done.

This week, I implemented one code issue, one code action and a bug fix.

lock(this) and MethodImplOptions.Synchronized


The code issue detects a synchronization anti-pattern.

Proper synchronization is a critical - yet hard - part of multi-threading. When synchronization is done right, threads work correctly. Poorly done synchronization can cause lots of headaches (see deadlocks and race conditions).

Because synchronization is so error-prone, it should be as surprise-free as possible. Synchronization code should be straightforward and have as few "gotchas" as possible - ideally, none.

If it's not obvious what code is locking what mutexes, then the code really isn't straightforward. If some random programmer using a library can lock a mutex used by the library and the original library author doesn't know about it - that can be a huge problem.

In C#, locking is done with the lock statement. Here's an example of what it looks like:

class SynchronizedExample
{
    string someString = string.Empty;
    public string SomeString {
        get { return someString; }
    }

    public void AddToString(string stringToAdd) {
        lock (this) { //this line is bad, we'll soon see why
            someString += stringToAdd;
        }
    }
}
The code above ensures there are no race conditions when appending data to a string. The this object is used as a mutex so two different objects can access their own strings simultaneously in different threads, but multiple threads can't append to the same string at the same time - if they try, one of them is forced to wait.

lock(this) is not a good idea.

The reason is that anyone with a SynchronizedExample instance can lock on it.
var example = new SynchronizedExample();
lock (example) { ... }

In practice, this means the original author of SynchronizedExample can never really know when the mutex is locked. This code is a ticking bomb. It isn't blowing up, but it could at any time.

Ticking bombs are good candidates for code issues. The new LockThisIssue detects this problem and suggests a fix. The code above is underlined and right-clicking shows a message warning the programmer that lock(this) is bad. Clicking on auto-fix turns it into this:

class SynchronizedExample
{
    string someString = string.Empty;
    public string SomeString {
        get { return someString; }
    }

    object locker = new object();
    public void AddToString(string stringToAdd) {
        lock (locker) {
            someString += stringToAdd;
        }
    }
}

Well, the comment wouldn't be deleted, but other than that, everything is fine now. Synchronization remains correct, but now it's gotcha-free. No unpleasant surprises waiting to happen.

This is not all this issue corrects.

.NET has an attribute called MethodImplAttribute. This attribute exposes several features. One of those features is synchronization.
Exactly why .NET has this "feature" is unclear but, in short, it is roughly equivalent to adding a lock (this) statement that includes the entire method.
It sucked before, it still sucks now. So the issue detects the problem, removes or fixes the attribute and adds the proper lock statement.
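For reference, this is what the attribute-based variant looks like; the runtime-provided locking behaves like the lock described in the comment:

```csharp
using System.Runtime.CompilerServices;

class SynchronizedExample
{
    [MethodImpl(MethodImplOptions.Synchronized)]
    public void AddToString(string stringToAdd)
    {
        // The runtime wraps this entire method in the equivalent of
        // lock (this) - or lock (typeof(SynchronizedExample)) for static
        // methods - so it shares all the problems of explicit lock (this).
    }
}
```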

Auto-Linq Sum


As I said before, actions allow transforming code. The code doesn't have to be a red flag. Actions can just show an alternative. Use it if you like the alternative, ignore it if you don't.

The action I implemented converts some foreach loops to LINQ expressions.

int result = 0;
foreach (int x in intList) {
    result += x;
}
//Is converted to:
int result = intList.Sum();


It also supports if/else statements in the loop, multiple statements and more complex expressions.

int result = 0;
foreach (int x in intList) {
    if (x > 0) {
        result += x * 2;
    }
}
//Is converted to:
int result = intList.Where(x => x > 0).Sum(x => x * 2);
Using a loop is not wrong, so it's not a code issue. But I believe loop-to-LINQ is still a refactoring tool worth having.

Final Notes

Both MonoDevelop and NRefactory are interesting projects and I expect my work to make it to future versions. In fact, the two major features I described in this post have already been merged into master.

Here are a few of the things I've found:

git submodules: Essentially, when Project A depends on Project B, the git repository for Project A can include the git repository for Project B as a submodule. The submodules have their own separate commits and branches - they are essentially independent since they are a separate repository. This seems like a good way to handle dependencies.
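A minimal local demonstration of that workflow (all paths here are throwaway examples; the protocol.file.allow override is only needed on newer git versions that restrict local-path submodule URLs):

```shell
# Create a standalone "dependency" repository.
mkdir -p /tmp/submodule-demo
cd /tmp/submodule-demo
git init -q libB
git -C libB -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Create the main project and add libB as a submodule.
git init -q projA
cd projA
git -c protocol.file.allow=always submodule add /tmp/submodule-demo/libB libB

# The submodule is pinned to a specific commit of libB.
git submodule status
```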

Unit Tests: NRefactory has the most extensive unit test suite I've ever seen and it's pretty cool. So far, I haven't had many opportunities to see unit tests in action. Most of the unit tests I've seen were really just examples, but NRefactory is a real-world case that really proves how much tests can help.
For instance, when testing a code action, the programmer can say which action to test, the original source code and the intended output. If they match, the test passes. Otherwise, the test fails.
One can just add a bunch of normal cases, a few harder ones (for edge cases) and tests to detect potential problems. NRefactory was the first time I ever used Test-Driven Development in a real-world scenario. I've found myself writing tests when the code action was merely a stub and I'm satisfied with the results.
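Here's roughly the shape of one of these tests. I'm reconstructing it from memory (the test base class's Test helper and the $ caret marker), and the class names are illustrative rather than exact:

```csharp
using NUnit.Framework;

[TestFixture]
public class ConvertIfToNullCoalescingTests : ContextActionTestBase
{
    [Test]
    public void TestSimpleCase()
    {
        // $ marks where the user's caret is when the action is invoked.
        // First string: input source. Second string: expected output.
        Test<ConvertIfToNullCoalescingAction>(@"
class TestClass
{
    void Method()
    {
        object x = Foo();
        $if (x == null) x = Bar();
    }
}", @"
class TestClass
{
    void Method()
    {
        object x = Foo() ?? Bar();
    }
}");
    }
}
```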

Linking: The API does not always behave the way I'd expect it to behave. I've been bitten in a few ways since I started, though I'm getting used to the details quickly.
One example is linking: script.Link is a neat feature. It tells the IDE that some identifiers are actually one and the same.
If an action creates lots of statements in the code and they all refer to the same variable, Link lets the user choose a name and the whole code will be fixed to use that name.
Now, Link is not supposed to be applied to code that's been removed. This is perfectly acceptable (why change what no longer exists?), but nothing in the code detects when it happens accidentally. It is what I'd call "undefined behavior".
The tests just silently ignore it. Everything will seem to work except for some puzzling extra whitespace that shouldn't exist. But in MonoDevelop, that same code will behave differently - and generate code that's just wrong. Of course, the solution is not to use it that way, but again that does not make debugging easy when it happens by accident.
See the bug report (pictures included).

Edge cases: C# is a big language with lots of features. This is great for the programmer, but it does mean that tools such as NRefactory must be tested in lots of edge cases. In fact, pretty much all bugs I found in NRefactory (the ones that weren't my fault, that is) were related to features the original author just didn't think of when writing the code: type inference, anonymous types, extension methods, generics and the many things that can be on the right side of a lambda expression (is it a statement? An expression? Is it a method with void return type?).

This concludes my work for the week.

Further Reading

Official NRefactory github repository (icsharpcode/NRefactory)
This is where the code for NRefactory is.

Mono Summer of Code 2013 NRefactory github repository (mono-soc-2013/NRefactory)
The code for this year's mono summer of code projects. My branches are prefixed with "luiscubal-" so they should be easy to find. Once my pull requests are accepted (or no longer applicable), my corresponding branch is deleted, so the work I described here is found on the official NRefactory repository.

Mike's MonoDevelop Blog: How To: Write a C# Issue Provider
This post explains how to create a new code issue for NRefactory

Source Analysis Improvements in MonoDevelop - Project details
My summer of code project page. Very little to see here, move along. (I should probably improve it... one of these days)

.net - What does MethodImplOptions.Synchronized do? - Stack Overflow
A very brief and oversimplified explanation of what MethodImplOptions.Synchronized does.