Wednesday, October 16, 2013

Null Value Analysis

One of the features I implemented in NRefactory this summer of code was null value analysis. I'll describe here what it is, what it is useful for and what its limitations are.

Required Knowledge

  1. You'll need to understand basic programming concepts, such as how if statements work;
  2. It will help if you understand what directed graphs and control flow graphs (CFGs) are;

The Problem Null Analysis Tries to Solve

Null Value Analysis is a tool (meant to be used by code issues) that tries to predict if a given value is null or not.
Knowing if a value is null is necessary to detect some redundancies, null reference exceptions and unreachable code.

object o = null;
if (o == null) {
   //The if statement is redundant, this is always executed
} else {
   //Unreachable code. This never happens.
}
string s = o.ToString(); //NullReferenceException here
//Since the previous line throws an exception, the following
// code is unreachable.
Console.WriteLine("Unreachable");
Code issues can use this information to warn the user about potential bugs and redundancies.
There are currently two code issues that use null analysis: ConstantNullCoalescingConditionIssue (that detects code like always_null ?? something) and UseOfMemberOfNullReference (that detects always_null.SomeMember). There are other issues that could benefit from this, but that feature was not yet implemented.
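As a sketch (with hypothetical variable names), these are the patterns the two issues look for:

```csharp
using System;

class Example
{
    public static void Main()
    {
        object alwaysNull = null;

        // ConstantNullCoalescingConditionIssue: the left operand is known
        // to be null, so the ?? always falls through to the right operand.
        object value = alwaysNull ?? "fallback";
        Console.WriteLine(value); // prints "fallback"

        // UseOfMemberOfNullReference: dereferencing a value known to be
        // null is guaranteed to throw a NullReferenceException.
        // string s = alwaysNull.ToString();
    }
}
```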

Overview

Null analysis is based on the concept of a Control Flow Graph (CFG): a directed graph with a node for each statement and an edge for each possible execution path between them. Consider the CFG for the above method.
What the null value analyser does is traverse this graph and figure out how each statement changes the local variables.
Let's see what happens:
  1. This method has no parameters, so it doesn't need to set up any parameter variables;
  2. It enters the method;
  3. It declares a new variable called "o" which is "definitely null";
  4. It checks the if (o == null) condition. Because both "o" and "null" are known to be "definitely null", the analyser concludes that the result of the expression is "known to be true";
  5. Because the expression is known to be true, it visits the if block, but not the else block;
  6. It leaves the if block (since it's empty);
  7. It checks the expression o.ToString(). Since ToString is neither an extension method nor System.Nullable.HasValue, we know that for this statement to succeed, o must not be null. And yet, o is known to be null. So the null analysis considers this to throw an exception (and indeed it does: a NullReferenceException);
  8. The node is not in a try block, so we end the analysis.
Notice how nodes #8, #9 and #10 (the else block and the code after the exception) are never even visited.

Now, let's see how null analysis handles loops.

object o = null;
while (o == null) {
   Console.WriteLine(o == null ? "Foo" : "Bar");
}
//The end of the loop is never reached 
It can't just keep visiting the loop over and over again. It's an infinite loop, so the analysis would be infinite as well. Instead, it only revisits the loop if it hasn't already visited it for the current set of variable states. But let's see how it behaves in this particular case:
  1. object o is defined as being "definitely null";
  2. Since o is definitely null, and the literal null is (by definition) definitely null, the loop condition is known to be true. The analyser hasn't visited the loop yet, so it does so for the state [o=DefinitelyNull].
  3. It visits Console.WriteLine. Because o is known to be null, the expression will always return "Foo".
  4. The loop ends. Because all variables (just o, in this case) are just the way they were when the loop began [o=DefinitelyNull], there's no need to visit it again.
  5. Note that the code after the loop is never visited, because the condition is always true.
Now consider the following:
object o = null;
object p = null;
while (o == null) {
   Console.WriteLine(p == null ? "Foo" : "Bar");
   p = 1;
}
//The end of the loop is never reached
In this case, each node in the loop is visited twice. The first time for "p is definitely null" and the second time for "p is definitely not null".

The analyser keeps a dictionary (map) containing the local variable states. For instance, in the example above, the contents of the loop are visited once with the state [o=DefinitelyNull p=DefinitelyNull] and a second time with the state [o=DefinitelyNull p=DefinitelyNotNull].
When the node is visited, the analysis tries to figure out what the state of the variables will be after it.

For instance, when the analyser finds the expression p = 1, it modifies the state dictionary to have [p=DefinitelyNotNull] (while keeping the remaining variables unchanged).

Expressions

Initially, the plan was to have a pure null value analysis. That is, everything was treated as being null, not-null or perhaps-null-not-really-sure.
For variables, this turned out to be reasonable in many cases. For expressions, it wasn't.
The problem is expressions such as (x == null ? x : 1). What does that do? Well, if x is null, it returns null. If x is not null, then it returns a value that isn't null. More to the point, the behavior depends on whether "x == null" is true or false.
The analyser can only handle an expression like this if it keeps track of what's true and what's false. As such, I eventually decided that even if only the null status of variables should be stored in the state, keeping track of the boolean value of expressions was essential.
Not only that, the analyser detects that expressions such as "x == null" can only be true if x is null. So even if the value of x is not known at all, it knows that the code in if(x == null) will only be executed if x is null. So the code embedded in the if statement will have [x=DefinitelyNull] even if x was not previously known.
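For instance, in this sketch (a hypothetical method) the analyser narrows x inside each branch even though it knows nothing about x on entry:

```csharp
using System;

class Narrowing
{
    public static string Describe(object x)
    {
        // On entry, the value of x is completely unknown.
        if (x == null) {
            // "x == null" was true, so here the state is [x=DefinitelyNull].
            return "null";
        }
        // The condition was false, so here x is DefinitelyNotNull
        // and the dereference below is known to be safe.
        return x.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Describe(null)); // prints "null"
        Console.WriteLine(Describe(42));   // prints "42"
    }
}
```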

Another problematic case are lambdas. In C#, lambdas are capable of capturing variables.
Consider this code:

object o = null;
Action action = () => { o = 1; };
Foo (action);
//Some more code goes here
So what will be the value of o later in this function? We don't know. It will no longer be null if action has been executed. But when will action be executed? We don't know. Maybe immediately, maybe after another method is called, maybe randomly, from another thread. Again, we don't know.

In these cases, the analyser pretty much gives up and classifies o as captured unknown. Not only does it not know the value, it won't ever know the value. Even if we set the variable later, it'll still remain captured.
The only way for a captured variable to become uncaptured is to go out of scope and then be declared again. Also, C# 5 foreach loops uncapture the iteration variable at the start of each iteration, since each iteration gets a fresh variable.
Note that although variables can be captured unknown, expressions cannot. The result of an expression with a captured variable is just unknown. If we have a variable named "x" that is captured and run "y = x;", then the value of y is unknown, but it isn't captured. We can always do "y = null;" and it'll become DefinitelyNull, unlike captured variables.
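A minimal sketch of the difference (the state comments describe what the analyser would infer):

```csharp
using System;

class Captured
{
    public static void Main()
    {
        object x = null;
        Action action = () => { x = 1; }; // x is now "captured unknown"

        object y = x;  // y is merely Unknown: a copy, not captured
        y = null;      // y becomes DefinitelyNull again

        x = null;      // x stays captured unknown: the lambda could
                       // run at any moment and overwrite this

        action();
        Console.WriteLine(x); // prints "1"
    }
}
```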

When the algorithm finds ||, &&, ?: or ?? conditions, it tries to exclude operands that are known to never be executed (e.g. y in x ?? y, when x is DefinitelyNotNull). When that is not possible, it tries to "merge" the results of the expression. For instance, (x == null ? null : (x = null)) gives us a result of null and [x=DefinitelyNull] because that's what happens in both outcomes.

Query expressions are more complicated, since we need to consider the possibility of "from x in emptyEnumerable".
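A sketch of why: the body of a query clause may execute zero times, so assignments made inside it cannot be relied on afterwards.

```csharp
using System;
using System.Linq;

class Queries
{
    public static void Main()
    {
        var emptyEnumerable = Enumerable.Empty<object>();
        object seen = null;

        // The select clause never runs for an empty source, so the
        // analyser cannot treat "seen" as assigned after the query.
        var results = (from x in emptyEnumerable
                       select (seen = x)).ToList();

        Console.WriteLine(results.Count); // prints "0"
        Console.WriteLine(seen == null);  // prints "True"
    }
}
```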

Semantics and Annotations

Unfortunately, null value analysis works poorly across functions.

Consider the following code:
int? Foo() {
   return 1;
}
void Bar() {
   //Assume this is the function being analysed
   var x = Foo();
   var y = x ?? 1;
}
In this case, x is never null, so the null coalescing operator is useless.
However, null value analysis doesn't know that. It'll treat "x" as an unknown value.

Fortunately, there's a way around that. Although null value analysis can't figure out what Foo returns, we can help it with annotations.

Null Value Analysis supports a subset of JetBrains.Annotations. These annotations are C# attributes that provide hints about the possible return values of methods and delegates. They can also be applied to fields, properties and parameters (including out parameters).
[JetBrains.Annotations.NotNull]
int? Foo() {
   return 1;
}
void Bar() {
   //Assume this is the function being analysed
   var x = Foo();
   var y = x ?? 1;
}
In this modified version, the analyser will trigger the warning.

You can use JetBrains' assembly or you can write your own attribute classes. All the analyser cares about is the full class name of the attribute. If you get that right, then it doesn't matter where it comes from.
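For example, a hand-rolled stand-in like this sketch would work just as well as JetBrains' own assembly, because only the full name matters:

```csharp
// A hand-written attribute the analyser treats exactly like the one
// shipped by JetBrains, since it only matches the full class name
// "JetBrains.Annotations.NotNullAttribute".
namespace JetBrains.Annotations
{
    [System.AttributeUsage(System.AttributeTargets.All)]
    public sealed class NotNullAttribute : System.Attribute
    {
    }
}
```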

Annotations can also be used to identify a method as an assertion. Assertions are used to indicate that the program can only continue operating correctly if certain conditions are true -- it is very common to use assertion methods in unit tests.

using JetBrains.Annotations;

class TestClass
{
   [AssertionMethod]
   void Assert([AssertionCondition(AssertionConditionType.IS_TRUE)] bool condition) {
   }
   void TestMethod(string x)
   {
      Assert(x != null);
   }
}
In this case, the analyser will assume that x is not null after the assertion. Normally, the assertion method will throw an exception or abort the program when the condition fails.

Main Limitations

Throughout this post I've mentioned some of the problems of null value analysis. Still, this is important enough to deserve a dedicated section.

First of all, as I mentioned above, null value analysis does not keep track of the value of variables aside from null/not-null.
In particular, this means the analyser is clueless when it comes to recognizing patterns such as this:
bool x = true;
object o = null;
if (x) {
   o = "";
}
In this example, o will always end up with the value "", but the analyser doesn't know that. It doesn't even recognize that o can't end up null: because it doesn't keep track of the value of x, it must assume that there's a chance the if statement won't be entered.
This limitation only applies to variables. The analyser does detect when specific expressions always return true.
I am convinced that boolean values could be added to the analyser. But the algorithm would be too slow for integers and would even occasionally get stuck in infinite loops for strings.
The reason this would happen is because every time a previously visited node is reached more than once (this happens e.g. in loops) the analyser must decide whether to re-visit the node. For that, it checks what values the variables in scope had the last times that node was visited. If the variables are the same, nothing changed, so there's no point in visiting the node again. If the variables changed, then revisiting the node is necessary for correctness.
This works for null/not-null because there are only two possible values. Even considering states such as "Maybe Null" or "Unknown, Captured by Lambda", the total number of states remains low. Moreover, many states "include" others. That is, if a node was visited for [x=MaybeNull], it won't be revisited for [x=DefinitelyNull], because [DefinitelyNull] is included in the [MaybeNull] scenario.
This means that there is an upper bound to the number of times a node can be visited (depending on the number of variables in scope and the possible values a variable can have). Most functions have few variables changing values all the time.
For booleans, everything is still okay. But for integers, the number of possible values a variable can have goes up and, with it, the number of times a node could be revisited. A lot.

int i = int.MinValue;
while (i != int.MaxValue) {
   ++i;
}

This small function triggers a sore spot in a hypothetical analyser that tracks the value of integers. The null analyser isn't too bothered by this loop. It only enters it once.
But if the analyser kept track of the value of i, then it would detect that the value of i changes every iteration. Because of that, it'd revisit the loop. But that would change the value again. So the analyser would visit the loop again. And again. And again. Around 2^32 times.

Strings have infinite-for-all-practical-purposes possible combinations, and as such would trigger infinite loops in the "smarter" analyser.

For integers and strings, a completely different algorithm would be required.

The second major limitation is the way expressions are handled when it comes to control flow.
The Null Value Analyser is based on NRefactory's built-in Control Flow Graph classes. Unfortunately, these classes are very statement-centric and do not handle expression-level control flow.
In other words, the CFGs know how to treat if statements, but not ? : expressions.

My solution was to patch a limited form of expression control flow detection on top of the statement CFG.
Whenever an if statement is encountered, the analyser handles the condition, then schedules two nodes to be visited -- the "true" case node, and the "false" case node. Two visits.
Whenever || is encountered, the analyser handles the left side, then handles the right side, then combines the result of both sides and continues. When the expression statement is over, the next node is scheduled to be visited based on the combined result. One visit.
This means the analyser is faster but less accurate for expressions than for statements.
if (x == null || y == null) {
   if (x != null) {
      // y is always null here, but the analyser doesn't know that,
      // because the top if statement only saw the combined result
      // of the || expression.
   }
}

The third major limitation is related to missing method annotations. Specifically, Null Value Analysis does not implement the complex ContractAnnotationAttribute. This attribute would allow programmers to give the analyser highly specialized hints, such as "this function never returns true if the parameter is null".
Because Null Value Analysis doesn't support any equivalent annotation, methods such as string.IsNullOrEmpty can't be supported properly.
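A sketch of what is lost (hypothetical method):

```csharp
using System;

class ContractLimitation
{
    public static int Measure(string s)
    {
        if (string.IsNullOrEmpty(s))
            return 0;
        // A human knows s cannot be null here. Without a
        // ContractAnnotation-style hint on IsNullOrEmpty, the analyser
        // still treats s as possibly null and cannot prove that the
        // dereference below is safe.
        return s.Length;
    }

    public static void Main()
    {
        Console.WriteLine(Measure(null));  // prints "0"
        Console.WriteLine(Measure("abc")); // prints "3"
    }
}
```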

The final major limitation is the lack of external annotations.
The analyser currently only recognizes annotations in the source code itself. For third-party code, the only way to add annotations is to modify the source code. This is undesirable (not to mention highly impractical) and external annotations are definitely on the wishlist, but nothing is currently available.
In short, most libraries (including standard .NET libraries) are currently hostile to the analyser.

Final Words

I hope null value analysis ends up being useful. It can certainly work reasonably well in theory and in a few test projects (notably NRefactory, MonoDevelop and xsp) but the world of code is a lot vaster than that, so it's still unknown if it'll survive under the harsh light of real-world usage. Time will tell, I guess.

This year's GSoC is over and I passed the final evaluation. Overall, I think null analysis was the hardest thing I wrote.
I enjoyed working on MonoDevelop and NRefactory and I'll certainly try to contribute again, if I have some time to spare. I actually learned a few new C# features and I got exposed to some patterns and paradigms I hadn't given much thought to before.

And that's it for today.

Further Reading

Official NRefactory github repository (icsharpcode/NRefactory)
This is where the upstream version lives.

Summer of Code 2013 NRefactory github repository (mono-soc-2013/NRefactory)
This is where the students push their code to before trying to get it upstream.

Control flow graph - Wikipedia, the free encyclopedia
Wikipedia explanation of what a CFG is.

Halting problem - Wikipedia, the free encyclopedia
Undecidable problem (as in, a problem for which an exact solution is known not to exist) on computers. A perfect null value analyser would be capable of solving the halting problem and, as such, is impossible to make.
