In the last installment of CLR Inside Out, I stressed that to reliably create high-performance programs, you need to understand the performance of the individual components you use early in the design process (see msdn2.microsoft.com/magazine/cc424899). To do this you need performance data. Thus, measurement has to be an integral part of the design process.
At that time I also introduced a tool called MeasureIt that makes it very easy to create new benchmarks so you can quickly gather the data you need to make good design decisions. The raw numbers that a tool like MeasureIt provides are critically important, but it is also important to develop a general understanding of what the numbers mean. This understanding allows you to predict the performance of something before you actually measure it. That is what I'll discuss here.
If you have not already downloaded the MeasureIt tool, I strongly encourage you to do so now. It's in the code download for this column, found on the MSDN® Magazine Web site, and it consists of a single EXE file. Running it will generate a Web page displaying the results of running a suite of benchmarks. Additional documentation can be accessed after installation by running the following command:
measureIt /usersGuide
MeasureIt comes complete with its source code, which can be conveniently unpacked using the /edit qualifier. This makes adding a new benchmark as easy as writing one or two lines of code (and supplying the code to be timed). See the user's guide for more on exactly how to do this.
MeasureIt benchmarks are associated with different performance areas, which can be indicated on the command line when the tool is launched. By default (with no command-line parameters), MeasureIt runs a suite of 50 or so benchmarks covering a broad variety of basic Microsoft® .NET Framework runtime operations. Abbreviated sample output is show in Figure 1.
MeasureIt runs each benchmark 10 times and computes statistics based on the results. The reported values are scaled so that a single call to an empty method takes one unit of time. For example, because Figure 1 shows the median time of allocating a small object to be 5.06, you know it typically takes only a bit more than five times as long to allocate a small object as it does to call a method. But that is not the whole story. Notice that the maximum time for object allocation is more than 51 units. Thus there are times when it can take much longer than the average case. In fact, if the benchmark was unlucky enough to force a full garbage collection of the heap, it might spend considerably more time in the method than the maximum value that is reported here.
Already, however, you should be able to see the value of the MeasureIt harness. With almost no work at all, you can immediately get a rough idea of how expensive small allocations are. Just as important, because the harness collects multiple samples and computes statistics, you can also understand that some operations, such as object allocation, can have considerable variation. Having minimum, maximum, and standard deviation information lets you determine whether or not the measurements are stable enough to be relied upon.
Figure 1 provides you with a rough idea of the cost of object allocation of small objects in the .NET Framework. However, you can quickly deduce quite a bit more. For example, you can surmise that objects with finalizers—declared in C# as ~ClassName()—are more than 10 times as expensive as objects that don't have finalizers. Worse, finalizers are inherited, which means that allocations of instances of any subclass will be similarly penalized. Thus, if at all possible, finalizers should generally be avoided on a class that will have many instances.
You can also infer that allocating an object using reflection (as in Activator.CreateInstance) is more than 10 times as expensive as allocating an object normally. This will be a common theme with reflection APIs: they are expensive compared with their static equivalents, and you should not use them on performance-critical paths.
Furthermore, you can also conclude from the data that calling a method using reflection (as in MethodInfo.Invoke) is more than 450 times slower than calling the method statically (the ratio is so high in part because calling a method normally is very cheap). Likewise, you are also able to deduce that array access is cheap (less than the cost of a method call), but setting an element in an array of object references (as in string[]) is more than 4 times as expensive as a normal array set. You can also see from Figure 1 that invoking a delegate (pointer to a method) is only about 20 percent slower than invoking a method whose target is known at compile time.
Finally, you can tell that when security checks are suppressed, invoking unmanaged code (P/Invoke) is not too expensive (6 times a normal method call), and the average cost can get even cheaper (2.6 times) if many calls are made from the same method. However, if security checks are used (which is the default), the cost is significantly more (27 to 30 times a normal method call).
Already, some useful observations have been made from the built-in MeasureIt benchmarks. Moreover, since the MeasureIt download includes source code, you can get all the necessary details to drill deeper into the data. For example, you may want to know exactly how to suppress the security checks for P/Invoke calls (or even how to make a call to native code in the first place). To do so, just unpack the source using the following command:
measureIt /edit
Then search for the P/Invoke benchmark. You'll get the exact code you need, which you can examine, debug, and even modify for your own purposes.
While MeasureIt quickly tells you the costs of some of the basic operations performed by the .NET runtime, it will not tell you why the numbers are what they are. It is very possible that you are not measuring what you think you are measuring.
In last month's column I highlighted how easy it is to create misleading benchmarks (especially micro-benchmarks), underscoring the importance of validating performance measurements before relying on them. Remember that you must also validate the data from the built-in benchmarks.
Here is where some expertise in the internals of .NET runtime is helpful. By knowing exactly how operations within the runtime are translated into machine instructions, I can make estimates on how long various operations should take. This is summarized in Figure 2. Because of the optimizations that the hardware itself performs, instruction counts do not map precisely to execution time; however, they do provide a good ballpark estimate. If you ever find that the instruction execution time far exceeds your expectations based on the instruction count, it's more than likely that the data is incorrect. The data shown in Figure 2 is best thought of as a qualitative view of the operations, which nicely compliments MeasureIt's quantitative view.
The information in Figure 2 is useful for validating the performance numbers that MeasureIt provides, but it also allows you to make rough predictions about the performance of your programs and the costs of various design trade-offs. For example, you can predict the cost trade-off between two code factoring designs that differ in how often they cross a managed-to-unmanaged (P/Invoke) boundary. You can also make predictions about how much would be saved by avoiding reflection methods or the cost of adding locks to a program to make it thread safe.
Personally, I have found the data from MeasureIt to be very helpful in making many performance decisions. However, the built-in benchmarks focus on CPU-intensive operations. If your application's performance is not limited by the CPU, a focus on CPU optimization is misguided. Three common scenarios where CPU is not the critical performance factor are when the application response time is dominated by I/O costs or network latency, when the application response time is dominated by memory caching costs, and when, in a multithreaded application, the application response time is impacted by thread serialization delays.
The performance of some applications is limited by the speed at which they can execute disk and network I/O, which is many times slower than actual instruction execution. This can happen on application startup when the application requests files that are not yet cached in memory (cold startup). It can also happen when your application consumes enough memory to cause many page faults. In general, if your application is not consuming 100 percent of a processor, then these other latencies are important. Most of the information I present here is not relevant to this case since this column focuses on identifying CPU-bound performance bottlenecks.
In other applications, response time is dominated by memory caching costs. This typically happens when you have large, in-memory data structures. The application looks CPU bound (it consumes 100 percent of a processor), but the CPU is actually waiting on the memory subsystem most of the time. You need a profiler with access to statistics on the memory subsystem to diagnosis this properly, but large memory consumption (> 50MB) and a large number of page faults are usually an indication that memory issues are a factor. Generally, your design strategy in this case should be to make things small (even if it increases the number of instructions executed).
In multithreaded applications operating on multiprocessor hardware, it is not uncommon for the threads to be delayed as they wait to enter critical sections of code—code that must be accessed by only one thread at a time. These serialization delays tend to grow worse as the number of concurrently executing threads grows. This lock contention can show up either as being non-CPU bound (when the wait time is long and the waiting thread goes to sleep) or CPU bound (when the wait time is very short but happens very frequently).
Sadly, explaining these non-CPU issues well enough to impart good design intuition would require another full column or two. Just remember that memory consumption does matter (especially for large applications), and you should try to decrease the size of data structures that are larger than 1MB if the trade-off in instruction count is modest.
However, if your application consumes 100 percent of one CPU when it is running (with very few or no page faults), consumes a modest amount of memory when CPU bound (hot data structures are less than 1MB), and is not intended to be a highly parallelized application that relies on shared data, then your performance issue is likely related to the number of instructions that execute. Therefore, the information in Figure 2 and the data MeasureIt provides will be useful in predicting performance.
The .NET runtime team goes to some trouble to optimize operations whenever possible. Generally speaking, this is a great thing, but it does create an uneven performance landscape because there are cases where the optimization can't be applied. We have already seen an example of this: finalizable allocation. Here are some more examples.
First of all, interface calls are highly tuned for the case in which a particular call site tends to dispatch to the same target. For example, you could have a routine that takes a List<object> and calls ICompareable.Compare on each element. If this was passed a List of strings, the interface call would always go to String.Compare and would be fast. However, if the list contained many different types, then that same call site must dispatch to many different targets. This latter case is significantly slower than the homogenous case.
Second, when primitive types (such as int) need to be passed to operations expecting an object, the value must be converted to an object (this operation is known as boxing). This requires an object allocation and thus is more expensive than expected. Often this boxing is inserted by the compiler automatically, making it all too easy to fall into this trap.
Third, methods that use the variable argument params feature of C# (such as Console.WriteLine) need to allocate an array—and possibly box (allocate) objects for some parameters—for each call. This makes the call much more expensive than normal calls (by a factor of 10). There are dozens of more examples that I could easily list, but these will suffice.
While people who know about the internals of the runtime can warn you about some of these, realistically the list will never be complete. This is why the information in Figure 2 and the data gathered from MeasureIt should only be used as a rough approximation. You never know when you might fall into a bad performance code path. Thus it pays to be cautious: if there is even a possibility that your hot code path might not be optimized, you should not rely solely on deductions made from Figure 2 but instead write a quick micro-benchmark that very accurately represents your hot code path and measure it yourself.
Send your questions and comments to clrinout@microsoft.com.