Writing code that runs quickly is sometimes at odds with writing code quickly. C.A.R. Hoare, computer science luminary and discoverer of the QuickSort algorithm, famously proclaimed, "Premature optimization is the root of all evil." The extreme programming design principle of "You Aren't Gonna Need It" (YAGNI) argues against implementing any features, including performance optimizations, until they're needed.
Writing unnecessary code is undoubtedly bad for work efficiency. However, it's important to realize that different situations have different needs. Code for vehicular real-time control systems has inherent up-front responsibilities for stability and performance that aren't present in, say, a small one-off departmental application. Therefore, it's more important in such code to optimize early and often.
Performance tuning for real-world applications often involves activities geared towards finding bottlenecks: in code, in the network transport layer, and at transaction boundaries. However, these techniques alone cannot solve the dreaded problem of uniformly slow code, which surfaces when large bottlenecks have been resolved but the code still exhibits inadequate performance. This is code that has been written without attention to correct usage, often by junior programmers, in the same style across whole modules or applications. Unfortunately, the best solution for this problem is to make sure that all programmers on a project follow correct coding practice when writing the code the first time; coding guidelines and good shared libraries help enormously.
This article presents helpful tips for writing in-process .NET managed code that performs well. It's assumed that basic programming skills such as factoring control structures, pulling work outside of loops whenever possible, caching variables for reuse, use of the switch statement, and the like are known to the average reader.
All code examples referred to in this article can be downloaded from the .NET Developer's Journal Web site, at www.sys-con.com/dotnet/sourcec.cfm. The code comes with a Windows Forms application that can be used to easily view the code and run all the tests on your own machine. You'll need the .NET Runtime 1.1 to run the code.
Tools
While testing tools such as NUnit and the upcoming VS.NET 2005 Team System can help you find bottlenecks, when tuning small sections of code, there's still no substitute for the micro-benchmark. This is because most generic testing frameworks depend on things like delegates, attributes, and/or interface method calls to do testing, and the code usually is not written with benchmarking primarily in mind. This can be very significant if you're interested in measuring the execution time of a batch of code down to the microsecond or even nanosecond level.
A micro-benchmark consists of a tight loop isolating the code that's being tested for performance, with a time reading before and after. When the test has finished, the start time is subtracted from the end time, and this duration is divided by the number of iterations to get the per-iteration time cost. The following code shows a simple micro-benchmark construct:
int loopCount = 1000000000;
long startTime, endTime;
double nanoseconds;
startTime = DateTime.Now.Ticks * 100;
for(int x = 0; x < loopCount; x++) {
// put the code to be tested here
}
endTime = DateTime.Now.Ticks * 100;
nanoseconds = ((double)(endTime - startTime)) / ((double)loopCount);
Console.WriteLine(nanoseconds.ToString("F") + " ns per operation");
When performing a simple micro-benchmark, it's important to remember a couple of things. First, small fluctuations (noise) are normal, so to obtain the most accurate results, each test should be run several times. In particular, the first set of tests executed after program initialization may be skewed due to the lazy acquisition of resources by the .NET runtime. Also, if your results are very inconsistent, you may not have penetrated the "noise floor" of the measurement. The best solution for this is to increase the number of loops and/or tests. Another thing to remember is that looping itself introduces overhead, and for the most accurate readings, you should subtract this from the result. On a P4-M 2-GHz laptop, the per-loop overhead for a for loop with an int counter in release mode is around 1 nanosecond.I'd never advocate running each code fragment from a long program through micro-benchmarks, but benchmarking is a good way to become familiar with the relative costs of different types of expressions. True knowledge of the performance of your code is built on actual observations. As time goes on, you'll find yourself needing such tests less and less, and you'll keep track in the back of your head of the relative performance of the statements you're writing.
Another important tool is ildasm.exe, the IL disassembler. With it, you can inspect the IL of your release builds to see if your assumptions are correct about what's going on under the covers. IL is not hard to read for a person familiar with the .NET framework; if you're interested in learning more, I suggest starting with Serge Lidin's book on the subject.
A great free tool for decompiling IL to C# or VB source, Reflector, is found at www.aisto.com/roeder/dotnet/; it's incredibly useful for viewing code that ships with the .NET Framework, for those of you less familiar with IL.
The CLR Profiler, available as a free download from Microsoft's Web site, allows you to track memory allocation and garbage collection activity, among other useful features. Also, the MSDN Web site has excellent coverage of performance metrics tools such as WMI and performance counters.
Working with Objects and Value Types
Objects: A Double Whammy
Objects are expensive to use, partly because of the overhead involved in allocating memory from the heap (which is actually well-optimized in .NET) and partly because every created object must eventually be destroyed. The destruction of an object may take longer than its creation and initialization, especially if the class contains a custom finalization routine. Also, the garbage collector runs in an indeterministic way; there's no guarantee that an object's memory will be immediately reclaimed when it goes out of scope, and until it's collected, this wasted memory can adversely affect performance.
The Garbage Collector in a Nutshell
It's necessary to understand garbage collection to appreciate the full impact of using objects. The single most important fact to know about the garbage collector is that it divides objects into three "generations": 0, 1, and 2. Every object starts out in generation 0; if it survives (if at least one reference is maintained) long enough, it goes to 1; much later, it transitions to 2. The cost of collecting an object increases with each generation. For this reason, it's important to avoid creating unnecessary objects, and to destroy each reference as quickly as possible. The objects that are left will often be long-lived and won't be destroyed until application shutdown.
Lazy Instantiation/Initialization
The Singleton design pattern is often used to provide a single global instance of a class. Sometimes it's the case that a particular singleton won't be needed during an application run. It's generally good practice to delay the creation of any object until it's needed, unless there's a specific need to the contrary - for instance, to pre-cache slow-initializing objects such as database connections. The "double-checked locking" pattern is useful in these situations, as a way to avoid synchronization and still ensure that a needed action is only performed once. Lazy initialization is a technique that can enhance the performance of an entire application through object reduction.
Avoiding Use of Class Destructors
Class destructors (implemented as the Finalize() method in VB.NET) cause extra overhead for the garbage collector, because it must track which objects have been finalized before their memory can be reclaimed. I've never had a need for finalizers in a purely managed application.
Casting and Boxing/Unboxing Overhead
Casting is the dynamic conversion of a type at runtime to another, and boxing is the creation of a reference wrapper for a value type (unboxing being the conversion back to the wrapped value type). The overhead of both is most heavily felt in collections classes, as they all - with the exception of certain specialized ones like StringDictionary - store each value as an Object. For instance, when you store an Int32 in an ArrayList, it is first boxed (wrapped in an object) when it is inserted; each time the value is read, it is unboxed before it is returned to the calling code.
This will be fixed in the next version of .NET with the introduction of generics, but for now you can avoid it by creating strongly typed collections and by typing variables and parameters as strongly as possible. If you're unsure about whether or not boxing/unboxing is taking place, you can check the IL of your code for appearances of the box and unbox keywords.
Trusting the Garbage Collector
Programmers new to .NET sometimes worry about memory allocation to the point that they explicitly invoke System.GC.Collect(). Garbage collection is a fairly expensive process, and it usually works best when left to its own devices. The .NET garbage collection scheme can intentionally delay reclamation of objects until memory is available, and in particular longer-lived objects (those that make it to generation 1 or 2) may not be reclaimed for an extended period. Even a simple "Hello, world!" console application may allocate 15 MB or more of memory for its "working set." My advice: don't call GC.Collect() unless you really know what you're doing.
Properties, Methods, and Delegates
Avoiding Overuse of Property Getters and Setters
Most people don't realize that property getters and setters are similar to methods when it comes to overhead; it's mainly syntax that differentiates them. A non-virtual property getter or setter that contains no instructions other than the field access will be inlined by the compiler, but in many other cases, this isn't possible. You should carefully consider your use of properties; from inside a class, access fields directly (if possible), and never blindly call properties repeatedly without storing the value in a variable. All that said, this doesn't mean that you should use public fields! Example 1 demonstrates the performance of properties and field access in several common situations.
About Delegates
Delegates are slower to execute than interface methods. Delegates are often used to introduce a level of indirection in code, but in almost all cases interfaces allow a cleaner design. Of course, it's impossible to completely shun delegates; the entire event-handling paradigm in .NET is based on them. Example 2 compares the performance of delegates and direct method calls.
Minimizing Method Calls
The .NET compiler is capable of performing many optimizations for release builds. One of them is called "method inlining." If method A calls method B and certain other conditions are met, such as the code in method B being small enough, the code from B will be copied into A during compilation. However, .NET won't or can't inline certain types of methods, such as virtual methods or methods over a certain size. Each method invocation/property access entails significant overhead, such as the allocation of a stack frame, etc. Of course, you should never repeatedly call a method for the same result on purpose, but you should also be mindful of the impact of method calls in general.
Using the 'Sealed' Keyword
Wherever extensibility is not required, you should use the sealed keyword. This makes your design easier to understand, as someone can tell at a glance if a certain class or method isn't meant to be extended or overridden. It also increases the chances that the compiler will inline code.
Working with Collections
Avoid Overuse of Collections
This might sound strange, but I'm not advocating working without data structures. The fact is, I've seen collections used many times when they don't actually simplify the code or provide any benefit at all. The single biggest avoidable use of collections is the use of an ArrayList when a simple array would suffice. It should be obvious that there's no way that calling a collection method - which may do significant work under the covers - can compare to something as simple as an array access for performance. See Example 3 for more information.
for vs foreach
The rumor abounds that the foreach loop is bad for performance. The truth is actually a little more complicated. Basically, foreach involves no performance penalty when used against arrays. However, when used against lists it involves the same overhead as creating an enumerator and using it within a try/catch block! Ildasm.exe may come in handy here to see what's going on. This isn't a complete killer, but if you do enough list access you may want to avoid it. Example 4 compares the two statements for access against an array and an ArrayList.
Enumerators: Don't Go Overboard
Just because the possibility exists for the enumerating of a collection doesn't mean you have to do it. For instance, the ArrayList class is useful as an array replacement; one of its best features is indexed element access. By using an enumerator with ArrayList, you hide its most useful features and introduce needless overhead. Example 5 shows the performance difference between indexed access and the use of an enumerator.
Avoid Overuse of Collection Wrappers
A peek at the code in classes such as ArrayList shows that, to implement synchronized, fixed-size, and read-only versions of these, methods such as Synchronized() actually create a new wrapper list around the original one (this scheme was copied from Java). While this is nice from certain design standpoints, it degrades performance; the method-call overhead for each operation is multiplied. For a fixed-size, synchronized ArrayList, this overhead is tripled! Chances are, if you're working with a collection and need synchronization, the code using the collection is already synchronized. In any case, simply locking on the collection itself around every access turns out to be faster than using a synchronized wrapper. Example 6 compares the use of a synchronized ArrayList with synchronizing access to an ArrayList.
Working With Strings
Don't Use String.Format() to Concatenate Strings
While string-formatting routines built into .NET are very useful for globalization and other tasks, they're not meant to be used for appending strings to each other. Example 7 tests the performance difference between String.Format() and string concatenation.
Use StringBuilder to Build Strings Inside Loops
The StringBuilder class is basically an array list for string fragments; the StringBuilder takes care of expanding its internal char array, hiding this from the user. The use of this class is also specially optimized by the .NET compiler, making it impossible to duplicate this functionality with equivalent performance (for instance, by manipulating char arrays directly and converting to a string afterwards). Example 8 shows the performance benefit of building a large string using StringBuilder instead of string concatenation.
Don't Use StringBuilder to Concatenate Small Numbers of Strings
Many .NET developers who consider themselves well-versed in performance matters advocate the use of the StringBuilder class whenever possible. However, it's not the fastest approach for concatenating small numbers of strings. Actually, any number can be combined in a single statement, although the performance benefit tends to dwindle above five or six substrings. This is due to instantiation and destruction overhead for the StringBuilder instance, as well as method-call overhead involved in calling Append() once for every added substring and ToString() once the string is built. What are the alternatives? Plus-sign concatenation and the String.Concat() method are equivalent; I prefer plus signs for readability. Note that this cannot be used across loops because the entire concatenation must occur within a single statement. Example 9 tests String.Concat() vs. StringBuilder for various numbers of substrings.
Don't Be Afraid to Use String Literals
Many developers incorrectly assume that a new object is created for every string literal, and, therefore, avoid their use. In some cases, using string literals directly in your code is a better approach than using string constants! It can make the code easier to understand and has no adverse impact on performance. This is due to the use of the .NET interned string table; this table maintains a String instance for every known string and reuses this instance whenever the same character sequence is used as a literal. See the documentation of the String.Intern() method for more details. Example 10 compares string literals to the use of string constants.
Threading
In itself, this is a huge topic. Most easy-to-apply, synchronization-related optimizations focus on minimizing the amount of code executing inside synchronized blocks, which may involve moving some code from inside to outside such blocks. In some cases, thoughtful programming may allow elimination of synchronization entirely from a class (see "Immutable Objects").
Performing a multithreaded micro-benchmark is more complicated than running a simple loop. In the simplest method, multiple threads are spawned, each looping over the benchmarked code; the reading is not started until all threads are successfully executing the code, and is terminated before the threads are shut down.
Thread Reuse
Newbie programmers make the common mistake of spawning a new thread for every request or other action. This can result in worse performance than using a single-threaded approach; the relative performance degradation is worse the quicker the individual requests can be serviced.
Writing Your Own Threading Code
Even better than the .NET thread pool, with its dependence on delegates, is to write your own threading code. Instead of using QueueUserWorkItem(), you typically write your own queueing code to coordinate work items. This also allows other benefits such as priority-based queueing.
Immutable Objects
Immutable objects are objects in which the data cannot be changed. In most cases, this is achieved by setting all fields in constructor methods and providing only property getters and/or methods that retrieve data from the object, without any mutator logic whatsoever. Many classes in the .NET Common Type System are immutable: System.String, System. Drawing.Font, etc. In addition, care should be taken that any values returned from property accessors, etc. are immutable as well. Otherwise, this data may be copied to insure the integrity of the object itself. Example 11 shows the performance benefit of immutable objects over synchronization.
Data Copying
This flies in the face of the advice given earlier, to minimize the use of objects. However, it's really the other side of the coin from immutable objects. Object copying allows you to use the data in a non-immutable object, but in a way that still completely avoids synchronization. The more highly multithreaded the environment, the more strategies like this make sense.
Read-Write Locks
Synchronization issues in managed code mirror those in databases. In some situations, optimistic concurrency strategies can be used; in some dirty reads are acceptable, etc. For situations in which a structure is seldom updated and often read, the ReaderWriterLock class can give significant performance benefits over simple synchronization. It allows either multiple read access or single write access at once. Example 12 compares ReaderWriterLock to simple synchronization in a read-heavy scenario.
Minimizing Synchronized Blocks
The use of the [MethodImpl Attribute(MethodImplOptions.Synchronized)]attribute should be avoided, as it always locks an entire method and is also non-standard C# usage. Instead, the lock keyword or one of the System.Threading classes should be used. Wherever possible, adjust the start of a synchronized section forward and the end backward. Do whatever you can to decrease the number of synchronized operations.
Suggested Reading