Last week, I wrote a somewhat lengthy blog post describing a number of things I’ve done recently to improve Castalia’s speed and responsiveness. Several of you asked for more information about the instrumentation that I used to measure performance times. In this post, I’ll describe two techniques for measuring the speed of a segment of code.
The very easy way
In a nutshell, measuring the speed of a segment of code is incredibly easy. You look at what time it is when the code starts running, then you look again when it’s done, and the difference tells you how long it took the code to run.
So, the only thing we need to figure out how to do is how to “look at what time it is.” I used two mechanisms to do this. The first one is the GetTickCount function. GetTickCount is a Windows API, not part of the VCL. It simply returns the number of milliseconds that have elapsed since the computer was started up. Here’s an example of how to use GetTickCount:
var StartTime, EndTime, Delta: Integer; begin StartTime := GetTickCount; //Code you want to measure here EndTime := GetTickCount; Delta := EndTime - StartTime; //Show Delta so you know the elapsed time end;
It’s really that simple. This will tell you how many milliseconds it took to execute the code in begin measured.
There is one problem with GetTickCount, however: It’s not reliably accurate for timespans less than about 15 milliseconds. Above that, it’s fine, and is OK for an approximation to figure out the big bottlenecks might be, but what if you want to measure the execution of a function that’s a lot faster than 15ms already?
The slightly less easy way
The next way takes just a little bit more work, but not much. Windows includes a mechanism called Performance Counters, intended for exactly this purpose. In working with Castalia, I used the QueryPerformanceCounter function exactly the same way as GetTickCount: Get the time at the start of the code I’m measuring, get the time again at the end, and record the difference.
While GetTickCount measures time in milliseconds, QueryPerformanceCounter’s measurement of time is hardware-dependent. In fact, QueryPerformanceCounter relies on the CPU having a high-resolution performance counter, and won’t work if your CPU doesn’t have one (as far as I’m aware, all modern desktop CPU’s have one, so you probably don’t need to worry about that).
Technical historical aside: At its inception, QueryPerformanceCounter was intended to be an easy way to access the Pentium CPU’s RDTSC instruction, which returns the number of CPU cycles that have elapsed. MultiProcessor/MultiCore/Hyperthreaded architectures have made measuring CPU cycles unreliable, so now RDTSC simply measures some very small unit of time.
Because it’s measuring much faster units of time, and needs much larger numbers, QueryPerformanceCounter requires an Int64. Here’s an example of using QueryPerformanceCounter to measure the time of some code:
var StartTime, EndTime, Delta: Int64; begin QueryPerformanceCounter(StartTime); //Code you want to measure here QueryPerformanceCounter(EndTime); Delta := EndTime - StartTime; //Show Delta so you know the elapsed time end;
That’s very simple too. We can use those numbers for comparisons as-is, but I like to translate them into milliseconds in order to get a more familiar unit of measure that’s also comparable between different systems. To do that, we need to know exactly what unit of time QueryPerformanceCounter is measuring. To get that information, we use the function QueryPerformanceFrequency.
QueryPerformanceFrequency tells us the number of units per second that the high-resolution performance counter is using. It also takes an Int64. We only need to call it once per application, since it doesn’t really change (actually, it can if the CPU gets throttled for power/temperature reasons, but lets assume that it won’t happen while you’re running your tests). To find out how many high-resolution units are in a millisecond, do something like this:
var Frequency, UnitsPerMS: Int64; begin QueryPerformanceFrequency(Frequency); UnitsPerMS := Frequency div 1000; end;
Then you can divide your performance Delta by UnitsPerMS to find out how many milliseconds your code is actually taking to run. For example, in a VM on my Sandy Bridge i7, Frequency = 3,579,545. That means there are approximately 3,580 “units” per millisecond. If the QueryPerformanceCounter delta on some code is 385, that means the code took 0.1 milliseconds to run.
That’s a MUCH higher resolution than GetTickCount
Why not use QueryPerformanceCounter all the time?
There’s one problem with QueryPerformanceCounter: It can be very slow compared to GetTickCount. On the same Sandy Bridge i7 as above, with Windows 7, QueryPerformanceCounter is well over 100 times slower than GetTickCount.
This probably only matters if you’re measuring the speed of code that includes calls to QueryPerformanceCounter. This is easy to do (and not realize you’re doing) if you measure the time of a function that calls another function which you’re also instrumenting. That’s just something to keep in mind and be careful about.
If you’re wanting to ship instrumented code to users, be very careful about overusing QueryPerformanceCounter to do it; it can actually negatively affect the speed of your code.
Showing the information
In the code examples above, I sort of skipped an important step, which is actually displaying the delta values. There are a few ways to do that, each with their own set of caveats.
For a VCL Forms application, the easiest is probably with a call to ShowMessage(IntToStr(Delta)). This shows you the number quickly, but it also blocks the application, and would be downright impossible to manage if you were measuring each iteration of a large loop or each level of a deeply nested set of function calls.
Another alternative would be the OutputDebugString API. OutputDebugString logs any string to the debugger. In Delphi, this string will appear in the “Messages” window in the IDE. Use this when you want to record a lot of data as you use your application, then analyze it later. Beware, however, that this has the same caveat as QueryPerformanceCounter: It’s not a very fast function, and overusing it can have negative effects on your code’s performance. If you use calls to OutputDebugString in your performance-sensitive code, you’ll definitely want to remove them in your release build.
Finally, you could store the data internally, then show the result once using one of the mechanisms above. For example, if you were testing the speed of each iteration of a loop, and the loop was running 10,000 times, you could create an array of 10,000 integers, put the Delta from each loop in each one, and then display them all at once however you like (or, calculate an average and display that). This is the approach I used most with my recent work on Castalia, and it worked very well.
Final thoughts
If you want to improve performance, you have to first measure performance. Using GetTickCount or QueryPerformanceCounter, you can measure the speed of your code with varying degrees of reliability and resolution. Choose the one that best meets your needs.
Appendix A (Added about 7 hours after this post was initially published)
Reader Alan C posted a comment below pointing out the TStopWatch utility class, which encapsulates all of the above into a Delphi-friendly class. Because Castalia is compatible with Delphi versions back to Delphi 5, and TStopWatch was added in Delphi 2010, I didn’t use it in the work I’ve done with Castalia this year.
TStopWatch does indeed encapsulate the functionality described above, defaulting to use the high resolution timer if it’s available, and using GetTickCount otherwise. Still, even if you use TStopWatch, it’s good to understand the limitations and performance impact that it can have. Thanks Alan for pointing this out!
This post was written by Jacob Thurman, creator of Castalia, the Smart Code Editor for Delphi Programmers. Check out Castalia at http://www.twodesk.com/castalia.
The next step, now that I had this data, was to figure out where the problem actually was. The question I needed to answer was “What code is actually creating the problem?”
Sometimes, when we’re fixing bugs or improving performance, we find a single fix that solves the problem. There have been many times where a seemingly complex bug was, in-fact, the result of a single-line error that only needed a single-line fix.