Making Slow Code go Faster: Measuring Speed to Improve Performance

 

rulerLast week, I wrote a somewhat lengthy blog post describing a number of things I’ve done recently to improve Castalia’s speed and responsiveness. Several of you asked for more information about the instrumentation that I used to measure performance times. In this post, I’ll describe two techniques for measuring the speed of a segment of code.

The very easy way

In a nutshell, measuring the speed of a segment of code is incredibly easy. You look at what time it is when the code starts running, then you look again when it’s done, and the difference tells you how long it took the code to run.

So, the only thing we need to figure out how to do is how to “look at what time it is.” I used two mechanisms to do this. The first one is the GetTickCount function. GetTickCount is a Windows API, not part of the VCL. It simply returns the number of milliseconds that have elapsed since the computer was started up. Here’s an example of how to use GetTickCount:

var
  StartTime, EndTime, Delta: Integer;
begin
  StartTime := GetTickCount;
  //Code you want to measure here
  EndTime := GetTickCount;
  Delta := EndTime - StartTime;
  //Show Delta so you know the elapsed time
end;

It’s really that simple. This will tell you how many milliseconds it took to execute the code in begin measured.

There is one problem with GetTickCount, however: It’s not reliably accurate for timespans less than about 15 milliseconds. Above that, it’s fine, and is OK for an approximation to figure out the big bottlenecks might be, but what if you want to measure the execution of a function that’s a lot faster than 15ms already?

The slightly less easy way

The next way takes just a little bit more work, but not much. Windows includes a mechanism called Performance Counters, intended for exactly this purpose. In working with Castalia, I used the QueryPerformanceCounter function exactly the same way as GetTickCount: Get the time at the start of the code I’m measuring, get the time again at the end, and record the difference.

While GetTickCount measures time in milliseconds, QueryPerformanceCounter’s measurement of time is hardware-dependent. In fact, QueryPerformanceCounter relies on the CPU having a high-resolution performance counter, and won’t work if your CPU doesn’t have one (as far as I’m aware, all modern desktop CPU’s have one, so you probably don’t need to worry about that).

Technical historical aside: At its inception, QueryPerformanceCounter was intended to be an easy way to access the Pentium CPU’s RDTSC instruction, which returns the number of CPU cycles that have elapsed. MultiProcessor/MultiCore/Hyperthreaded architectures have made measuring CPU cycles unreliable, so now RDTSC simply measures some very small unit of time.

Because it’s measuring much faster units of time, and needs much larger numbers, QueryPerformanceCounter requires an Int64. Here’s an example of using QueryPerformanceCounter to measure the time of some code:

var
  StartTime, EndTime, Delta: Int64;
begin
  QueryPerformanceCounter(StartTime);
  //Code you want to measure here
  QueryPerformanceCounter(EndTime);
  Delta := EndTime - StartTime;
  //Show Delta so you know the elapsed time
end;

That’s very simple too. We can use those numbers for comparisons as-is, but I like to translate them into milliseconds in order to get a more familiar unit of measure that’s also comparable between different systems. To do that, we need to know exactly what unit of time QueryPerformanceCounter is measuring. To get that information, we use the function QueryPerformanceFrequency.

QueryPerformanceFrequency tells us the number of units per second that the high-resolution performance counter is using. It also takes an Int64. We only need to call it once per application, since it doesn’t really change (actually, it can if the CPU gets throttled for power/temperature reasons, but lets assume that it won’t happen while you’re running your tests). To find out how many high-resolution units are in a millisecond, do something like this:

var
  Frequency, UnitsPerMS: Int64;
begin
  QueryPerformanceFrequency(Frequency);
  UnitsPerMS := Frequency div 1000;
end;

Then you can divide your performance Delta by UnitsPerMS to find out how many milliseconds your code is actually taking to run. For example, in a VM on my Sandy Bridge i7, Frequency = 3,579,545. That means there are approximately 3,580 “units” per millisecond. If the QueryPerformanceCounter delta on some code is 385, that means the code took 0.1 milliseconds to run.

That’s a MUCH higher resolution than GetTickCount

Why not use QueryPerformanceCounter all the time?

slowThere’s one problem with QueryPerformanceCounter: It can be very slow compared to GetTickCount. On the same Sandy Bridge i7 as above, with Windows 7, QueryPerformanceCounter is well over 100 times slower than GetTickCount.

This probably only matters if you’re measuring the speed of code that includes calls to QueryPerformanceCounter. This is easy to do (and not realize you’re doing) if you measure the time of a function that calls another function which you’re also instrumenting. That’s just something to keep in mind and be careful about.

If you’re wanting to ship instrumented code to users, be very careful about overusing QueryPerformanceCounter to do it; it can actually negatively affect the speed of your code.

Showing the information

In the code examples above, I sort of skipped an important step, which is actually displaying the delta values. There are a few ways to do that, each with their own set of caveats.

For a VCL Forms application, the easiest is probably with a call to ShowMessage(IntToStr(Delta)). This shows you the number quickly, but it also blocks the application, and would be downright impossible to manage if you were measuring each iteration of a large loop or each level of a deeply nested set of function calls.

Another alternative would be the OutputDebugString API. OutputDebugString logs any string to the debugger. In Delphi, this string will appear in the “Messages” window in the IDE. Use this when you want to record a lot of data as you use your application, then analyze it later. Beware, however, that this has the same caveat as QueryPerformanceCounter: It’s not a very fast function, and overusing it can have negative effects on your code’s performance. If you use calls to OutputDebugString in your performance-sensitive code, you’ll definitely want to remove them in your release build.

Finally, you could store the data internally, then show the result once using one of the mechanisms above. For example, if you were testing the speed of each iteration of a loop, and the loop was running 10,000 times, you could create an array of 10,000 integers, put the Delta from each loop in each one, and then display them all at once however you like (or, calculate an average and display that). This is the approach I used most with my recent work on Castalia, and it worked very well.

Final thoughts

If you want to improve performance, you have to first measure performance. Using GetTickCount or QueryPerformanceCounter, you can measure the speed of your code with varying degrees of reliability and resolution. Choose the one that best meets your needs.

Appendix A (Added about 7 hours after this post was initially published)

Reader Alan C posted a comment below pointing out the TStopWatch utility class, which encapsulates all of the above into a Delphi-friendly class. Because Castalia is compatible with Delphi versions back to Delphi 5, and TStopWatch was added in Delphi 2010, I didn’t use it in the work I’ve done with Castalia this year.

TStopWatch does indeed encapsulate the functionality described above, defaulting to use the high resolution timer if it’s available, and using GetTickCount otherwise. Still, even if you use TStopWatch, it’s good to understand the limitations and performance impact that it can have. Thanks Alan for pointing this out!

This post was written by Jacob Thurman, creator of Castalia, the Smart Code Editor for Delphi Programmers. Check out Castalia at http://www.twodesk.com/castalia.

Read full story · Comments { 6 }

Making Slow Code go Fast

This year’s Castalia releases have seen huge improvements in performance.  Because performance optimization is important to Delphi programmers, and something we all have to face at some point, I’ve decided to write about what I did to get those performance gains. I hope this will help you to improve the code you write too.

Identifying the Problem

The Qualitative Problem

slowThere had been “qualitative” concerns about Castalia’s performance for a while. Users would write things like “it’s slow” or “it feels sluggish.” Talking to users who had these complaints, I learned that the performance complaints were in two areas:

1. The code editor would lag behind when scrolling using the mouse scroll wheel

2. The code editor could lag behind when typing code, if you typed sufficiently fast enough.

So, at the beginning of 2013, I determined that I was going to get the bottom of these issues and fix them once and for all, even if I had to rewrite the whole parser in hand-optimized assembler code.

The Quantitative Problem

rulerThe secret of real performance improvement is measurement. I needed some way to compare Castalia’s performance to the performance of Delphi without Castalia, and then to measure whether the improvements I would make were actually having an impact. Qualitative descriptions like “slow” or “sluggish” aren’t things you can compare; I needed real numeric data.

I spent a day writing some code instrumentation. In other words, I wrote a simple framework that would allow me to measure the exact time that certain operations took. I could time the execution of a function, a single line of code, or response to an event like a key press, mouse click, or a paint event in the code editor.

Next, I gathered data on current performance. First, I ran Delphi with only my instrumentation enabled, and no other Castalia features. This allowed me to see how fast Delphi responds to key presses, scroll wheel motion, and other events.

Then, I enabled all of Castalia’s features, and ran the exact same tests, collecting data to see how Delphi responded to those same events, but with Castalia doing its thing along the way.

Comparing these two data sets gave me the difference in performance between Delphi by itself and Delphi with Castalia. Now, instead of words like “slow” or “sluggish,” I had actual numbers. This turned a qualitative problem into a quantitative problem, with good measurements, and things that can be measured can be improved!

Finding Bottlenecks

trafficjamThe next step, now that I had this data, was to figure out where the problem actually was. The question I needed to answer was “What code is actually creating the problem?

I have to admit, I started down the wrong path on this. I had gotten a lot of emails from users saying things like “It’s slow because the parser runs after every keystroke,” and “The parser isn’t fast enough for this enormous file I’m working on.” Those are very reasonable complaints, and since certain parts of the Castalia parser do need to run for every keystroke in the editor, it had better be incredibly fast, even on very large files.

After several days of work, I had to come to the conclusion that the parser wasn’t the problem. The parser was already blindingly fast. It could churn through huge files in single-digit milliseconds. I even wrote a quick program to generate syntactically correct gibberish source code files, and had it generate a file of over 300,000 lines of code, which the Castalia parser handled in just a few milliseconds. The parser was plenty fast. The problem had to be somewhere else.

In hindsight, my next strategy was much smarter. I commented out the initialization code for every Castalia feature except the code that was necessary for the performance measurements. Then, I gradually re-enabled features one at a time, running the performance tests and collecting data again.

When enabling a feature created a spike in the response time to some event, I would add more and more instrumentation code as needed, until I had narrowed it down to some specific thing that I could work on optimizing. Then, I would work through that bottleneck, bringing the performance time down as low as possible.

Fixing the Problem(s)

toolsSometimes, when we’re fixing bugs or improving performance, we find a single fix that solves the problem. There have been many times where a seemingly complex bug was, in-fact, the result of a single-line error that only needed a single-line fix.

Unfortunately, there was no single fix that solved all of Castalia’s performance problems. The huge gains in performance that have happened this year are the cumulative result of a number of big improvements. There was one that was bigger than all the others, but it certainly wasn’t a small single-line fix. More on that later…

For the remainder of this post, I’m going to outline a few specific bottlenecks that existed, and how I fixed them.

Getting Source Code from the Editor

Castalia has a function called GetSourceFromModule that gets source code out of the Delphi editor so Castalia can do things with it. This routine was written years ago, before the IDE supported Unicode, and Unicode support had been rather inelegantly shoved in. Since this routine gets called fairly frequently, and wasn’t as fast as it could be, I rewrote the whole function, making a number of changes to make it faster:

  • First, I learned that it’s faster to declare a variable of type AnsiString and call SetLength multiple times than to allocate a new PAnsiChar each time you need a string of a different size.
  • I wrote smarter string manipulation code to eliminate a memory allocation and a string concatenation.
  • Finally, some smarter coding eliminated a string conversion from UTF8 to WideString for the Unicode-Enabled versions of Delphi.

These changes made the GetSourceFromModule routine 20% faster, and made overall keystroke response 7% faster.

Caching

Caching is a large and complex subject, about which entire books (and several dissertations and theses) have been written. The basic idea of caching is this: If generating data is expensive, but figuring out whether it needs to be regenerated is cheap, then put the generated data in a cache that’s cheap, and only re-generate it when it has changed, instead of every time it’s needed.

The GetSourceFromModule routine described above is a good example of this. Getting the source code out of the editor is actually fairly expensive (where “expensive” means time-consuming).  But figuring out if the code has changed since the last time you read it is much cheaper. So, instead of every Castalia feature that wants the source code calling GetSourceFromModule, there’s now a cache layer that takes care of calling GetSourceFromModule once for each time the code changes, and caching it into a string that it can give to Castalia’s various sub-modules much more quickly.

There were several more places where I implemented a caching strategy. The net result was an 11% improvement in keystroke response time.

Record Pooling

Castalia’s structural highlighting requires a fairly complex data structure to be generated. This data structure contains info about all of the extra syntax highlighting that Castalia needs to draw into the code editor. Because this structure has to be generated every time the code changes (because a code change could be a structural change), generating the structure needs to be incredibly fast.

This data structure consists of a large number of records, allocated with a call to New and destroyed with a call to Dispose. Calling New and Dispose thousands of time for each regeneration of the tree proved to be a bottleneck, so I implemented a “pooling” strategy.

In a nutshell, now Castalia’s structural highlighting sort of has its own special memory manager. When a record in the data structure is freed, instead of releasing the memory back to the actual memory manager, it remains in a “pool” of already-allocated but currently-unused data structures. When a new one is needed, instead of allocating one through the memory manager, the next empty one from the pool is used. If the pool is “empty,” then more records are allocated by the real memory manager and thrown into the pool.

Unfortunately, I seem to have misplaced the spreadsheet that had the performance numbers for this improvement, so I can’t tell you what the improvement was exactly, but I remember that it was significant. If I find that document in the future, I’ll update this post with the right numbers.

Navigation Toolbar Repopulation

This was the huge one. While it seemed at first glance that the big bottleneck in performance was the parser, it was, in fact the navigation toolbar. Both the scroll wheel and keystroke response were being affected by this (though the keystroke response issue was only noticeable if you typed fast – considerably faster than I usually do while coding).

The culprit was repopulating the dropdown boxes for classes and methods. In a unit of significant size, an extremely fast typist, or when scrolling, there would be a noticeable “pause” every few keys or every few scroll lines while the navigation toolbar updated.

There were several fixes (and a few missteps) that contributed to the resolution of this problem:

First, caching the data: There’s no need to repopulate the toolbar if nothing has changed. This largely eliminated the scrolling issue, but not the keystroke issue, since a keystroke would usually trigger a change, which would render the cache invalid, triggering the slow repopulation.

Next, get rid of the timer that was periodically looking to see if it needed to update. Instead, make the update event-driven. This is basically a tweak on the caching fix above, so it suffers from the same problem: The problem only occurs when the data is in a state of change anyway.

Now for a misstep: The obvious instinct here is to make the repopulation faster. Faster sorting, build a faster data structure, etc… Part way into this effort, I realized that There’s no need to populate the dropdowns every time the file changes. Most of the time, they only need to show the current class and method, and I could rewrite them to do that without having to fully repopulate all of the data. So…

New plan. Don’t REpopulate the dropdowns at all. Now, the dropdowns are only populated with all of their items on demand, when they’re dropped down. I ended up writing a new dropdown control specifically for this (they were just TComboBoxes before). The new navigation toolbar doesn’t keep a data structure of all the classes/methods in your file. It only knows the name of the current class/method, and then only populates the data in those dropdowns when you click on them.

THEN, with the improved dropdown in place, I made the population very fast, using some very different techniques than I would have used if I had just been trying to make the old TComboBoxes repopulate faster. (I also wrote some major search improvements along the way, since I was rewriting the control anyway).

The final performance improvement with the navigation toolbar changes was HUGE. For many cases, the navigation toolbar now refreshes so fast that its new performance can’t even be measured as a fraction of the old time (that is, what used to take 400ms now takes <1ms). That’s at a big improvement, right?

Final Thoughts

Performance optimization can be both frustrating and fun. Next time you’re tasked with an optimization project, remember a few key things:

  • There is important value in both qualitative and quantitative data. Quantitative data lets you measure and improve, but it’s the qualitative that makes users happy.
  • When it comes to performance optimization, never assume that the bottleneck is in any particular place. I was surprised over and over again when I found the real issues. I thought I would spend most of my time improving the parser, and that turned out not to be an issue at all. Meanwhile, the biggest issue was the navigation toolbar, which I didn’t even consider as a possible culprit when I first started doing this.
  • Sometimes the performance problem isn’t because your code is slow, but because your architecture is less than ideal in the first place (for example, the navigation toolbar repopulating with data that will never be seen).

Since I started this off with the qualitative comments of Castalia users, I’ll finish with a quote from an email I got just a few days ago, from a long-time Castalia user who said….

“The new version is even ‘Snappier’!!!! Great job!” – Raymond S., Castalia user since 2008

Come try the new, faster-than-ever-before Castalia at twodesk.com/castalia.

 

Read full story · Comments { 5 }

Castalia 2013.2.1 is now available

I’ve just released Castalia 2013.2.1, a bugfix update to my Smart Code Editor for Delphi Programmers.

This is a bugfix update that addresses two issues:

  • Fixed: Access violations when switching files under certain rare circumstances
  • Fixed: Navigation toolbar gets confused when switching between files with the same name at different paths.

Castalia users with current maintenance subscriptions can download Castalia 2013.2.1 right now at https://subscribe.twodesk.com.

Everyone else can try a free trial at http://www.twodesk.com/castalia/freetrial.html.

Read full story · Comments { 0 }

Castalia 2013.2 is available: Delphi XE4 Support, Huge performance improvements, and more

I’ve just released version 2013.2 of Castalia, my smart editor enhancement for Delphi programmers.

The biggest new item in Castalia 2013.2 is support for Delphi XE4. Castalia now supports all versions of Delphi from Delphi 5 to Delphi XE4 (Except Delphi 8).

Here’s a list of other major improvements:

  • New: Support for Delphi XE4
  • Improved: Navigation toolbar dropdowns have much improved interactive search
  • Improved: HUGE performance improvements in typing and scrolling the code editor
  • Improved: Memory usage footprint reduced
  • Fixed: “Invalid window handle” when closing Delphi in some circumstances
  • Fixed: “Invalid pointer operation” when closing Delphi in some circumstances
  • Fixed: “Invalid pointer operation” when switching files or projects in some circumstances
  • Fixed: Structural highlighting can show the wrong lines in C++
  • …and many other small tweaks and fixes

Castalia users with a current maintenance subscription can download Castalia 2013.2 at https://subscribe.twodesk.com

Castalia users with an expired maintenance subscription can renew at https://subscribe.twodesk.com

Everyone else can try a free 30-day trial of Castalia 2013.2 at http://www.twodesk.com/castalia/freetrial.html

Read full story · Comments { 1 }

Castalia 2013.1.4 is now Available

I’m excited to announce a new release of my Smart Code Editor for Delphi Programmers: Castalia 2013.1.4.

This release focuses on improving performance in a few crucial areas, including:

  • Performance improvements when typing in the code editor
  • Improved memory footprint
  • Fixed: “Invalid Window Handle” error when closing Delphi under some circumstances.

Current customers can download Castalia 2013.1.4 right now at https://subscribe.twodesk.com.

Everyone else can try a free trial of Castalia for Delphi at http://www.twodesk.com/castalia/freetrial.html.

Read full story · Comments { 1 }

For programmers, by a programmer

Hi. My name is Jacob, and I'm the creator of Castalia.

I starting programming in 1986, learning Lightspeed Pascal on a Mac Classic. Today, I'm a professional programmer, teacher, and entrepreneur.

I have a Master's Degree in Computer Science, and I still love Pascal and Delphi.

I believe that writing code is the heart and soul of software development, and I love helping programmers write code more effectively.