This post has been sitting in the backlog for a while, but with the release of version 0.8 of Very Sleepy, now seems like as good a time as any.
Introduction
I can’t remember where I first found Very Sleepy (originally known as Sleepy). I’m guessing that I needed a profiler and thought I’d get results quicker with Google rather than trying to scavenge a VTune license code. Another possibility could be that I had already read Kayamon’s blog, found a post about the profiler and wanted to give it a go.
Getting started
The first thing I wanted to do was run one of our apps through it to get a starting point for optimization. Although the typical model was to profile an already in-flight application, I wanted to launch the application directly, run it for a bit and then stop it and review the results – to be honest, I can’t remember if it allowed you to do this or not. However, I do remember that at the time, Very Sleepy would crash when the profiled app closed, which would have been a kicker … except that a source package is available, so you can build your own copy and fix it. A little compile, link and run and I could reproduce and then fix the crash. This meant I could send a mail to the author with my fix, which he rolled into the next release. To me, that was the best example I had ever seen of the GPL in action. I’ve always understood how the GPL was supposed to work while still understanding the problems of trying to integrate GPL code with commercial projects – it’s good to see it in action! Looking back at the email conversation from that time, that was Very Sleepy 0.5, two and a half years ago.
Getting better and faster
With Very Sleepy 0.6 a couple of months later, there were a couple of minor tweaks I wanted to make. The first was the launch dialog – I wanted to specify a working directory, and since the dialog didn’t offer one, I updated the code to derive the working directory from the application path as a short-term workaround. The other thing I wanted was to profile threads spawned after the application launched, since we create worker threads once we’re running – something slightly more complex than I could achieve at the time, but there’s always the option to dig deeper into that if I want to (or, alternatively, to attach and profile once the app is running).
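The workaround itself is small; a minimal sketch of the idea, using C++17’s std::filesystem as a modern stand-in (not what Very Sleepy actually used at the time, and deriveWorkingDir is a hypothetical name):

```cpp
#include <filesystem>
#include <string>

// Derive a working directory from the full path of the profiled
// executable – a stand-in for the short-term workaround described above.
std::string deriveWorkingDir(const std::string& appPath)
{
    return std::filesystem::path(appPath).parent_path().string();
}
```

In the real tool this value would be fed into CreateProcess() as the launched application’s current directory.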
With Very Sleepy 0.7 came my first attempt to really use the profiler for a long run of one of our apps. I immediately found issues with the interactivity of Very Sleepy when inspecting profiles with a large number of samples. For me, the two main operations once a profile has been captured are:
- Randomly clicking on functions.
- Moving up and down callers and callees based on the percentage of calls between each.
My first step was to profile Very Sleepy using another instance of Very Sleepy – Inception style! Loading and inspecting my uber-profile allowed me to create another profile of the inner workings. The resulting profile was very interesting since a lot of the hits within Very Sleepy outside of wxWidgets were in STL, more specifically * and -> operators for iterators as well as iterator destructors.
Looking at the disassembly, it was obvious that redundant STL calls were not being optimized out by the compiler – it was somewhat scary to see that the compiler hadn’t inlined methods you’d assume it would. I’m assuming one of the reasons for this was not defining the newer magical _SECURE_SCL constants that disable things like Checked Iterators, which are already known to affect performance, and I even started to wonder whether _HAS_ITERATOR_DEBUGGING was affecting anything. The first fix was to dereference the iterator early and then access the object via a reference rather than through the iterator – this removed the * and -> operator calls from the disassembly. The other thing I did was to change
for(iterator it = container.begin();
it != container.end();
++it)
loops into:
for(iterator it = container.begin(), end = container.end();
it != end;
++it)
which avoids constructing and destroying the temporary returned by end() on every iteration. Those two fixes pretty much cleared all of the performance lag when randomly picking functions, making that a fully interactive process. However, double-clicking functions in the caller and callee lists still stalled and required further investigation.
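Putting the two changes together – a minimal sketch using a hypothetical Sample struct, not Very Sleepy’s actual types:

```cpp
#include <vector>

struct Sample { int hits; };  // hypothetical stand-in for a profile sample

// Sum the hit counts the "fixed" way: end() is hoisted out of the loop
// condition, and each element is dereferenced once into a reference
// instead of going through the iterator's * and -> operators repeatedly.
int totalHits(std::vector<Sample>& samples)
{
    int total = 0;
    for (std::vector<Sample>::iterator it = samples.begin(),
                                       end = samples.end();
         it != end; ++it)
    {
        Sample& sample = *it;  // dereference early, then use the reference
        total += sample.hits;
    }
    return total;
}
```

With checked iterators enabled and inlining failing, each avoided operator call and temporary is a real saving in a loop that runs once per sample.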
Digging deeper, I found that the main function view was being fully rebuilt after each selection, when all that was actually required was selecting the row for the chosen function and making sure it was visible. With that change in place, the caller and callee lists immediately updated the main view when double-clicked – success!
And now 0.8
With the recent release of 0.8, I was ready to grab the code, try the new features and integrate my optimizations. The first thing that happened when I tried running my own application was having to hunt down the working directory, so I tweaked the directory selector dialog to start at the directory containing the application – +1 for having the source.
Once it was running and capturing a profile from one of our applications, everything seemed good. However, once the capture was complete and gathered, it crashed! Since I was building my own copy and launching with F5 in the debugger, it broke into the debugger with a failed buffer security check. Walking up the stack, there was an overflow in SymbolInfo::getProcForAddr(): the call to SymFromAddrW() was overwriting the stack. Ironically, I’d recently been reading Raymond Chen’s Old New Thing blog and had only just read What(‘s) a character!, about the issues that arise when moving from chars to WCHARs for Unicode and documentation that intermixes characters and bytes. That led me to the SYMBOL_INFO page, which made me realise the MaxNameLen value was being calculated in bytes and not WCHARs. For reference, you want to change:
symbol_info->MaxNameLen = sizeof(buffer) - sizeof(SYMBOL_INFOW) + 1;
to:
symbol_info->MaxNameLen =
    ((sizeof(buffer) - sizeof(SYMBOL_INFOW)) / sizeof(WCHAR)) + 1;
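The arithmetic is easy to check without the Windows headers. The sketch below uses a deliberately simplified stand-in for SYMBOL_INFOW (the real struct has many more fields) and assumed buffer sizing, just to show why dividing by sizeof(WCHAR) matters:

```cpp
#include <cstddef>

typedef wchar_t WCHAR;  // stand-in for the Windows typedef

// Simplified stand-in for SYMBOL_INFOW: a fixed header followed by a
// variable-length name, where MaxNameLen counts WCHARs, not bytes.
struct SYMBOL_INFOW_Stub
{
    unsigned long SizeOfStruct;
    unsigned long MaxNameLen;
    WCHAR Name[1];  // first character of the variable-length name
};

// Buggy version: counts the spare space in the buffer in bytes, so the
// name is declared up to sizeof(WCHAR) times too long and SymFromAddrW()
// can write past the end of the buffer.
std::size_t buggyMaxNameLen(std::size_t bufferBytes)
{
    return bufferBytes - sizeof(SYMBOL_INFOW_Stub) + 1;
}

// Fixed version: convert the spare bytes into a WCHAR count first.
std::size_t fixedMaxNameLen(std::size_t bufferBytes)
{
    return ((bufferBytes - sizeof(SYMBOL_INFOW_Stub)) / sizeof(WCHAR)) + 1;
}
```

For a buffer sized as sizeof(SYMBOL_INFOW_Stub) plus room for 256 WCHARs, the buggy value overstates the capacity by a factor of sizeof(WCHAR) – exactly the kind of chars-versus-bytes mix-up the Old New Thing post warns about.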
Along with this fix and a few other UI tweaks, I thought it was worth another try at contacting the author to see if the optimizations could be added in a future version – for me they make the difference between a laggy obstruction and an interactive application. Plan B is to fork off a build like some others have done, but I think it’s such a useful app that it’s good to keep it in one place for those who want it.