I’ve been following lots of Bruce Dawson’s posts about xperf recently and have finally taken the time to get my first stats out of an application with xperf and look at the results with Windows Performance Analyzer (WPA). It’s not as hard as it appears and I thought I’d try to write something up so that others can take advantage.
Y u no Very Sleepy?
You might be wondering why I’ve gone down this route after previously discussing my love of Very Sleepy? Recently I’ve been trying to investigate performance in a system with several worker threads, and in Very Sleepy, I’m seeing most of the CPU samples taken while the workers are waiting for work. Since the CPU samples are aggregated for the application, this heavily reduces the relative hit count for every other sample, and it can look like the app is spending most of its time asleep rather than reflecting the actual work that the app is doing. Thanks to the GPL nature of Very Sleepy, I’m able to go in and rework the internals to track and use the thread ID per sample and then update the interface to expose this. However thinking of the work involved to extend Very Sleepy in this way, and based on all of the information passing me by about xperf, I decided to go down the xperf route instead.
The best place to start is Bruce’s basics post. This tells you where to get the Windows Performance Toolkit (WPT) via the Platform SDK Installer. One thing to know is that you can use the Platform SDK Installer to install only the Windows Performance Toolkit and you don’t need to install the entire Platform SDK – take a look at the useful images here to see roughly what the installer looks like.
Once installed, there’s one step from Bruce’s basics you need to take care of: updating a registry key to allow you to capture stacks on 64 bit Windows (and the standard reboot). The command like for the registry update is:
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
You can skip this step, but it pretty much leaves you with minimal useful stack information when looking at your profile information.
Capture the trace
Capturing your trace is simple. Start the xperf tool, do whatever you want to capture, stop the xperf tool.
The first problem you’ll most likely have is that starting and stopping xperf requires Administrator privileges. Not having Administrator privileges means that you might get something like:
xperf: error: NT Kernel Logger: Invalid flags. (0x3ec).
Once you’re at a command prompt with Administrator privileges, I’ve found the following as a good command line for starting a trace:
xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE -stackwalk profile -SetProfInt 1221
Breaking this down:
- -on= Start profiling the following providers:
- PROC_THREAD = Detect the starting and stopping of processes and threads. Without this, you don’t know when your app starts and finishes!
- LOADER= Load and unload events – without this you don’t appear to get module information.
- INTERRUPT = Detects interrupt events and appears to avoid spurious entries in the performance results.
- DPC = Deferred procedure call events – appears recommended, but I didn’t see any major difference.
- PROFILE = enable profiling (d’oh!)
- -SetProfInt 1221 = Maximise the sample rate – this results in sampling rate of approximate 8KHz (1221 * 100ns = 0.1221ms) which is the limit of xperf.
Whatever application you run, it’s worth mentioning that you want to run it for quite a while since that 8KHz sampling rate is at the core level across all processes. You might also want to consider changing your application to run in a micro-benchmark mode if there’s an area you want to focus on.
Stopping the trace is as simple as:
xperf -d NameOfTrace.etl
- -d = Stops and packages up the state of the trace. This means that the trace contains information specific to your computer and a large block of text appears to warn you so. There are less intrusive methods for stopping and writing the trace, but if profiling locally, it’s not a problem.
- NameOfTrace.etl = The name of the output file with the etl suffix already registered to be opened with WPA.
Suggested further reading:
- Bruce’s Xperf Analysis Basics
- Bruce’s Xperf Basics: Recording a Trace
- MSDN Detailed Walkthrough (not totally detailed)
- MSDN Using the Windows Sample Profiler with Xperf
A small interlude – Symbols
WPA will automatically pick up your application’s symbols from the pdb referenced from the exe. However, for Windows symbols, you can tell WPA to pick up the symbols via the _NT_SYMBOL_PATH environment variable either outside (before starting) or inside (via a dialog) – personally I thought the environment variable would be easier in case anything else wants access (or a third alternative could be no symbols, but that complicates navigation and interpretation).
If you’ve previously set up symbols for Visual Studio 2010, then you probably already have a local symbols cache. In that case you could set your _NT_SYMBOL_PATH environment variable as something like:
where C:\Temp\Symbols is the path I put in “Cache symbols in this directory” file finder under Tools->Options->Debugging->Symbols in Visual Studio 2010.
Bruce’s basics has more coverage of symbols and also tells you what to do if you get shouted at by a Symbol Store licensing dialog with something like:
MICROSOFT SOFTWARE LICENSE TERMS
MICROSOFT DEBUGGING SYMBOLS AND EXECUTABLES
Bruce has also recently posted about Xperf Symbol Loading Pitfalls which would be good to read if you have any problems.
Extract the results
Since you’ve been starting xperf from an Admin dialog, it’s very funny that WPA doesn’t like being run with Administrator privileges and will ask you if you’re sure, so starting it there isn’t the greatest idea. Instead, I double-click the etl file to open it. Welcome to WPA!
The first thing you see is a set of graphs. The Graphs menu lets you choose which graphs to show. You can investigate the whole capture without selecting part of the graph but their major use (for me) is to view where the process to analyze spiked in CPU use. To focus on a single process this, you need to:
- Focus on the CPU Sampling by Process graph.
- Click on the Processes drop down then:
- Click the All Processes tick box to deselect all processes
- Scroll down and tick the box for the required process
This should allow you focus on how much CPU your process is using so that you can select areas containing spikes, or leave the graph unselected to see the samples from the whole timeline.
Before jumping straight to the Summary table, right-click and select Load Symbols (… wait a short while …), then right-click again and select Summary Table. Now go make some tea because the symbol loading takes quite a while, even if you think you’ve got an up-to-date cache, it will still need some updating. Once it’s done, you’ll have a fully populated summary table which we’ll need to selectively reduce.
Start by ensuring that the leftmost column is the Process column, and drag it there if not. If the Process column is missing, go to the Columns menu and enable it. Next, scan down and click on the row representing your process to select it. Then right-click and select Filter To -> Selection and this will remove all of the processes you don’t care about. Note that this creates a new window and pushes the old one to the back in case you still need it.
Now the analysis can begin for real!
Review the results
There’s two main cases I want to cover: Bottom-Up and Top-Down. Both are easily investigated based on the ordering of the columns.
To analyze your profile from the bottom up:
- Function – Tells you the most hit function.
- Stack – Gives you the stacks to those functions. Start without this to get the highest level hot spot overview.
- ORANGE BAR – Splitter used as the pivot for the results.
- Weight – The weight of the hit count. This is approximately equal to the number of hits. (Note the decimal point otherwise you’re going to think the sampling rate is much better than it actually is).
For extra flavor add one of the following as the first column:
- Thread – If you want to break the results down by thread before any other factors. Note that your threads will be ordered in order of weight (i.e. most CPU intensive).
- Module – If you want to look at the functions per module – a good way to only focus on the code you can change by looking into your module.
Add the Module and Thread columns as the leftmost two to get a breakdown of the common hits per thread in your module.
To analyze your profile from the top down:
- Stack – Gives you the roots of the stacks. You can drill into these
- ORANGE BAR
Add the Thread and Module columns again if necessary.
For further reading, check out Bruce’s Lost Xperf Documentation–CPU sampling post which contains lots more background information and tells you what all of the columns mean.
One thing I want to check out is the new Windows Performance Analyzer as mentioned by Bruce here. It’s more deeply embedded in the Windows 8 SDK or Assessment and Deployment Kit than the xperf components.
Part of me still wants to go back to Very Sleepy. It still has some advantages over xperf such as:
- The ability to bind samples back to code to localize hotspots beyond functions
- Focusing the sampling on a specific application rather than all applications.
- The callee/caller interface is easier to go both up and down (but conversely harder to track than a visual representation of a stack).
- Access to the source is a great thing to have if you do need to fix anything.
There’s also a lot of emerging anecdotal posts about diagnosing performance issues with xperf using other profiling features:
- Blog about using xperf to diagnose startup performance issues:
- Mozilla Profiling with Xperf
- The Case of the Unexplained High CPU Condition