Intel’s VTune has a long-standing reputation as one of the better tools for application analysis, at least for applications headed for Intel-based systems. I hadn’t touched it since version 3.5, and I was more than curious to see what improvements had been made in version 6. I wasn’t disappointed.
I had actually used the Xbox version of VTune only a few months earlier, and the install process had been more than a little painful, so I was a bit nervous. However, apart from a few redundant reboots and a small problem recognizing the installed version of Flash, everything went smoothly.
I created a quick wizard project, turned on call graphing, and off it ran. After 20 seconds, the project stopped, reran, and then did it again. Was this a bug? On reading the tutorial further, I discovered that the first pass was a calibration pass, the second pass was the actual samples, and the final pass was call graphing. Simple enough, but why did it all only take 20 seconds? It seems that is the default execution time; this might not work well if that time is spent loading, though fortunately you can modify activities and increase this time. Even without call graphing you’ll need the calibration pass, so I manually aborted that after a minute or two, and then left the main sampling pass for a few minutes. There is also a simple option for the project to run an application with sampling paused and then have the user manually resume, so you can add your own hooks in your code to generate resume/pause messages using a vtuneperformance.dll. This came in very handy for isolating samples to specific areas (for example, if the frame rate drops, call a resume function to start logging samples).
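To make the resume/pause idea concrete, here is a minimal sketch of a frame-rate trigger. It assumes nothing about VTune's actual exports: the `resume` and `pause` callbacks are hypothetical stand-ins for whatever functions the profiler's DLL provides, and `SamplingGate` and its 33 ms threshold are names and values invented for illustration.

```cpp
#include <functional>
#include <utility>

// Hypothetical wrapper: the two callbacks stand in for whatever resume/pause
// entry points the profiler's DLL exports; they are not VTune's real API.
class SamplingGate {
public:
    SamplingGate(std::function<void()> resume, std::function<void()> pause,
                 double slowFrameMs)
        : resume_(std::move(resume)), pause_(std::move(pause)),
          slowFrameMs_(slowFrameMs) {}

    // Call once per frame with the measured frame time in milliseconds.
    void onFrame(double frameMs) {
        const bool slow = frameMs > slowFrameMs_;
        if (slow && !sampling_) {
            resume_();          // frame rate dropped: start collecting samples
            sampling_ = true;
        } else if (!slow && sampling_) {
            pause_();           // frame rate recovered: stop collecting
            sampling_ = false;
        }
    }

    bool sampling() const { return sampling_; }

private:
    std::function<void()> resume_;
    std::function<void()> pause_;
    double slowFrameMs_;
    bool sampling_ = false;
};
```

In a game loop you would construct one gate, for example `SamplingGate gate(resumeFn, pauseFn, 33.0);`, and call `gate.onFrame(lastFrameMs)` every frame. Tracking the current state in `sampling_` means the profiler is only signaled on transitions, not on every slow frame.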
Figure: VTune 6 Call Graph.
I made numerous attempts with sampling, trying to get a good representation, and it didn’t take long for me to remove the call graphing. I think this is a feature to turn on only in trouble spots, as it’s very intrusive in execution timing and slows down a game to the point where it’s not really useful unless you know what you’re looking for — or if you’re smart and have set up some prerecorded joystick presses that can walk through a game perfectly every time (I’m not that smart).
The results I got without the call graph were great starting points. I recorded about five minutes of samples and counter events. After sampling, the data is displayed in graph form, covering CPU percent time, privileged CPU percent, page misses, thread queuing, and more. Intel has supplied a lot of mechanisms to view this data, including graphing as splines, blocks, solid, or wire form, and it’s all very customizable. I chose a spline form, though I recommend playing with the display a bit, as it does affect how you perceive the data. Using another icon, I selected a time range in order to investigate a peculiar spike in processor-privileged time. I highlighted a small, one-second range, hit the drill-down icon, and was rewarded with a much more detailed breakdown. At this point I had numerous modules (DLLs and also the exe) I wished to look at, but I couldn’t merge all the modules together. It’s a minor annoyance, but one I can live with until version 7.
Now it was a simple case of double-clicking on the desired module to bring up a detailed source breakdown, though it did ask me to specify the .dsp (project file or makefile). This led to my second problem: the project I was debugging has numerous DLLs as well as the main exe, but the system only wanted to accept a single project file. I wanted it to ask for my Visual C++ workspace (the .dsw), as it is an extremely complex source base. The good news for Java and other non-C++ people is that VTune does allow you to specify a multitude of project types. I was very happy to see Java, .NET, and even FORTRAN supported (does anyone really still use that?).
By this point I had screamed a few times in my head, and once out loud, as my eyes gravitated to the top 10 functions that stall out the execution. Intel’s terms are CPU Clockticks (non-sleep) and Instructions Retired (sounds like a CIA euphemism for a shagged compiler). One particular surprise VTune found was a bit of code that looked harmless enough, but when I viewed the source with disassembly, it showed the NEG assembler instruction kicking the function’s teeth in and taking a bite out of its performance, and doing so twice. After a quick fix, the function dropped off the top-10 list completely.
I tried another of my top-10 items, and rather than trying to find a solution myself, I just right-clicked on the source and selected “VTune Assistant, This Function” (you can also do this to a selection or an entire file). Now this is where I started to get impressed. The assistant returned about 20 occurrences of problems with nice little light bulbs at the lines concerned. It also offered a light bulb at the end of the function, with suggestions on a general problem and solution. What I was very pleased to see was the comment “Logical AND/OR statement conditional,” which offered me a very informative description of what it believed could be done.
VTune’s assistant is one I would gladly hire, or at least get writing PS2 code. It also caught the loop-invariant case, where you resolve a pointer to a pointer within a loop, or write a for statement whose count is read through a pointer to a class on every pass. Nasty stuff. But my favorite feature, and the one I am very interested to do more with, is vectorization, which crops up a lot with virtuals and templates. If you’re a progressive C++ engineer who likes to template array handling, then VTune’s going to ring your bell, because one of its recommended optimizations is to move to the new version of the Intel C/C++ compiler, which supports a vectorization pass, meaning that the compiler will use SSE instructions for some loop operations. I saw this compiler at GDC this year, and it was very impressive (at least on the strength of this feature). Beyond that, the assistant recommends restructuring the code to allow for better vectorizing.
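For readers who haven’t met the loop-invariant problem, here is a small sketch of the kind of pattern I mean and one way to fix it by hand. The `Mesh` and `Scene` types and both function names are hypothetical, made up for this illustration; this is not the assistant’s output, and a good optimizing compiler may well do the hoisting for you when it can prove nothing in the loop changes the pointers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Mesh {
    std::vector<float> heights;
    std::size_t count() const { return heights.size(); }
};

struct Scene {
    Mesh* mesh;
    float scale;
};

// Before: every iteration re-resolves (*scene)->mesh and calls count(),
// even though neither changes inside the loop.
float sumScaledNaive(Scene** scene) {
    float total = 0.0f;
    for (std::size_t i = 0; i < (*scene)->mesh->count(); ++i) {
        total += (*scene)->mesh->heights[i] * (*scene)->scale;
    }
    return total;
}

// After: the invariant lookups are hoisted into locals before the loop.
float sumScaledHoisted(Scene** scene) {
    const Scene& s = **scene;
    const float* heights = s.mesh->heights.data();
    const std::size_t n = s.mesh->count();
    const float scale = s.scale;
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += heights[i] * scale;
    }
    return total;
}
```

Both functions return the same result. The hoisted version leaves a loop body that touches only a flat array and two locals, which is also the shape a vectorizing compiler is more likely to handle.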
I’m generally very happy with the assistant, though I thought it lacked a single critical feature, which really got to be frustrating. The assistant is in its own window, and it displays the line number, but I can’t click on the line number and have the window scroll to the correct line of source, so I had to scroll down manually to the line number. This became annoying, especially when I started to do class analysis over function analysis and wanted large blocks of code to be analyzed by the assistant so that I could just scroll through the trivial fixes. Another problem was I could not jump to code in the editor. For those who have used SN Systems’ debugger, it has a hot key, Ctrl+E, that jumps to code in Visual C++. I sorely missed this when VTuning my data.
All in all, I thought VTune 6 was extremely good, very stable, and the tutorial was insightful and valuable, making my life significantly easier. The in-depth help, and the fact that hitting F1 on any item brought up the correct context help menu, was invaluable. I had only one crash during a marathon six-hour session. Even when I did crash, when I got back up and running, my project was intact, and it was just a minor inconvenience. VTune 6 is a must for developers, and you don’t need to be an assembler wiz to use it (though it does help).
Software: Microsoft Windows 98 (SE), Windows ME, Windows NT 4.0 with Service Pack 4 or later, Windows 2000 Build 2195 or later, or Windows XP Build 2475 or later. Microsoft Internet Explorer 5.0 or later (5.5 or newer recommended).