In-depth: More adventures in failing to crash properly

In this reprinted #altdevblogaday (http://altdevblogaday.com/) in-depth piece, Valve's Bruce Dawson continues to look at how to "create more stable software through crashing vigorously," examining failures to terminate and failures to record a crash dump.

Bruce Dawson, Blogger

July 25, 2012


In last week's episode, we discussed how 32-bit processes on 64-bit Windows might corrupt the exception state after a crash, and how any process on 64-bit Windows might actually continue running after a crash. Serious stuff. This week's installment of "Failing to Fail" is less dramatic, but still important for developers who want robust software, as we cover failures to terminate and failures to record a crash dump. As a special bonus, I also mention how to record crash dumps from all crashing processes on your machine, to make debugging easier than ever before.

What we have here is a failure to terminate

Crashes happen. Any program more complicated than "Hello world" probably has some bugs. One measure of professional software development is how you deal with these crashes. What should happen is that the program saves a crash dump and then terminates itself immediately (TerminateProcess() or _exit(), not ExitProcess() or exit()). What you don't want is for the doomed process to put up a dialog saying "Hey, I'm a doomed process".

Unfortunately, that is what the Visual C++ C Run Time (VC++ CRT) does in some cases. If you accidentally call a pure virtual function (see the sample code for one possible way this can happen) then the handler for this brings up a dialog. If you're a developer, then you can attach a debugger and get a call stack, but most people are not developers. They don't know what a pure virtual function call is, and they don't care. Displaying this dialog just slows down the crash recovery process, while confusing your users.

But it's worse than that. If you have a bevy of exception handlers ready to catch Win32 exceptions (access violations, etc.) then you will be disappointed, because they won't catch pure-call errors, even after someone presses OK. So your in-house crash-dump recording system is helpless against this bug, which means it takes longer to get it fixed. Worse yet, if this error happens on a server (I've seen it happen) then your headless server now has a hung process that is waiting for someone to click OK. Unit tests will time out eventually, and servers may time out if you have a watchdog, but the whole process is delayed by this dialog.

I wouldn't be writing about this unless I had a solution to offer. The dialog is the default behavior, but changing the default is simple enough once you know that you should. All you have to do is call _set_purecall_handler() with a function that intentionally crashes. My preferred implementation does a __debugbreak() followed by TerminateProcess(). If I'm running under the debugger this drops me into it quite neatly, and if I'm not then my unhandled exception filter will catch the exception and write out a minidump. The TerminateProcess() is there to discourage people who catch the exception in the debugger from trying to continue.

See the sample code for a concrete example of setting this up. You can use the menu options to try triggering pure-call errors with and without installing the error handler.
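For readers without the sample code at hand, here is a minimal sketch of the handler described above. It is not the author's sample code: the names PurecallHandler and InstallPurecallHandler are made up for illustration, the exit code 1 is arbitrary, and the _MSC_VER guard is only there so the snippet still compiles with other toolchains, where it does nothing.

```cpp
#include <cstdlib>   // _set_purecall_handler is declared here by the VC++ CRT

#if defined(_MSC_VER)
#include <windows.h> // TerminateProcess, GetCurrentProcess
#include <intrin.h>  // __debugbreak

// Called by the VC++ CRT on a pure virtual call, instead of showing the dialog.
static void PurecallHandler()
{
    // Raises a breakpoint exception. Under a debugger execution stops here;
    // otherwise the unhandled exception filter gets a chance to write a dump.
    __debugbreak();
    // Discourages anyone who caught the breakpoint from trying to continue.
    TerminateProcess(GetCurrentProcess(), 1);
}
#endif

// Hypothetical helper. Returns true if the handler was installed, false when
// built without the VC++ CRT (where _set_purecall_handler does not exist).
bool InstallPurecallHandler()
{
#if defined(_MSC_VER)
    _set_purecall_handler(PurecallHandler);
    return true;
#else
    return false;
#endif
}
```

Calling InstallPurecallHandler() early in startup, before any objects with virtual functions are constructed, is the usual placement. Each module that statically links the CRT would need its own call.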

Invalid parameters aren't technically crashes

The VC++ CRT detects a few types of invalid parameters to CRT functions and treats them as fatal errors. This includes buffer overflow detection if you use the safer CRT functions (and you haven't requested truncation), but the simplest way to trigger these checks is with "printf(NULL);". No dialog pops up (at least not in release builds) and the process is terminated, but it isn't terminated by way of your carefully crafted exception handlers. Windows Error Reporting (WER) will be notified of the problem, which is good, but I want these invalid parameters treated like a crash so that my exception handlers get invoked.

Luckily there is an easy solution for this problem as well. If you call _set_invalid_parameter_handler() then you can give it the same code (just with a different signature) as for your pure-call handler, so that your exception handlers will notice something has gone wrong. And now your programs will be crashier than ever before. Which is a good thing. This technique is also demonstrated in the sample code.
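A sketch of that "same code, different signature" idea might look like the following. Again, this is an illustration rather than the author's sample code: InvalidParameterHandler and InstallInvalidParameterHandler are hypothetical names, and the _MSC_VER guard only keeps the snippet compilable elsewhere.

```cpp
#include <cstdint>   // uintptr_t
#include <cstdlib>   // _set_invalid_parameter_handler (VC++ CRT)

#if defined(_MSC_VER)
#include <windows.h> // TerminateProcess, GetCurrentProcess
#include <intrin.h>  // __debugbreak

// Signature required by the CRT. The string arguments are only filled in
// when a debug CRT is used, so this sketch ignores them.
static void InvalidParameterHandler(const wchar_t* /*expression*/,
                                    const wchar_t* /*function*/,
                                    const wchar_t* /*file*/,
                                    unsigned int /*line*/,
                                    uintptr_t /*reserved*/)
{
    // Same body as a pure-call handler: crash on purpose so the exception
    // handlers notice, then make sure the process cannot carry on.
    __debugbreak();
    TerminateProcess(GetCurrentProcess(), 1);
}
#endif

// Hypothetical helper. Returns true if the handler was installed, false when
// built without the VC++ CRT.
bool InstallInvalidParameterHandler()
{
#if defined(_MSC_VER)
    _set_invalid_parameter_handler(InvalidParameterHandler);
    return true;
#else
    return false;
#endif
}
```

With this installed, a call such as printf(NULL) should end up in the unhandled exception filter instead of terminating the process quietly.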

WER is your friend

Windows Error Reporting (WER) is a handy feature built into Windows. Most developers know that WER records crash dumps on millions of users' machines and stores them, and most developers know that it is possible to get access to the crash dumps for your software. This is a fabulous way of finding out where your software is actually crashing on actual customers' actual machines. There are a few hoops to jump through, but it's worth getting it set up. However, I have no special knowledge of how to arrange such access, so I will say no more.

A lesser known feature of WER is that you can get it to record crashes on your own machines. All you have to do is set a few registry keys. I'm gonna go out on a limb here and say that every C++ developer on Windows should configure this. It's trivially simple and WER will sometimes catch crashes that your other systems do not. WER is great at catching process startup and shutdown crashes and crashes in processes you forgot to add minidump handling to, and it even records minidumps for pure-virtual function calls and invalid CRT parameters. The full documentation is available here. If you spend two minutes configuring this (I have the last 30 crashes saved as full dumps in c:\temp\crashdumps) then you will be better able to investigate crashes on your machine, regardless of what process is crashing.
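The registry values involved live under WER's LocalDumps key (DumpFolder, DumpCount, and DumpType). Most people will set them by hand or with a .reg file, but as a sketch, the setup described above could be applied from code roughly like this. LocalDumpsSettings and ApplyLocalDumpsSettings are hypothetical names, the defaults simply mirror the 30 full dumps in c:\temp\crashdumps mentioned in the text, and writing under HKEY_LOCAL_MACHINE needs administrator rights.

```cpp
#include <string>

#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical settings bundle. The defaults mirror the configuration
// described in the text and are otherwise arbitrary.
struct LocalDumpsSettings {
    std::wstring subkey =
        L"SOFTWARE\\Microsoft\\Windows\\Windows Error Reporting\\LocalDumps";
    std::wstring dumpFolder = L"c:\\temp\\crashdumps";
    unsigned long dumpCount = 30; // how many dumps WER keeps in the folder
    unsigned long dumpType = 2;   // 1 = minidump, 2 = full dump
};

// Writes the settings to the registry. Returns false on failure (for example
// when not running elevated) and always false on non-Windows builds.
bool ApplyLocalDumpsSettings(const LocalDumpsSettings& s)
{
#if defined(_WIN32)
    HKEY key = nullptr;
    // KEY_WOW64_64KEY keeps a 32-bit process from being redirected to the
    // 32-bit registry view.
    if (RegCreateKeyExW(HKEY_LOCAL_MACHINE, s.subkey.c_str(), 0, nullptr,
                        REG_OPTION_NON_VOLATILE,
                        KEY_SET_VALUE | KEY_WOW64_64KEY,
                        nullptr, &key, nullptr) != ERROR_SUCCESS) {
        return false;
    }
    const DWORD count = s.dumpCount;
    const DWORD type = s.dumpType;
    const DWORD folderBytes =
        static_cast<DWORD>((s.dumpFolder.size() + 1) * sizeof(wchar_t));
    const bool ok =
        RegSetValueExW(key, L"DumpFolder", 0, REG_EXPAND_SZ,
                       reinterpret_cast<const BYTE*>(s.dumpFolder.c_str()),
                       folderBytes) == ERROR_SUCCESS &&
        RegSetValueExW(key, L"DumpCount", 0, REG_DWORD,
                       reinterpret_cast<const BYTE*>(&count),
                       sizeof(count)) == ERROR_SUCCESS &&
        RegSetValueExW(key, L"DumpType", 0, REG_DWORD,
                       reinterpret_cast<const BYTE*>(&type),
                       sizeof(type)) == ERROR_SUCCESS;
    RegCloseKey(key);
    return ok;
#else
    (void)s;
    return false;
#endif
}
```

A one-off tool could call ApplyLocalDumpsSettings(LocalDumpsSettings{}) from an elevated prompt. After that, WER should start dropping dumps into the chosen folder for any process that crashes.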

Update – one more missed failure type

Stefan Reinalter pointed out that some libraries handle errors by calling abort(), and this can be another way for a process to fail without your crash handler being called. He also supplied the fix, which is to call signal(SIGABRT, &AbortHandler); to install a handler that will be called if abort() is called. signal() can also be used to install handlers for other types of failures.
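A small sketch of that fix follows. Only signal(SIGABRT, ...) and the name AbortHandler come from the text above. The handler body and the InstallAbortHandler helper are assumptions, and std::_Exit stands in as a portable substitute for the TerminateProcess() call used elsewhere in this article.

```cpp
#include <csignal>
#include <cstdlib>

#if defined(_MSC_VER)
#include <intrin.h>  // __debugbreak
#endif

// Runs when abort() raises SIGABRT.
extern "C" void AbortHandler(int /*signalNumber*/)
{
#if defined(_MSC_VER)
    // Turn the abort into a breakpoint exception so that a debugger or the
    // unhandled exception filter sees it.
    __debugbreak();
#endif
    // Exit immediately without running atexit handlers or static destructors.
    std::_Exit(EXIT_FAILURE);
}

// Hypothetical helper. Returns true if the handler was installed.
bool InstallAbortHandler()
{
    return std::signal(SIGABRT, AbortHandler) != SIG_ERR;
}
```

As with the other handlers, this belongs early in startup, and a module with its own statically linked CRT would need its own call.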

Homework

It's not enough to read about this; you have to actually do a tiny bit of coding and registry work to get things crashing smoothly. Here are your tasks.

Be sure to call _set_purecall_handler, _set_invalid_parameter_handler, and signal. If you use the DLL version of the CRT then once per process is fine. If you use the static-link version of the CRT then you need to call them once for each copy of the CRT, which means once for each DLL that statically links the CRT. The sample code available here should help.
Configure the registry to save crash dumps on all of your machines, by following the simple directions here.
If you haven't already, be sure to follow the instructions in last week's post, including configuring VS to halt on first-chance exceptions, calling EnableCrashingOnCrashes(), and using SetUnhandledExceptionFilter() to catch crashes.
Set up a system for recording and uploading minidumps, using MiniDumpWriteDump or breakpad or the Steamworks APIs.
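For the last task, a bare-bones starting point using SetUnhandledExceptionFilter() and MiniDumpWriteDump might look like the sketch below. The names WriteMinidumpFilter and InstallMinidumpFilter, the file name crash.dmp, and the choice of MiniDumpNormal are all assumptions made for illustration. Uploading is left out entirely, and writing the dump from inside the crashing process is a simplification that tools such as breakpad avoid.

```cpp
#if defined(_MSC_VER)
#include <windows.h>
#include <dbghelp.h>                 // MiniDumpWriteDump
#pragma comment(lib, "dbghelp.lib")  // link against DbgHelp

// Unhandled exception filter that writes a small minidump, then terminates.
static LONG WINAPI WriteMinidumpFilter(EXCEPTION_POINTERS* exceptionInfo)
{
    // Hypothetical output path; a real system would pick a unique name.
    HANDLE file = CreateFileW(L"crash.dmp", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION info = {};
        info.ThreadId = GetCurrentThreadId();
        info.ExceptionPointers = exceptionInfo;
        info.ClientPointers = FALSE;
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &info, nullptr, nullptr);
        CloseHandle(file);
    }
    // Make sure the doomed process really goes away.
    TerminateProcess(GetCurrentProcess(), 1);
    return EXCEPTION_EXECUTE_HANDLER;
}
#endif

// Hypothetical helper. Returns true if the filter was installed, false when
// built with a toolchain other than Visual C++.
bool InstallMinidumpFilter()
{
#if defined(_MSC_VER)
    SetUnhandledExceptionFilter(WriteMinidumpFilter);
    return true;
#else
    return false;
#endif
}
```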

That's it. Good luck with the goal of more stable software through crashing vigorously. [This piece was reprinted from #AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]

About the Author(s)

Bruce Dawson

Blogger

Bruce is the director of technology at Humongous Entertainment, which means he gets to work on all the fun and challenging tasks that nobody else has time to do. He also teaches part-time at DigiPen. Prior to Humongous Entertainment he worked at Cavedog Entertainment, assisting various product teams. Bruce worked for several years at the Elastic Reality branch of Avid Technology, writing special effects software and video editing plug-ins, and before that worked at Electronic Arts Canada, back when it was called Distinctive Software. There he wrote his first computer games, for the Commodore Amiga, back when thirty-two colours was state of the art. Bruce presented a paper at GDC 2001 called "What Happened to my Colours?!?" about the quirks of NTSC. He is currently trying to perfect the ultimate Python script that will automate all of his job duties - so that he can spend more time playing with the console machine he won at a poker game. Bruce lives with his wonderful wife and two exceptional children near Seattle, Washington where he tries to convince his coworkers that all computer programmers should juggle, unicycle and balance on a tight wire.
