Featured Blog | This community-written post highlights the best of what the game industry has to offer. Read more like it on the Game Developer Blogs or learn how to Submit Your Own Blog Post
How to debug
In this article I share my thoughts on how to think about bugs, how to get rid of them, and how to get better at it.
A bug is when a program behaves in an unexpected way.
Bugs ruin games for our players, and a lot of our processes are designed to decrease the number of bugs that end up in the final product.
One part of that is fixing bugs – an important programming skill. In this article. I share my thoughts on how to think about bugs, how to get rid of them, and how to get better at it.
The main points I want to get across are:
Bugs do not only hurt players, they hurt the game development process itself.
Debugging is a scientific process.
Finding the cause of the bug requires us to move down one branch of a tree of all possible causes.
Considering only one cause at a time is ineffective.
Finding the cause of a bug and changing code to fix it are separate processes.
Some bugs have common patterns, and we can use generic strategies to deal with them.
We can improve over time by doing debugging post-mortems and keeping bug diaries.
What causes bugs
Causal chains
Consider this bug:
🐛When I hit play, I expect to see my avatar, but instead I don't.
We would be surprised if a game with this bug had code like the following:
//Hide the player when the level loads
OnLevelLoad += {() => {HidePlayer(); };
More likely, a less direct problem in the code leads to the behavior we see. We call the observed behavior the symptom.
To find the cause, we must find out why it happens. And then why that happens, and so on. This list is the causal chain that leads to the bug. The causal chain for the bug above may look like this:
(Symptom) 🐛 When the level loads, I expect to see my avatar, but instead I don't. Why?
(Direct cause) Because the code that moves the player calculates the new position as zero, and we cannot see the origin in the scene. Why?
Because the player update method exists early, because the inventory is null. Why?
(Root cause) Because it has not been initialized. Why?
(External cause) Because of sloppiness – I forgot.
The first cause outside the software that we can change, is the external cause, and is where we can stop our chain. The external cause of any bug is always one of the following:
Sloppiness (which is really a process bug).
Misconceptions (which is really a knowledge bug).
A software or hardware bug that we cannot fix.
Just above the external cause is the root cause. The root cause is the best place to implement an intervention.
All the causes above the root cause are surface causes, and the first surface cause in the chain is the direct cause of the bug. When we cannot fix a bug at the root cause, we can implement a workaround or partial fix at one of the surface causes.
The tree of possible causes
When we are confronted with a bug, we do not know the causal chain of the bug. Instead, we have a tree of possibilities. A partial tree for the bug above, may look like this:
To find the cause of a bug, we must move down this tree to the root cause. At each level, we do experiments to rule out all but one of the possibilities. We then expand this node by listing all its possible causes, and repeat the process. Eventually we arrive at possibilities that are all external. The cause just above this is the root cause.
The debugging process
Computers don't use magic.
Computers follow deterministic rules, and when computers fail, there is a reason, and by systematically working our way through the possibilities you can always find the reason.
Some steps need a dash of creativity, but effective debugging is a systematic process. You need to understand cause and effect, make analytical inferences from the knowledge at your disposal, organize tasks in an efficient order, track your progress, and record findings for future usage.
The separation of diagnosis and intervention
Removing bugs revolves around two principle processes:
Diagnosis: The process of finding the cause of a bug.
Intervention: The process of changing the code to remove the bug.
(The term "intervention" may be a bit odd; I use it because all words like "fixing bugs", "removing bugs", etc. refer to both aspects.)
There are two important reasons to keep these processes separate in your mind.
You can quickly do experiments to rule out bug causes without trying to fix the bug.
When deciding on the best order of performing experiments, you may make mistakes if you include estimates of interventions in time estimates.
Be precise when formulating a bug description
Standard bug report format. Vague reports of bugs introduce an unnecessary extra step into the debugging process: figuring out what the bug actually is. You and your team can avoid this by using precise language when talking about bugs.
A well-crafted bug report has three components:
The steps used to reproduce it (including relevant platform information).
What you expected to happen following the steps.
What happened instead.
In a sentence:
🐛When I did X, I expected Y, but instead got Z.
For example:
🐛When I collected a new sword, I expected to find it in my inventory, but instead my inventory was unchanged.
Use this format, even when informally reporting bugs, or when formulating the bug to yourself. This format
helps you remember all the information necessary to reproduce the bug;
allows you to spot bug patterns (explained below) and think about bugs on a "high level"; and
reminds you that you have an expectancy, and your expectancy may be faulty instead of the code.
Some types of behavior are harder to frame in this form, for example bugs that are sporadic, are seen at unpredictable times, or involve frequencies. Here are examples of dealing with this:
🐛When jumping, I expect to come down again, but sometimes (3/10) I keep rising and I must exit the level to continue.
🐛When I start the second level, I expect to continue playing, but instead the game exits at a random point before the end.
🐛When I run the game 10 times, I expect each of the two monsters to appear about 5 times, but instead the green monster appears 8 times.
Standard tests. You can also gain from using formal, named test cases, which allows efficient communication of what breaks where. For example, if you have a flow chart of the player login cases, with labeled nodes, you can say:
🐛When going through the login process through node A, I expected to reach node B but instead an error message displayed with the message "Already logged in".
Vague expressions. Vague words, like these below, confuse and waste time. Avoid them, and train testers to avoid them too.
"Crashes" (Premature normal exit? Or exit with error dialog?)
"Hangs" or "Freezes" (All animations stop? Or no further actions possible?)
"Is slow" (Framerate? Latency?)
"Does not work" (???)
The 8-step process
To fix a bug we can follow this process:
Reproduce the bug.
List possible direct causes.
Design experiments to rule out causes.
Prioritize experiments.
Do experiments until the cause is found.
If you are not at the root cause, drill down a level deeper. (Repeat from step 1, taking the last cause you found as the new bug).
Implement your intervention.
Clean up and verify.
At any point in this process, you may decide that continuing would be too expensive, in which case you "give up" on trying to fix the bug, and instead implement a workaround, or leave it as a "known issue".
Step 1: Reproduce the bug
The first step towards removing a bug is to follow the reproduction steps and verify that you see the bug. Reproducing a bug is important:
To make sure the bug is real, especially when dealing with subjective issues.
To make sure no other bugs lead up to the bug in question. (A common scenario is an error reported in the console that leads to a bug later in the flow.) Remember, each visible bug is part of a causal chain; if there are other bugs that may be lower on than chain you want to deal with them first.
To have a clear way to test whether the bug is fixed.
This step also allows you to gather more information, and to see if you can simplify the reproduction steps.
Step 2: List possible direct causes
List multiple possibilities. Once you managed to produce a bug, you may already have a suspicion of what causes it and may be tempted to try to fix it at once.
Don't give in to this urge. Instead come up with at least a few possible causes first:
It lets you overcome certain biases that can make you overlook the most probable causes. Interesting causes and causes you recently met, for example, are often easier to think of.
It allows you to optimize the order you do experiments in.
It prevents you from becoming invested in proving your suspicion, which could set you up for subconsciously ignoring information or spending too much time on uninformative tests.
When you cannot list causes. At times, you may not be able to think of any possible causes, perhaps after you ruled out all causes on the original list.
This happens when you lack information, or knowledge, or creativity. You can deal with this by:
Searching for symptoms online.
Questioning other developers.
Coming up with extreme and absurd causes to spark your creativity.
Researching the system that you are dealing with.
Doing experiments that reveal how the system works.
The last two strategies take time, and occasionally you may consider the bug "too expensive to diagnose".
Direct versus deeper causes. When listing possible causes for a bug, you need to list direct causes. Although we want to expose the root cause, if you start guessing at root causes you will have too many possibilities to test.
For example, not initializing a variable is a common root cause, but it could be the cause of any bug.
Start with direct causes – in Step 6 you go to the next level.
(Shortcuts are possible, but unnecessary.)
Step 3: Design experiments to rule out causes
An experiment is some action that gives you information about the source of the bug. Examples of typical experiments are:
Inspecting code or configuration.
Logging or inspecting a variable.
Seeing if a breakpoint is reached when running the game.
Running the game with code commented out, altered, or added.
Running the game with a different configuration.
Good experiments are tests that rule out possible causes.
How do we rule out division by zero? We clamp the divisor away from zero and see if the bug is still there.
How do we rule out bad data? We replace it with known good data.
How do we rule out a piece of code is not being reached? We put a breakpoint and see if it is hit.
How do we check whether a calculation is wrong? We replace it with a hand-calculated result.
How do we rule out that we forgot to hook up a handler? We check.
Step 4: Order experiments to reduce the expected time to diagnose a bug
Don't do experiments in the order that they occur to you. Instead, assign rough time estimates to experiments, and likelihoods to the causes they rule out. Then put experiments first that are quicker to do, and rule out likelier causes.
Don't take the time of interventions into account when you decide on the best order. In the end, you will implement only one intervention, and no matter what order you follow, the time it will take will be the same. Therefore, it should also not affect the order.
Step 5: Do the Experiments
This step is straightforward: do the experiments in the order you decided in the previous step, and stop once you have the culprit cause.
If you make changes to the source code, use source control to help keep track of your experiments. Document what you are doing, make careful comments, and note the results of your experiments.
Your progress documentation:
prevents you from getting confused about where you are in the process,
can be used to ask questions to other developers on your team and on answer sites,
will form the basis of a bug diary if you keep one (described below), and
will be useful if you do a debugging-postmortem.
Step 6: Drill down to the root cause
If we are at the root cause, we can continue to the next step.
Otherwise, the cause we uncovered in the previous step is treated as the new bug. We reformulate the bug in the When X I expect Y but Z-form, and continue to Step 1. Reproducing the new bug is a sanity check to make sure you are on track.
How do we know we are at the root cause? If we ask "why", and the answer is:
Because of sloppiness.
Because of a misconception.
Because of a bug in external hardware or software.
If we do not know, we are not at the root cause yet.
Step 7: Implement an intervention
Once you are sure what causes the bug, you can fix it. Usually, you should implement the intervention to take care of the root cause.
Consider the example bug we examined before, with this causal chain:
🐛When the level loads, I expect to see my avatar, but instead I don't.
Because the code that moves the player calculates the new position as zero and we cannot see the origin.
Because the inventory is null and the player update method prematurely exists.
Because it has not been initialized.
Because I forgot.
This bug can be "fixed" in any position in the chain.
We can recalculate the position if it is zero.
We can change the update method to only skip over the relevant part where we work with the inventory.
We can initialize the inventory when the player is initialized.
The fixes will all "work", but they are not equal.
Fixing bugs at the root cause is usually the safest thing to do; if we don’t we invite other bugs into the system. If you do not initialize the inventory, you will need to skip over operations with it all over the program.
At times, fixing the root cause may be too expensive (for example, if you have to redesign the architecture of your game to ensure a predictable initialization order). When this is the case, consider implementing a workaround at a surface.
Step 8: Clean up and verify
The last step is to undo any changes you made during the diagnosis proses, and to verify that the bug stays fixed afterwards.
Some of your diagnostics may be useful to keep for the future:
Leave assertions (if they hold in general circumstances and not just those the bug was diagnosed in).
Move temporary code that visualize data to a library where you can re-use it if the occasion arises.
Logging statements are only useful when they are consistently applied to a part of your program, and generally should be removed. Occasionally, they are generic enough to be useful (for example, logging all server calls and responses).
Good debugging practices
Don't tolerate bugs during development
Building zero-defect software is very expensive, and a counter-productive strategy for games.
But bugs are not only bad for the players if they land in the end-product; bugs – even small bugs – that live in the code-base during development are problematic.
Keep the following in mind in deciding where to strike the balance between "too expensive to fix" and "too expensive to keep around".
Bugs annoy and slow down developers and testers. A system that does not work as intended (up to that point) prevents everyone on the team to do their work effectively.
Bug diagnosis cannot reliably be scheduled. The process of finding a bug is a search through possibilities that stops once the cause is found; the time this takes is unknown. The more bugs in your game, the less sure you can be of the shipment date.
Bugs can hide the existence of other bugs. For example, if a bug prevents a block of code from executing, you don't know if another bug hides in that block.
A bug signals a problem that can cause other bugs in the future, or that causes other bugs not yet seen. A problem that appears small now can grow as the root cause manifests itself in other ways. It can also manifest itself through code that other developers write, and will waste their time as they try to track down the problem.
The older a bug is, the harder it is to diagnose and remove. A bug introduced today is easy to fix – you still remember how the code works, the amount of code where it can hide is small, and fresher bugs don't obscure the results of experiments.
Bugs have a negative psychological impact on the team. Programmers will be less careful to introduce new bugs in an already buggy system, or will dread having to make changes in problematic areas. A buggy system is not good for moral.
Recognize bug patterns and use generic strategies to deal with them
A few types of bugs crop up regularly, and these bug patterns have similar debugging strategies. Here are some examples:
🐛When X, I expect Y but instead got nothing.
🐛When X, I expect Y, but instead Z happens that prevents Y.
🐛When X, I expect an object to still be in state A, but instead it changes to B.
🐛When X, I expect an object to change from state A to B, but instead it changes to state C instead.
🐛When X, I expect to see A, but I don't.
Here is an example of an instance of the first of these:
🐛When I press the "Play" button, I expect the game level to load but instead nothing happens.
We can use generic strategies to diagnose these bugs:
Find the place in the code where Y is implemented. Between X and Y, you will find code that is (abstractly) of the form if (X && C1 && C2 && ...) Y. Do experiments to find out which of the conditions (X, C1, C2) are failing.
Between X and Y, you will have a sequence of statements (abstractly) S1, S2, ... The strategy is to find the first S where Z happens.
Find instances where the state is changed. If there are only a few instances of this, do experiments to find out which one is the culprit, and drill down from there. If there are many instances, wrap the state in a variable that reports when it changes to the unwanted state. Use this to find the faulty caller, and drill down from there.
There are two possibilities: the state is miscalculated, or it is calculated correctly but the change is applied the wrong number of times. Do an experiment to distinguish between these cases.
When you cannot see something, there are only a few possible causes:
It is not there
It is there but in the wrong place
It is in the right place but it is
inactive
transparent, clipped, or culled
too big (you are inside the object) or too small
(in the case of planes), facing the wrong way
Run the game and look for the object using your editor tools. If you don't find it, and it was there from the beginning, you have a case 3 pattern; if it was not there from the beginning, you have a case 1 pattern. If you do find it, a few quick experiments through inspection will tell you which of the possibilities apply. Again, if the state was like that from the beginning you have a case 1 pattern, otherwise a case 3 pattern.
I listed 5 common patterns above, but you will spot other patterns in your code base. When you do, develop and document a strategy to debug these very quickly.
When tracking a pattern bug, and your strategy fails, remember to update the strategy. Patterns with strategies that become too complicated as they are updated after several failures seize to be useful patterns, and need to be pruned from your collection.
How to become a better debugger
Practice
There is no shortage of bugs to fix, and over the span of a career you will get plenty of opportunities to practice.
If you are a new programmer with some time to spend on technical self-improvement, it may be worth doing a few weeks of intense debugging to experiment with processes and techniques. Delve into a code base, or answer questions on GameDev StackExchange.
Another way to practice is through an exercise we have been using in our studio.
Ask another programmer for a juicy bug he or she has recently fixed.
List possible causes, and arrange them in the optimal order.
Now ask the other programmer to tell you what the result of each experiment would be, until you isolated a cause.
Drill down to the root. Can you diagnose the bug? How long does it take you?
Do bug-finding post mortems
The purpose of a bug-finding postmortem is to find ways to improve bug diagnosis, and it is done for bugs that took an unexpected long time to diagnose.
During a debugging post-mortem, ask questions like these:
Did communication about the bug cause delays?
Was the list of causes comprehensive at each level of the cause tree?
What misconceptions caused unnecessary delays in tracking down the cause?
Where the time estimates for experiments correct?
What experiments could be improved or replaced with better ones?
(Photo: Barta IV)
Keep a bug diary
A bug diary is a list of bugs, how they were diagnosed, how long it took, and what intervention was implemented.
This is occasionally useful as a reference when dealing with a bug you have dealt with before, making it much faster to diagnose and fix.
A history of bugs allows you to spot recurring bugs. Recurring bugs in principle can always be prevented (through coding practices, processes, tools, libraries). For example, if you constantly swop geocoordinates, you can use one of these ways to prevent it:
Coding practice: Better naming conventions that makes these mistakes more obvious.
Process: Tests that are designed to flush out these types of mistakes, for example, using recognizable locations.
Tools: Tools for visualizing locations on a map that makes it easier to spot discrepancies.
Library: A data structure that handles all common operations so that we don't need to use components individually, perhaps supported by UI components. The idea is to solve the problem once, and stash it in a library for repeated use.
Over time, your diary will allow you to improve estimates for the probabilities of certain bug-causes, and the times to perform more complex experiments.
Finally, your diary can also help you spot deeper weaknesses. Looking at the root causes and human causes of bugs point to where we can improve architecture, our programming practices, and our understanding of the system we are working with or trying to model.
For example, at Gamelogic we had a swarm of bugs caused by programmers not understanding how other programmers hooked up and implemented their complex GUI components. Eventually, we adopted a canonical design which all GUI components must follow. Since then, bugs caused by misunderstanding the setup or implementation of a GUI component has been rare.
Learn how things work
Misconceptions is one of the three external causes of bugs, so knowledge that remove misconceptions reduces the number of bugs you will create in the future.
Learning the math of physics, or the details of the Facebook API, or the darker corners of C#, will not only decrease the number of bugs you introduce in the future, it will also lead to more innovative games, faster development, better testing procedures, and better debugging experiments.
Use your bug diary to find the weak spots in your knowledge.
Learn specialized experimentation techniques and tools
Many areas in game development have specialized tools and techniques for debugging, and knowing these can save you a ton of time. Investigating whether an area you deal with frequently has specialized tools.
Here are some examples of what I mean:
Conclusion
Bugs disrupt the development process, and therefore dealing with bugs effectively is important to streamline development. And because computers are deterministic devices, we can use a scientific process to diagnose bugs in the most efficient way possible. By looking at our process over time, we can improve our process to create fewer bugs and find them faster. We do not have to be slave to our bugs.
Acknowledgements
The bug report format is from Joel Spolsky.
Some of the ideas in this article is my take on those in Effective Debugging: 66 Specific Ways to Debug Software and Systems by Diomidis Spinellis.
This article was inspired by a bug that plagued our small studio for a month, and many ideas I present here was sharpened by the many discussions I had with our programmers: Omar Rojo (the phrase "Computers don't use magic" is his) and Esteban Gaetes.
Read more about:
Featured BlogsAbout the Author
You May Also Like