[In this reprinted #altdevblogaday-opinion piece, Yager Development's lead programmer Andre Dittrich shares the practices he's picked up over the years in making his studio's builds as stable as possible.]
Build stability is always an important topic for us, but once a game has entered full production, the stability of the game and the tools becomes one of the most important concerns for the tech team. The simple reason is that this is when the number of people relying on the build is highest, and every hour those people spend waiting for a bugfix or a missing tool potentially means a lot of money wasted. So, keeping your build as stable as possible is important.
And now for the bad news: I do not have a "This Solves All Our Problems" recipe. Instead, I want to share some of the measures we have applied in our projects. If you have taken other measures to ensure build stability, please tell me. I am always interested in doing more.
Iteration Time Rules
Having a stable build is very important – yes, but you cannot ruin the iteration time for your team. There will always be that level designer who requests a small feature, a small change, or simply needs a critical bug fix really fast (usually yesterday) to finish the mission for the next milestone. You do not want him to wait a week for that change.
With 10, 20, or even more engineers working on your code base at the same time, chances are high that at any given moment at least one of them has added a bug that makes it impossible to release the next engine version to the team – at least if you do not take measures to help keep the build stable.
The problem of course is that the measures you take cannot add so much overhead that they become a reason for slow iteration times. So, everything you do needs to strike a balance between overhead and improved build stability.
Automated Build Systems
CIS – continuous integration server: you need this! It is bad enough when "real" bugs trouble your build – it is far worse when simple bugs destroy it. Ever come into the office in the morning to find that you cannot compile the game? A typo, a file that was not checked in, a bad merge? How many people lost how much time during that one morning? This is totally avoidable.
The main function of our CIS is to build the engine whenever somebody checks in a change. This makes sure that the engine and tools at least compile. We also run a few quick smoke tests that make sure you can at least start the engine.
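The core of such a setup can be sketched as a sequence of named build steps that stops at the first failure and reports each step's outcome. This is a minimal illustration, not Yager's actual CIS; the step names and the placeholder commands are invented:

```python
import subprocess

def run_pipeline(steps):
    """Run named build steps in order; stop at the first failure.

    `steps` is a list of (name, shell_command) pairs. Returns a report
    of (name, succeeded) tuples, so the team can see at a glance which
    step a check-in broke.
    """
    report = []
    for name, command in steps:
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        ok = result.returncode == 0
        report.append((name, ok))
        if not ok:
            break  # no point running smoke tests on a build that failed
    return report

# Hypothetical pipeline: compile the engine, then a fast smoke test.
# "true" stands in for the real build and engine-launch commands.
steps = [
    ("compile", "true"),
    ("smoke_test", "true"),
]
```

A real server would trigger this on every check-in and notify the author of the offending change when a step fails.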
But you can do even more. During the day, the focus is on getting the engine built as fast as possible and running smoke tests. During the night, we can do a lot more. We run automated tests to gather statistics on memory usage and performance in test levels and game levels. These statistics are made available as graphs on an internal website, and they are an enormous help in recognizing and tracking down sudden jumps in performance or memory usage as well as gradual trends.
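Spotting those sudden jumps in the nightly numbers can itself be automated. A minimal sketch, assuming nightly (label, value) samples and a 10 percent threshold – both the threshold and the sample data are invented for illustration:

```python
def find_jumps(samples, threshold=0.10):
    """Flag nights where a metric (peak memory, frame time, ...)
    rose by more than `threshold` relative to the previous night.

    `samples` is a list of (label, value) pairs in chronological
    order; returns the labels of the offending nights.
    """
    jumps = []
    for (_, prev), (label, value) in zip(samples, samples[1:]):
        if prev > 0 and (value - prev) / prev > threshold:
            jumps.append(label)
    return jumps

# Hypothetical nightly peak-memory readings for one test level, in MB.
nightly = [("Mon", 412.0), ("Tue", 414.5), ("Wed", 468.0), ("Thu", 470.1)]
print(find_jumps(nightly))  # → ['Wed']
```

Feeding a report like this into the nightly e-mail or website means nobody has to eyeball every graph to notice that Wednesday's check-ins cost 50 MB.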
Together with good check-in comments (see below), this lets you catch such regressions before they actually become a problem, or at least recognize them quickly and efficiently (without TAs or programmers spending time finding out why MissionXY is not running any more).
Since I am talking about automated tests, I should talk about unit tests as well. I have some experience with them, though I have to admit that most of it is about how not to do it. We integrated a unit testing framework into the Unreal Engine at the UnrealScript and Kismet level pretty early in the production process. We used it mostly for the AI code, as this was mainly written by us and did not rely too much on middleware code (except pathfinding).
The main mistake we made was that we ended up actually writing integration tests, and maintaining those takes a lot of time. For some time, we even made it part of the process to have "unit tests" for every feature. At some point we were spending more and more time fixing tests that failed because of changes in other systems rather than because of bugs in the tested code – so we stopped.
For our next projects, I want to write actual unit tests for the critical parts of our code. Integration tests should be reserved for finished features that are unlikely to change much, and I guess that means saving them for later in production. If you have experience successfully applying either, I would like to hear about it.
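The distinction matters in practice: a true unit test exercises one piece of logic in complete isolation – no engine, no level, no middleware – so it only fails when the tested code itself changes. A hypothetical example in Python's `unittest` (the armor formula is invented for illustration):

```python
import unittest

def apply_armor(damage, armor):
    """Hypothetical combat helper: armor absorbs a flat amount,
    but a hit always deals at least 1 point of damage."""
    return max(1, damage - armor)

class ApplyArmorTest(unittest.TestCase):
    def test_armor_reduces_damage(self):
        self.assertEqual(apply_armor(10, 3), 7)

    def test_damage_never_drops_below_one(self):
        self.assertEqual(apply_armor(2, 50), 1)

# Run the suite in-process; a CIS would run this on every check-in.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyArmorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Nothing here touches another system, which is exactly what was missing from the "unit tests" described above: those broke whenever neighboring systems changed, not when the tested code did.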
Peer Reviews
This is one of the best tools in our belt for improving build stability. It not only substantially improves build stability, it also fosters communication within the team and distributes knowledge (win – win – win).
The idea is pretty simple: whenever someone wants to check in a change, he needs to get that change reviewed by one of his colleagues. Of course, this only works if it is taken seriously. The goal of a review should be that the reviewer gains a good understanding of what the change is, and how and why it was done.
There are no dumb questions during a review. If you do not understand something while doing a review, ask. This applies especially to seniors or leads, who sometimes feel they should not ask dumb questions. If you think you need someone else's opinion, get it. You may and should criticize style and details. Ask for additional or improved comments if you think they might help. This is not only about making sure the change works; it is about sharing ideas and knowledge as well.
So, what do you get in the end? Reviews will easily spot obvious issues or problems with the approach to the issue at hand. They will rarely spot really intricate bugs or side effects. Still, they remove quite a number of bugs that would otherwise be found later by the automated systems, by QA, or even worse, by somebody trying to use a broken tool.
You also get people learning from each other, and people looking into parts of the system they would not usually see. At least two people know each change in detail, so people getting sick or leaving the company becomes less of an issue.
You get a culture of talking about your work and of making sure work is actually done before the check-in (it is pretty embarrassing when your reviewer discovers obvious flaws in a piece of code you considered worth checking in). People on your team talk, they develop a common language, and they come to understand each other's weaknesses and strengths.
A few things to keep in mind to make peer reviews work:
- it costs time – make sure everybody knows that this is time well spent, and factor it into your estimates
- every check-in is reviewed – a lot of mistakes are made in "easy" or "small" check-ins
- people should be available for a review – nothing is as annoying as not being able to check in just because nobody has time; you should have a damn good reason to refuse a review
- add the name of the reviewer to the check-in comment – reviews will be taken a lot more seriously that way, and if you are hunting a bug caused by a check-in, you know the two people who can help you
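The last point is easy to enforce mechanically, for example with a commit-message hook that rejects check-ins whose comment does not name a reviewer. The `Reviewed-by:` convention below is an assumption, not something from the article; use whatever tag your team agrees on:

```python
import re

# Hypothetical convention: every check-in comment must contain a line
# like "Reviewed-by: <name>".
REVIEWED_BY = re.compile(r"^Reviewed-by:\s*\S+", re.MULTILINE)

def check_commit_message(message):
    """Return True if the check-in comment names a reviewer.

    Meant to be called from a VCS hook so that unreviewed
    check-ins are rejected before they ever reach the build.
    """
    return bool(REVIEWED_BY.search(message))

good = "Fix crash in AI cover selection.\n\nReviewed-by: A. Colleague"
bad = "Fix crash in AI cover selection."
print(check_commit_message(good), check_commit_message(bad))  # True False
```

In Git this would live in a `commit-msg` hook; most other version control systems offer an equivalent pre-submit trigger.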