
How Rare Automates Testing for AI (and More) in Sea of Thieves (Part 4 of 4)

How Rare playtest Sea of Thieves using automated testing, enabling over 100 internal deployments before public launch.

Tommy Thompson, Blogger

June 26, 2019

14 Min Read

AI and Games is a crowdfunded series about research and applications of artificial intelligence in video games.  If you like my work please consider supporting the show over on Patreon for early-access and behind-the-scenes updates.

'The AI of Sea of Thieves' is released in association with the UKIE's '30 Years of Play' programme: celebrating the past, present and future of the UK interactive entertainment industry. Visit their website for links to interviews, videos, podcasts and events.

Over the last three entries of AI and Games I've looked at the various AI systems behind Sea of Thieves, with insights from some of the developers involved at the studio. Making games - like any piece of software - is a monumental task when taken from the earliest concepts all the way to the final finished product. So in this final entry, we're going to look at the test-driven development philosophy Rare adopted for this project: how it enables developers to catch many of the hidden bugs in AI and gameplay systems before they ever reach players, and the internal deployment tools built to enable Sea of Thieves to be updated and released to players quickly.

Game Dev, Bugs and Bug Fixing

But first, I want to talk about how games are typically built. When a video game starts development it's traditionally built around milestones whereby the game exhibits specific features in relation to the core design; this is how it transitions from a design document or idea among a team into a final product. The team moves first towards a first-playable build, where the game is often simple and reduced in scope, then to alpha, where all the mechanics are built but art assets are incomplete. This is followed by beta, where the gameplay mechanics are complete and all the art, audio and animation assets are integrated. While a beta build is functionally complete, it often still has lots of smaller, lower-priority bugs that need to be addressed. The project then migrates from beta to gold master - often referred to as the game 'going gold' - where it's ready to be pressed to disc and hosted on online storefronts for pre-loading and such.

Now games, like any software project, are complicated and challenging to build, and as they increase in scale and complexity it is inevitable that bugs and other issues will begin to manifest. These range from features or functions that don't perform as expected, to performance overheads when playing, to the game simply crashing and breaking down while running. Typically, a lot of the more critical bugs are resolved during development given they're impeding a new feature from working as expected. But given the need to ensure features are complete and performing as expected, a lot of the smaller bugs will be dealt with towards the end of development once the game is feature-complete. However, it's not just the bugs you know exist, but the many little ones you didn't even know were there, and the new bugs you'll create by fixing existing ones.

Given the feature-driven approach as we transition from alpha to beta to gold, these bugs are typically what dominates the closing stages of a game's development, as we polish up the final product to reach the standards that we as developers seek and subsequently release it to customers and clients. It's often why games have update patches on digital storefronts in the opening weeks after launch, given there are still things being found and remedied after the game has gone gold and been sent to be burned onto retail discs. Squashing these bugs in the run-up to launch can be one of the most strenuous periods for any development team, given the deadline is fast approaching and the bug list needs to be nailed down. It's something I've experienced myself, both in software development and when shipping my own games, as the project only really starts to come together in the closing weeks when everything behaves as intended.

However, for Sea of Thieves, the lead engineering team at Rare realised that not only is this approach ill-suited to game development, it's simply unsustainable when working on a live-service game. If your game is being updated with new content all the time, then the traditional 'finish line' for launching the product never materialises; it just keeps moving farther and farther away from you, and that can have an adverse effect on the development team. Plus, as new features are constantly being added to the game, there's a very real chance that all of that hard work squashing bugs for the last release is undone. As such, when starting work on Sea of Thieves, Rare's engineers adopted a process known as 'test-driven development', whereby tests are embedded in the project that check specific features or functions of the game still operate as intended. When new code or features are to be added to the game, the new code needs to pass all tests in the project (including new ones written for that feature) before it can be safely integrated. This process of testing for bugs as new code is committed to the project can reduce the number of issues that secretly arise as it moves through each milestone and ultimately minimise the workload for bug fixing as a game approaches a release deadline. Test-driven development is a recognised practice within the software industry as a whole, but it's not adopted as heavily as it could be, often because of the overheads in setting it up and working within that kind of environment.

To better understand the approach taken by the Sea of Thieves development team, I spoke with two of the more senior developers at Rare I'd met during my stay: first Rob Masella (a senior gameplay engineer) as well as Andy Bastable, the lead gameplay engineer on Sea of Thieves. Each of them was instrumental in adopting this test-driven philosophy and in helping the team embrace the process throughout the entire project. As Andy and Rob explained to me, there was a real desire to bring automated testing to the project, partly because of the challenges faced when testing the Kinect Sports series - where a significant amount of human testing was required to ensure features worked as intended - but also because it made sense for the studio's transition towards building live-service titles.

There was a real drive to try and make the codebase as bulletproof as possible so it could sustain itself in a live-service capacity: so that continued updates and changes as new features were added wouldn't result in old bugs resurfacing again and again, or break established systems due to small errors on a developer's part.

Addressing Bugs

The suite of tests is built into the codebase such that every single time a member of the team attempts to commit their work to the project's source code repository, there are tests ready to check whether the changes this person has proposed will break the game.

Now the vast majority of the tests that run in the project are what are known as unit tests: tests written in C++ that check individual pieces of functionality. Does the skeleton equip the expected item? Do they shoot a player when told to do so? But there are also integration tests, where you need to check that two separate things that pass their own unit tests still work correctly when put together. This allows for more complicated gameplay logic to be tested as well and is often achieved using test maps and blueprints in Unreal Engine. This extends to the behaviour trees too, with special integration tests built to ensure the AI is switching between behaviours or executing action sequences correctly. Each test then returns either a pass or a fail accordingly.
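
To give a flavour of what one of these C++ unit tests might look like, here's a minimal sketch using Unreal Engine's Automation System. The skeleton loadout class, item enum and function names are hypothetical stand-ins for illustration rather than Rare's actual code:

```cpp
#include "Misc/AutomationTest.h"

// Register a unit test with the Automation System under a readable, hierarchical name.
IMPLEMENT_SIMPLE_AUTOMATION_TEST(FSkeletonEquipItemTest,
    "SeaOfThieves.AI.Skeleton.EquipsExpectedItem",
    EAutomationTestFlags::ApplicationContextMask | EAutomationTestFlags::ProductFilter)

bool FSkeletonEquipItemTest::RunTest(const FString& Parameters)
{
    // Hypothetical: build the AI's loadout in isolation and ask it to equip a cutlass.
    USkeletonLoadout* Loadout = NewObject<USkeletonLoadout>();
    Loadout->Equip(ESkeletonItem::Cutlass);

    // The assertion decides whether the test reports a pass or a fail.
    TestTrue(TEXT("Skeleton equips the item it was told to"),
             Loadout->GetEquippedItem() == ESkeletonItem::Cutlass);
    return true;
}
```

The macro registers the test so it shows up in the Automation System's frontend, and the boolean checks inside RunTest determine whether it passes or fails.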

So let's take an example of this in action and how it benefits the game's development. Let's consider a scenario where a skeleton AI sees the player, then loses line of sight of them. The skeletons are designed to subsequently search the last known location of the player in an effort to find them. Now this is a rather simple feature of the AI's behaviour, but there are 101 reasons why it might break: a new feature could be added to the game that breaks the perception code, or a change is made to how skeleton actors manage their memory of targets that accidentally stores the player's location in the wrong variable, or forgets to store it at all. There could even be a new player feature that accidentally makes them invisible to the skeleton. In any case, traditionally this would be a bug that would need to be spotted by a human QA tester - and there are still people working in QA for Sea of Thieves - and then reported back to the developers to fix. Or in the worst case, it could actually make it into the retail build of the game and only be spotted when players report it to the Sea of Thieves community teams. In each instance, it takes longer and longer for the developers to find out the bug is in there and to get around to fixing it. Plus it requires developers to dedicate their time and resources to fixing a bug that shouldn't be there, instead of working on all the new features they're adding to the game.

So the developer fixes the bug, pushes the change, it gets released to players and everything is great. But then the bug reappears a week later because a new feature in the game changes how skeleton perception works. This means that someone has to catch that bug again, report it again, and a developer needs to fix it... again. This is where the testing becomes valuable: it allows developers to focus their efforts on new features for the game, while the QA and community teams can dedicate their time to evaluating the overall game experience and feeding that back to the core design teams.

Now I picked this example because it's an active test in the Sea of Thieves codebase. So when changes are made to the project that break this search behaviour, the development team will know, given the test returns as failed. This allows the developers involved with the feature to go back, re-examine how it happened and amend their changes to remove the new bug that was just created.
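
Rare's actual test isn't public, but here's a rough sketch of how that 'search last known location' check could be expressed with the same Automation System machinery. The test harness and every helper on it are assumptions made purely for illustration:

```cpp
#include "Misc/AutomationTest.h"

IMPLEMENT_SIMPLE_AUTOMATION_TEST(FSkeletonSearchLastKnownLocationTest,
    "SeaOfThieves.AI.Skeleton.SearchesLastKnownPlayerLocation",
    EAutomationTestFlags::ApplicationContextMask | EAutomationTestFlags::ProductFilter)

bool FSkeletonSearchLastKnownLocationTest::RunTest(const FString& Parameters)
{
    // Hypothetical harness: a small test world containing one skeleton and one player pawn.
    FSkeletonTestHarness Harness;
    Harness.SpawnSkeletonAndPlayer();

    // Let the skeleton see the player, record where they were, then break line of sight.
    Harness.GiveSkeletonSightOfPlayer();
    const FVector LastSeen = Harness.PlayerLocation();
    Harness.HidePlayerFromSkeleton();

    // Tick the AI forward and check the behaviour tree has switched to searching
    // the spot where the player was last seen.
    Harness.TickAI(/*Seconds=*/2.0f);
    TestTrue(TEXT("Skeleton switches to its search behaviour"), Harness.SkeletonIsSearching());
    TestEqual(TEXT("Skeleton searches the player's last known location"),
              Harness.SkeletonSearchTarget(), LastSeen);
    return true;
}
```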

How Testing Works

So how does all of this testing work in conjunction with day-to-day development of the game? The testing setup is built as an extension of the existing Automation System built into Unreal Engine, but there's also an external tool built by Rare's engineers to run tests outside of the UE4 editor. But perhaps more interestingly, the tests can be run on built executables as well as remotely in the studio's build systems linked to the project's source control repository. Through use of TeamCity, it can build new instances of the game on an allocated build farm and then run automated tests against the current build of the game.
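
One detail worth illustrating is how a test declares where it's allowed to run. Each automation test is registered with context flags, which is what lets the same suite run inside the editor, on built client and server executables, and remotely on the build farm. A small sketch follows; the ship damage test itself is a hypothetical example, not one of Rare's:

```cpp
#include "Misc/AutomationTest.h"

// The context flags say this test may run in the editor, in a packaged client
// and on a dedicated server; the filter flag groups it with the product tests.
IMPLEMENT_SIMPLE_AUTOMATION_TEST(FShipHullDamageTest,
    "SeaOfThieves.Gameplay.Ship.HullDamage",
    EAutomationTestFlags::EditorContext
        | EAutomationTestFlags::ClientContext
        | EAutomationTestFlags::ServerContext
        | EAutomationTestFlags::ProductFilter)

bool FShipHullDamageTest::RunTest(const FString& Parameters)
{
    // Body omitted for brevity; assertions work exactly as in the earlier sketches.
    return true;
}
```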

When a new change is to be added to the project - and this means any code change, including anything to do with all of the AI systems I've discussed in this series - a developer needs to test it before it gets committed. The developer needs to show in a code review - where they sit down with another developer to check the code is OK to submit - that their changes pass the relevant tests before pushing to the source control repository where the project is stored. But after that, it still needs to successfully pass all of the tests sitting on the project's build server. This is important given a change might not appear to cause any problems on the tests considered relevant for the feature, but it could cause a larger problem somewhere down the line. The server runs all of the smaller tests roughly every twenty minutes, meaning that new bugs that arise can often be caught quickly in the event another test spotted something the developer originally missed. As we visited the barn where the development team was working, we could see notification screens all around us. These show the current state of the project and allow the whole team to quickly identify whether recent commits have broken anything and what actions need to be taken in the event the build has broken. While we were in the studio, all the screens were green, indicating that the build was fine and no issues had arisen. However, had a test failed, the screens would have turned red, meaning no further commits are permitted until the offending change is rolled back for a developer to go and fix it.

Now beyond the tests I've already mentioned, the system runs the larger, more complex tests overnight, including multiplayer and performance tests. Multiplayer integration tests are more complicated given they require numerous players to be active in the same instance of the game, so the largest and most complex integration tests require the system to simulate that multiplayer activity. While this extends to all sorts of gameplay systems, it's highly relevant for the AI given that - as mentioned in part 1 - all of the AI behaviours run on the server. So these multiplayer integration tests run a client over the network to enable simulation of multiplayer behaviour. They rely on virtual server and client processes, whereby the game runs tests server-side to ensure it's running as expected, then shifts to the client to make sure it's all following on smoothly. But on top of this, the system also runs performance tests, network latency tests and platform tests. This is useful in situations where a new change might not fail any tests, but while the build runs fine on PC it causes an unexpected performance overhead on Xbox One. Similarly, a lot of functionality that is designed to run online is being checked on an internal network that doesn't reflect real internet connectivity, so latency tests help identify where AI and other gameplay systems might break should the player have a poor or unstable connection.
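
As a very rough illustration of the shape a performance test can take, the sketch below runs a representative scenario and fails if the average server tick cost blows its budget. The harness, the scenario and the budget figure are all assumptions for illustration and not Rare's actual setup:

```cpp
#include "Misc/AutomationTest.h"

IMPLEMENT_SIMPLE_AUTOMATION_TEST(FServerTickBudgetTest,
    "SeaOfThieves.Performance.ServerTickBudget",
    EAutomationTestFlags::ServerContext | EAutomationTestFlags::PerfFilter)

bool FServerTickBudgetTest::RunTest(const FString& Parameters)
{
    // Hypothetical harness: spawns a ship, a crew and a skeleton encounter, then
    // measures the average server tick time over a minute of simulated play.
    FCrewEncounterHarness Harness;
    const double AverageTickMs = Harness.RunAndMeasureAverageTickMs(/*Seconds=*/60.0);

    // Fail the test if the server tick exceeds its frame-time budget for this scenario.
    const double TickBudgetMs = 33.0; // illustrative budget, not a real target
    TestTrue(TEXT("Server tick stays within budget"), AverageTickMs <= TickBudgetMs);
    return true;
}
```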

Culture in the Studio

Now in the event a change fails any of these tests after being committed to source control, it's removed from the build and sent back to the developer who committed the work, alongside a notification of what failed. As a programmer myself, I wondered what impact that would have on the developers themselves. With these big screens presenting when the build was stable or broken, how do you manage this kind of culture? To me it sounds rather daunting: a programmer's work is sent back to them to fix because it failed a test, and that failure is visible on screens throughout the studio. But as Rob explained, it all comes down to it being a collaborative process, avoiding blame culture, and recognising that all programmers - regardless of their experience - still write bugs in their code.

Despite this huge change in development culture, it hasn't eliminated the need for human input. There are still human players who work in QA and play new builds, and there are also players on the Sea of Thieves Insider Programme, who get to play beta builds with upcoming features. However, what's really interesting is that the focus of these teams is drastically different from before, with the QA team looking at polish and other issues that tests might not capture, while the Insiders are more focussed on giving feedback on the game itself.

Ultimately, these tests ensure that the vast majority of game-breaking bugs never make it out into the wild, and they enable continuous delivery of the game to run as smoothly as possible. They enabled Sea of Thieves to be deployed internally and consistently during development: not just a broken set of isolated levels or demos, but a complete build of the current version of the game, available within the studio for testing over 100 times before public launch, making the final release a lot less daunting.
