
Statistically Speaking, It's Probably a Good Game, Part 2: Statistics for Game Designers

Designer Tyler Sigman (Age Of Empires DS) continues his article series by contributing a 'completely serious and academic' (ahem!) look at the usage of statistics in game design and focus testing, in this exclusive Gamasutra cover feature.

Tyler Sigman, Blogger

January 24, 2007


Welcome Back. I’ve Been Waiting…

If you’re reading this, then chances are you also read Part 1, “Probability for Game Designers.”

If you haven’t read it, you really should, and that’s not to say it is full of good stuff (the article is tripe, actually). I just recommend reading it because if you don’t, you might be unprepared for the silliness that may ensue during this serious *ahem* and erudite *cough* discussion of statistics.

This article focuses on a few select statistical topics that I believe should be understood by game designers. In particular, statistics really is useful and important for system designers, mechanicians, balancers, and other subclasses of designer that are usually relegated to steerage.

Disclaimer taken care of, let’s move on to the fizzy stuff!


Statistics: A Two-Drink Minimum Science

Although heavily grounded in mathematics, statistics is...well...weird! Seriously - if you ever have to start dealing heavily in two-sided confidence intervals and Student’s T-tests and chi-squared tests (or anything else squared, for that matter), it can get a little hard to digest at times.

The Secret Badge of Statisticians Everywhere

You see, people like me really prefer physical metaphors. I’ve always liked physics and mechanics, because a lot of the time you can give yourself a reality check simply by analyzing reality. When you’re calculating the rate and direction at which an apple falls from a tree, you can reality check it in your head if your result says the apple should shoot off straight upward at 1,224 MPH.

At its best, statistics is understandable and rational; at its worst, it’s a little strange. Hence, I recommend libations and togas for any involved statistics discussion. I have asked the fine editors at Gamasutra to provide such togas and an open digital bar. What, didn’t you get your passcode? Hmmm, weird.

In any case, the topics in this article aren’t weird at all. For the most part, they are tangible, crunchy bits of statistics that you can develop gut feels for.

Statistics: The Dark Science

Statistics is, of all the sciences, the one most prone to misuse by the Forces of Evil. That is, if you had to attribute one science to the villain you are creating for your new book (you are writing a book, aren’t you?), you could do much worse than pick statistics. You could also give him a cape, dress him in black, and refer to him as “The Spider” or “Mr. Jones”, but I digress.

The reason that statistics can be loosely compared to villainy is that, used improperly, this branch of science can be called upon to infer all sorts of relationships that aren’t actually meaningful or even true (see the end of this article for an example of what I mean). When in the hands of politicians and other ne’er-do-wells, this can guide big decisions. Big decisions based upon inaccurate conclusions are never good.

All this is to say, statistics is incredibly useful and helpful when used properly. But like any superpower, it can be applied in nefarious ways, or even just plain dumb ways.

Statistics – What’s All The Fuss About?

I was going to crack my knuckles and write a tight summary, but then noticed that Wikipedia already had something that was darn near poetry. Here it is:

Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used for making informed decisions in all areas of business and government. (Courtesy Wikipedia.org)

That’s actually a very moving passage. In particular, the last bit is the tour de force of the paragraph:

...it is also used for making informed decisions...

Of course, the writer forgot to add “in game design,” but we can forgive him his condescension towards our burgeoning industry.

Here’s my own try:

Statistics is a mathematical science that deals with collecting and analyzing data in order to determine past trends, forecast future results, and gain a level of confidence about stuff that we want to know more about. (Courtesy Tylerpedia)

And if I were to modify it for Game Design, I would say (and am, in fact, saying):

Statistics can help you shine a flashlight upon your broken mechanics and shattered design dreams. It does this by giving you actual hard, scientific data to support meaningful design decisions.

What Do We Need to Know?

Statistics, like any hard science, is deep and complex. Like the tour of Probability in Part 1, this article only touches on a few selected topics that I, in my unlimited hubris, have deemed Important Enough to Know®. (Yep – unlike the many TMs I throw around, this one is so potent it’s registered!)

Pop Quiz Again

I’m sad to say that I have resorted to another test. Don’t hate the Quizza, hate the Quiz.

A Taxed Quizzee

Q1a) Focus testers have just finished playing through a level in your new snail racing game “S-car GO!” Twenty testers played, and you are informed that the lap times came back in a range from 1 min 24 seconds at the low end to 2 min 32 seconds at the high end. You were expecting an average time of 2 minutes or so. Was the test a success?



Q1b) You collect more data for the same level, do some analysis, and find that the stats are: mean = 2 min 5 sec, standard deviation = 45 sec. Should you be satisfied?

Q2) You design a casual game that will surely soon be the talk of soccer moms everywhere (an admirable goal). In final QA, you release a beta build and then take data on a whole bunch of trial sessions. Over 1,000 play sessions are recorded, with over 100 unique players (some players were allowed to play repeated sessions). Crunching the data shows a mean score of 52,000 pts with a standard deviation of 500 pts. Is the game tuned up enough to release?

Q3) You design an RPG, and then collect data on how long it takes new players to progress from level 1 to level 5. The data comes in as follows: 4.6 hrs, 3.9 hrs, 5.6 hrs, 0.2 hrs, 5.5 hrs, 4.4 hrs, 4.2 hrs, 5.3 hrs. Should you calculate the mean and standard deviation?

Populations and Samples

The base of statistics is the analysis of data. When dealing with data, there are two main terms that you need to know:

  1. Population: the entirety of a field for which measurements are to be taken. The population is arbitrary, and is dependent only on what you wish to measure. For example, say you want to know what people think about a particular issue. Your chosen population could be all of the people on earth, all of the people in Iowa, or just all the people on your street.

  2. Sample: a portion of the population for which measurements are actually taken. For very obvious reasons, it’s often too hard to gather data for an entire population. Instead, you gather data for a portion of the population. This is your sample.

Accuracy and Sample Size

The strength of a statistical conclusion is extremely sensitive to the size of your sample.

In a perfect world, you’d always like your sample size to be equal to your population--that is, you want to collect data on the entirety of whatever matters to you, because anything less means you have to infer trends (a mathematical inference, but an inference nonetheless). Furthermore, the more data points, the better; you’d rather have a giant sample than a tiny one.

Marketers and politicians would give their left brains to get a sample that is equal to their (large) population of interest. For example, instead of polling 10,000 junior high school kids to get an idea of how they feel about Fruit Roll-Ups®, imagine if they could poll *every junior high school kid*. Failing that, polling 1,000,000 would be super. Failing that, 100,000 would be dang nice. Failing that…okay, 10,000 will do.

It is for reasons of time and money that studies are performed on samples rather than entire populations.

  1. The Common Sense Rule of Statistics: mo is bettuh

You can’t predict a trend with one data point. If you know I like chocolate ice cream, you can’t draw any meaningful conclusions about what all Sigmans like. Now if you ask many members of my family, then you might be able to draw a reasonable conclusion about what the rest think...or at least know *whether* you can draw a reasonable conclusion. Ain’t stats fun?
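To see "mo is bettuh" in action, here is a minimal Python sketch. It assumes a fair six-sided die stands in for the thing being measured (true mean 3.5) and checks how far a sample's average typically lands from that true mean as the sample grows. The function name, the trial count, and the seed are illustrative choices, not anything from a real study.

```python
import random
import statistics

def average_error(sample_size, trials=200, seed=1):
    """Typical gap between a sample's mean and the true mean of a d6 (3.5)."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    true_mean = 3.5
    gaps = []
    for _ in range(trials):
        rolls = [rng.randint(1, 6) for _ in range(sample_size)]
        gaps.append(abs(statistics.mean(rolls) - true_mean))
    return statistics.mean(gaps)

for n in (1, 10, 100, 1000):
    print(f"sample size {n:>4}: typical error {average_error(n):.3f}")
```

With one data point the estimate is usually way off; each tenfold increase in sample size shrinks the typical error by roughly a factor of three.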

Population Explosions and Wide Distributions (BEEP! BEEP!)

For reasons that only The Big Guy can explain, many things in life tend to follow similar patterns, or distributions.

One of the most common is the aptly-named “normal distribution.” That’s right, anything not matching this is abnormal, and therefore weird (and should be shunned appropriately).

The normal distribution is also known as a “Gaussian” distribution, primarily because “normal” doesn’t sound scientific enough.

The normal distribution is also commonly called a “bell curve” because, well, just look at the durned thing, will ya!?

Normal “Bell Curve”
Standard Form (variance of 1, mean of zero)
*Image Courtesy Wikipedia.org

The distinguishing characteristics of a bell curve distribution are that most of the population are clustered closely around the mean, or average, value, and comparatively few are scattered at the extremes (high or low). This middle-clustering leads to the bell-curve appearance; the highs and lows are the flange of the bell.

We see the bell curve around us in a million different things. If you measured the heights of all the people in your city, they’d probably match this distribution. That is, a tiny few would be super-abnormally short, a tiny few would be super-Yao Ming tall, and a great many would be within a few inches of the average.

The bell curve typically holds true whenever you are looking at people’s skill levels, too. Take sports - a tiny few are good enough to play professionally, a great many are good enough to get by, and a tiny few are so bad that they don’t get picked to be on teams (like me).

Other Distributions

The normal distribution, despite being swell, isn’t the only distribution around. It’s just amazingly common.

For examples of some additional distributions that are directly related to gaming and game design, just take a look at the probability distributions of dice throws, in this case a d6 and then a 2d6 throw:

D6 Distribution

 

2d6 Distribution

In part 3 of this series, which should hit Gamasutra shelves around 2010, I’m going to spend a bunch more time talking about these dice distributions. For now, all I’m going to say is that the first one looks nothing like a bell curve, whereas the second throw is starting to resemble one (but still isn’t quite there yet).
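The two dice distributions can be reproduced exactly by counting every possible roll. The Python sketch below does that for fair dice; `dice_distribution` is a hypothetical helper name, and the text bars are only a rough stand-in for the charts.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def dice_distribution(num_dice, sides=6):
    """Exact probability of each total when rolling num_dice fair dice."""
    faces = range(1, sides + 1)
    totals = Counter(sum(roll) for roll in product(faces, repeat=num_dice))
    outcomes = sides ** num_dice
    return {total: Fraction(count, outcomes) for total, count in sorted(totals.items())}

for dice in (1, 2):
    print(f"{dice}d6")
    for total, p in dice_distribution(dice).items():
        bar = "#" * round(float(p) * 36)  # one '#' per 1/36 of probability
        print(f"  {total:>2}: {bar} {float(p):.3f}")
```

The d6 comes out flat (every face at 1/6), while 2d6 peaks at 7 and tapers off toward 2 and 12.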

Means to an End

Consider this tiny section an intermission embedded within an otherwise tedious article. This tiny, self-referential section serves only one purpose in life: to remind you of what a “mean” is. This tiny, self-referential, and pedantic section would like to passively remind you that a mean is the mathematical average of a set of data.

This tiny, self-referential, pedantic, passive, and well-meaning section hopes that you take something meaningful away from reading it; for it is now that this tiny, self-referential, pedantic, passive, and pun-throwing paragraph must end.

Variance and Standard Deviation

Variance and standard deviation are very important to understand, and have a lot of tangible value. Aside from helping us draw valuable statistical conclusions, these terms enable us to speak a lot more intelligently about distributions. Instead of saying “a great many data points cluster about the middle”, we can say “68.2% of the sample falls within one standard deviation of the mean.” Chicks dig that speak; guys dig that speak; heck, who doesn’t dig that speak?

Normal Distribution with Standard Deviation Bars Shown
*Image Courtesy Wikipedia.org
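The "within one standard deviation" figure is easy to check with Python's standard library. This sketch assumes a perfectly normal population; `share_within` is an illustrative name. The result for one standard deviation comes out a hair above the 68.2% quoted above, which is just a rounding difference.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```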

Variance and standard deviation are related to each other, and they both measure the same thing: data scatter. Intuitively, a high variance or standard deviation means your data is all over the place. When I play darts, I get a high variance in my throws.

Variance and standard deviation can be easily calculated from any set of data that you have. I’d put the equations in here, but that would break my “don’t sound like a textbook” rule. So instead of an equation, here’s a description:

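Standard Deviation: roughly, the typical amount by which data points in the sample or population differ from the mean. (Strictly, it is the square root of the average squared difference, so big misses count for more than small ones.) Standard deviation is represented by the Greek letter σ (sigma).

Variance: the standard deviation squared, σ². It measures the same scatter, just in squared units.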

In other words, say you test 100 people on how long it takes them to complete Level 1 in your newest game. Let’s assume the average (mean) of all the data is 2 minutes 30 seconds. Now assume the standard deviation calculates out to be 15 seconds. This standard deviation indicates the grouping or “clumping” of the play sessions. In this case, it’s saying that on average, play sessions are within ±0.25 minutes of 2.5 minutes. That’s pretty consistent.
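For anyone who does want to see the calculation without the textbook, here is a small Python sketch. The eight completion times are made up purely for illustration; the steps are the standard ones (mean, then average squared distance from the mean, then square root).

```python
import math
import statistics

# Made-up Level 1 completion times in seconds (illustrative only).
times = [135, 150, 165, 140, 160, 150, 170, 130]

mean = statistics.mean(times)

# Variance: the average squared distance from the mean.
variance = sum((t - mean) ** 2 for t in times) / len(times)
# Standard deviation: the square root of the variance.
sigma = math.sqrt(variance)

print(f"mean = {mean:.1f} s, variance = {variance:.1f} s^2, sigma = {sigma:.1f} s")

# The standard library gives the same "population" figure (pstdev), plus a
# "sample" version (stdev, dividing by n - 1) for when the data is only a
# sample of a bigger population.
print(f"pstdev = {statistics.pstdev(times):.1f} s, stdev = {statistics.stdev(times):.1f} s")
```

When the recorded sessions are only a sample of all the sessions that will ever be played, the slightly larger `stdev` figure is usually the safer one to quote.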

What does this mean and why do you care? Easy. Pretend that instead of the above results, you got these results:

Mean = 2.5 minutes (same as above)
σ = 90 seconds = 1.5 minutes

So here we have the same mean but a vastly different standard deviation. This set of numbers means that you have much more scatter in the play times. On average, play times are about 90 seconds off of the mean play time. Given that the mean play time is only 2.5 minutes, that’s huge! And it’s probably not good to have that much scatter, for various game design reasons.

It would be much different if you were talking about a standard deviation of 90 seconds (1.5 minutes) on play times of 15 minutes.

Consistency is measured by a small standard deviation. Ratio your standard deviation against your mean to get a good warm-fuzzy number. In the first example, 15 sec / 150 sec = 10%. In the second, 90 sec / 150 sec = 60%. A standard deviation of 60% is bigggggg with indulgently repeated g’s. In the third, 90 sec / 900 sec = 10% again…respectable.
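The ratio trick is a one-liner in code. This sketch reruns the three examples from the text; `relative_spread` is a made-up name for what statisticians call the coefficient of variation.

```python
def relative_spread(std_dev, mean):
    """Standard deviation as a fraction of the mean (coefficient of variation)."""
    return std_dev / mean

# (standard deviation, mean), both in seconds, taken from the examples above.
examples = {
    "15 s scatter on a 150 s level": (15, 150),
    "90 s scatter on a 150 s level": (90, 150),
    "90 s scatter on a 900 s level": (90, 900),
}

for label, (std_dev, mean) in examples.items():
    print(f"{label}: {relative_spread(std_dev, mean):.0%}")
```

The ratio only makes sense for measurements that can't go negative and have a mean well away from zero, which is true of play times and most score-like numbers.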

This is not to say that a large standard deviation is *always* bad. Sometimes as designers we want a large standard deviation in whatever we’re measuring. But a lot of times it’s bad, because it represents a lot of scatter and variability.

The important thing is that calculating standard deviation will tell you a lot about your game/mechanic/level/etc. Examples of useful things to measure standard deviation for:

  1. Level play times

  2. Whole-game play times

  3. Number of combat rounds it takes to defeat a typical enemy

  4. Number of coins collected (games with small Italian plumbers)

  5. Number of rings collected (games with fast, blue hedgehogs)

  6. Times controller is thrown at screen during your tutorial

Margins of Error

Margins of Error go hand in hand with statistical conclusions. Think of every Gallup Poll you’ve ever seen; there is always a margin of error expressed, such as ±2.0%. Because polls are using samples to estimate a population, there can never be 100% confidence (see later in the article). Margin of Error indicates how accurate the results are. It is absolutely vital to know Margin of Error whenever you are talking about a population bigger than your sample.

If you take data on your entire population, then theoretically you don’t need a Margin of Error – you already know all the data! For example, if I ask everyone on my street whether they prefer Chess or Go, then I don’t need a Margin of Error as long as I am just reporting about people on my street. But if I want to draw a conclusion about everyone in my town based upon the data points from my street, then I have to calculate Margin of Error.

The bigger your sample size is, the smaller your Margin of Error. Mo data is bettuh.
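As a rough illustration of how sample size drives Margin of Error, here is a Python sketch using the usual textbook approximation for a yes/no poll question. It assumes a simple random sample from a much larger population and a 95% confidence level; real pollsters apply further corrections, so treat the numbers as ballpark figures.

```python
import math
from statistics import NormalDist

def margin_of_error(proportion, sample_size, confidence=0.95):
    """Approximate margin of error for a yes/no poll result.

    Assumes a simple random sample drawn from a much larger population.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# A 50/50 split is the worst case, so it makes a conservative default.
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: +/- {margin_of_error(0.5, n):.1%}")
```

Under these assumptions, it takes roughly 2,400 respondents to get down to the ±2.0% you see quoted on polls.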

(Self-)Confidence Intervals

You can use inferential statistics to draw conclusions about future data. One useful trick is the calculation of confidence intervals. Conceptually, confidence intervals are closely related to standard deviation, and are basically a mathematical way of saying how certain we are that a given piece of data will fall in a specified range.

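Confidence interval: a mathematical way of saying “we can guarantee with A% confidence that B% of the data will be between values C and D.” (Statisticians usually call this two-percentage form a “tolerance interval” and reserve “confidence interval” for a range around a single estimated quantity, such as a mean. The spirit is the same.)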

That’s a mouthful. But it’s useful to know, with a specified amount of confidence, what a value is likely to be. For a good example, I’m going to step back into my previous career for a blissful yet ultimately unsatisfying moment:

I used to do stress analysis and design of aircraft bits and bobs. If you know, or need to know, anything about aircraft - and commercial aircraft in particular - it’s that it is the most regulated form of transportation that exists. People don’t like it when wings fall off of planes. ‘nuff said.

One of the methods we engineers use to keep said wings on said planes is designing to a very high confidence interval of material strength properties. A typical confidence interval used for aircraft design is the “A-basis allowable”, which means we are 95% confident that 99% of the values in any given shipment of a specified material fall above a certain value. Then, we design to that value against the worst possible air conditions, and then finally apply a big factor of safety on it. Gotta be sure.

Confidence intervals are very informative and useful whenever you *really want to know* what kind of data values to expect. Fortunately, games are not typically a matter of life and death, but if you are trying to balance an (unpatchable) console game, you probably want to have more than gut feel and intuition to go on. Calculating confidence intervals could be used to give you hard facts about how your game plays, and whether there are obvious exploits.

Whenever you want to calculate good confidence intervals, the ol’ standby rule of statistics still holds true: mo is bettuh. The more data points you have in your sample, the better your confidence interval calculation will be.
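Here is a hedged Python sketch of the simplest kind of confidence interval: a range for the true mean play time, based on the earlier Level 1 example (100 testers, mean of 150 seconds, standard deviation of 15 seconds). It uses the normal approximation, which is an assumption that works reasonably for samples this size; small samples call for the Student's t distribution instead. Note that this is a statement about the mean, not the fancier "A% confident that B% of the data" form used for aircraft materials.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Approximate confidence interval for the true mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sample_std / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Same mean and scatter, different numbers of recorded play sessions.
for n in (25, 100, 400):
    low, high = mean_confidence_interval(150, 15, n)
    print(f"n = {n:>3}: 95% confident the true mean is between {low:.1f} s and {high:.1f} s")
```

Quadrupling the number of play sessions cuts the width of the interval in half, which is "mo is bettuh" with a price tag attached.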

You Can Never Be Sure

This brings up another rule of statistics (and probability, actually):

100% Does Not Exist: You will never achieve a confidence interval of 100%. You can never guarantee through inferential statistics that a predicted data point will be of a certain specified value.

The only sure things in life are death, taxes, and the inability to find the last Yeti Hide you need when trying to complete a World of Warcraft quest. Accept these facts and move on.

Misappropriation
I mentioned earlier that statistics works as a skill of villainy. To illustrate why, I wrote this short, bullet-form love poem:

Sonnet 1325: Beautiful statistics, let me count the ways that I abuse and misuse you.

  1. Misunderstanding

  2. Not stating confidence intervals

  3. Discarding valid conclusions because you don’t like them

  4. Drawing conclusions based upon flawed or influenced data

  5. Sportscaster errors – blending errors of probability and statistics

  6. Drawing conclusions based upon unrelated factors

Misunderstanding
People misunderstand statistical statements all the time. I know, it’s hard to believe.

Not Stating Confidence Intervals or Margins of Error
Confidence intervals and margins of error are vital pieces of information. There is a huge difference between saying 43% of PC owners have purchased a downloadable game in the past 30 days (Margin of Error 40%) and the same statement with a MoE of 2%. When MoE is left out, always assume the worst. Remember, small sample = high MoE.

Discarding Valid Conclusions Because You Don’t Like Them
When used properly, statistics don’t lie. But people lie to themselves all the time. We see this a lot in politics, where statistical studies will be ignored simply because the conclusions don’t match those that were hoped for. Same thing sometimes happens with focus groups. Of course, we also see statistics misused terribly in politics, so it’s a wash, I guess.

Drawing Conclusions Based Upon Flawed Data
This one happens a lot, especially in market research. Your statistical conclusions are only as good as the data you make them from. If the data is flawed, then the conclusions are worthless. Flawed data can come in a variety of forms, with causes ranging from honest errors to severe manipulation. Asking loaded questions is one easy way to get flawed data that supports whatever conclusion you were hoping to make anyway. “Do you prefer Product X, or that crappy Product Y that only idiots use?” quickly leads to seemingly bullet-proof statements like “95% of consumers prefer Product X!”

Sportscaster Errors

Sportscasters are the shamans of our day. They take a little statistics, a little probability, a little gut feel and then mix them together to make something terrible. If you ever want to see a bunch of statistics thrown around with tenuous conclusions that typically have no basis, just watch a football game.

For instance, an announcer might say that “Team A hasn’t blocked a kick against Team B in the last 5 games.” The dangling conclusion is that Team A is less likely to block a kick than if they had done so in the last 5 games versus Team B. But you could say the same about the reverse--maybe they are more likely since they haven’t blocked one in a while!

The truth is, there isn’t enough information to say either one. And it’s probably more a matter of probability, anyway. Does the chance of blocking a kick really depend on whether one was blocked the game before? They are probably independent events, unless there are recognizable interrelated factors.

This is not to say that all sports conclusions are flawed. Statistics is very important to baseball, for instance. Statistical analysis sometimes guides what pitch is thrown or what the batting lineup will be.

It all comes down to data: when you have a lot of data, you get better statistical conclusions. Baseball supplies a lot of data: 162 regular-season games per team! With football, there just aren’t enough games to go around, so Margins of Error are bigger. I’m not exactly saying statistics is never useful for football...it is just harder to mine useful, contextual data.

Drawing Conclusions Based Upon Unrelated Factors
When you compare two trends side by side, it’s easy to infer deeper relationships that don’t actually exist. My all-time favorite example of this is the well-known Pirates vs. Global Warming graph featured in the CHURCH OF THE FLYING SPAGHETTI-MONSTER’S Open Letter to the Kansas School Board:

http://www.venganza.org/about/open-letter/

Please, for the love of all that is statistical, go look at the graph contained in that article. PLEASE, I BEG YOU!

Please, Can We Just Bookend the Quiz and Be Done?

Okay, okay, I hear you.

Q1a ANSWER – Level Times
The answer to this one is easy: you haven’t been given enough info to calculate the average yet. Just because the values ranged from 1:24 to 2:32 doesn’t mean they average out at 2 minutes. (Those two numbers average to 1.97 minutes, but we don’t know the other 18 results!) You need to know all 20 results to calc the average, and you really ought to calc the standard deviation as well...see below.
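To make the point concrete, here are two invented sets of 20 lap times (in seconds) with exactly the same fastest and slowest laps but wildly different averages. Both data sets are hypothetical extremes, chosen only to show how little the range tells you.

```python
import statistics

# Two made-up sets of 20 lap times, both ranging from 1:24 (84 s) to 2:32 (152 s).
fast_field = [84] * 19 + [152]   # nearly everyone finishes quickly
slow_field = [84] + [152] * 19   # nearly everyone finishes slowly

for label, laps in (("fast field", fast_field), ("slow field", slow_field)):
    print(
        f"{label}: min = {min(laps)} s, max = {max(laps)} s, "
        f"mean = {statistics.mean(laps):.1f} s"
    )
```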

Q1b ANSWER – Level Times Part Deux
Okay, in this case you probably shouldn’t be satisfied because the standard deviation is pretty high: 45 sec / 125 sec = 36% of the mean. This sounds like a bit too much variation in your level. There is potentially a sizable exploit that skilled players are using to their advantage. Alternatively, you might be punishing less-skilled players too much. As the game designer, you ultimately have to be the judge as to whether these results (high variation) are intended.

Q2 ANSWER – Soccer Moms
Stats only gets you part of the way there; you still need game design smarts. In this case, the score grouping is *way too close*...to have a standard deviation that low (500 / 52000 ≈ 1%) means you are getting hardly any score variation, which means in turn that differences in player skill aren’t really mattering in the end game result. Therefore, players will most likely be turned off because they won’t see much of a progression in their scores as they get better at the game.

Here’s a situation where you’d really love to see a much higher standard deviation, because that hopefully shows that increased skill leads to increased scores. In other words, your current game scores the same no matter who plays it.

Q3 ANSWER – Play Times
This one is sorta tricky and underhanded but illustrates an important point about data collection: you need to watch out for obviously bad data. That one value, 0.2 hrs, looks suspiciously like an error. Could be a typo, could be an equipment malfunction, who knows. In any case, you should either convince yourself without a doubt that the 0.2 hrs is a valid data point before doing any calculations with it, or just throw it out and perform your calcs on the remaining data points.
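A quick Python sketch shows how much that one suspicious value moves the numbers. It uses the eight play times from the question; whether to actually drop the 0.2 hrs point is still a judgment call for the designer, and the code just shows both versions side by side.

```python
import statistics

# Play times from Q3, in hours.
hours = [4.6, 3.9, 5.6, 0.2, 5.5, 4.4, 4.2, 5.3]
suspect = 0.2
cleaned = [h for h in hours if h != suspect]

for label, data in (("all eight points", hours), ("without 0.2 hrs", cleaned)):
    print(
        f"{label}: mean = {statistics.mean(data):.2f} hrs, "
        f"std dev = {statistics.stdev(data):.2f} hrs"
    )
```

The single bad point drags the mean down by more than half an hour and more than doubles the standard deviation.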

Insert Other Cool Stuff Here

In efforts to keep this article under 723 pages, I have to skip over many other intriguing topics. Suffice it to say that a good understanding of statistics will help not only your game design, but your consumer decisions, voting decisions, and financial decisions. I’m 23.4% sure that at least 40% of what I just said is true.

As a designer, statistics is most useful when crunching data from a set of recorded play sessions (your sample), and trying to form conclusions about a larger field of unrecorded play sessions (your population).

Learn By Doing

For example, in the game I just finished, we recorded data from play sessions and then set challenge levels in the game based upon the mean and standard deviation values from those recorded data. We set Medium difficulty to be equal to the mean values, Easy difficulty to be equal to the mean minus a certain number of standard deviations, and then Hard difficulty equal to the mean plus a certain number of standard deviations. Had we collected much more data, it would’ve actually been accurate!
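That scheme can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the function name, the recorded scores, and the choice of one standard deviation as the step between difficulty levels. It is not the actual code or data from the game.

```python
import statistics

def difficulty_targets(recorded_values, spread=1.0):
    """Derive Easy/Medium/Hard targets from recorded play-session values.

    spread is how many standard deviations separate the difficulty levels.
    """
    mean = statistics.mean(recorded_values)
    sigma = statistics.stdev(recorded_values)  # sample standard deviation
    return {
        "easy": mean - spread * sigma,
        "medium": mean,
        "hard": mean + spread * sigma,
    }

# Hypothetical scores recorded from five play sessions.
recorded_scores = [900, 1000, 1100, 950, 1050]
targets = difficulty_targets(recorded_scores)

for level, target in targets.items():
    print(f"{level:>6}: {target:.0f} pts")
```

If the measured quantity is one where lower is better (a lap time, say), the Easy and Hard targets would swap sides of the mean.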

Just like probability theory, statistics becomes more and more useful the bigger and bigger the scope of your project. A lot of the time, you can fumble your way through without applying any formal theory in either case. But the bigger your game, the bigger your audience, and the bigger your budgets, then the more there is to risk from embedded flaws in an unbalanced, seat-of-the-pants designed game.

Stats, like probability, won’t do your game design work for you. It’ll just help you do it better!

The Long Road Ahead

In the rousing conclusion to this series, I’ll be taking bits from parts 1 and 2 and then putting them together in ways that actually have some relevance to games. Or I’ll croak trying!

Thanks for reading, and ciao for now.


Attributions:
*The Wikipedia images used in this article are licensed under the GNU Free Documentation License.


About the Author(s)

Tyler Sigman

Blogger

Tyler Sigman (he/him) is the co-president, co-founder, and game design director for Red Hook Studios, makers of Darkest Dungeon I and II. He has designed over a dozen other published videogames and boardgames, including the BAFTA-nominated turn-based "Age of Empires: The Age of Kings" (Nintendo DS), the twin-stick dragon shooter "HOARD" (PC, PS3), the boardgame "Crows" and more. His favorite game of all time is Sid Meier's Pirates! for C-64. He can be reached at tyler at redhookgames dot com.
