
Rising from the Ranks: Rating for Multiplayer Games


Bernd Kreimeier

February 9, 2000


Scores and ranks have been used in arcade and single player games from the early days on. Players itch to compare their performance, and they want to know where their peers and equals are. With games like Quake3: Arena and Unreal Tournament geared towards one-on-one and free-for-all deathmatch, the question of proper rating and ranking is bound to return with a vengeance. Let's settle this once and for all.

The Fun Factor

If I were pressed to define what makes an interactive game "fun", I would put my tongue firmly in my cheek and come up with something you could call a "control theory of games": A game can be experienced as "fun" if you can make a difference.

Take this as the "weak" causality of fun. It merely requires the game to be "interactive". In its minimal form it implies that your actions will have some effect on the outcome. In other words: if you might as well be absent, chances are you should leave.

Affecting the chain of game events doesn't necessarily mean that you are winning the match, which leads us to the "strong" causality of fun: A game can be experienced as "fun" only if you can win. The elusive quality we call fun originates from the confidence that you could have won the game - that your actions can indeed affect the chain of game events, maybe enough to let you come out on top. You don't have to win to enjoy the game, but you expect a fighting chance.

In his Glory and Shame, Jonathan Baron describes competitive multiplayer reality, and postulates that public humiliation drives away the casual gamer. Online gaming, especially when implemented as free-for-all deathmatch or one-on-one tournaments, offers "strong fun" only to a few, and little of the "weak fun" for many, if not most. Of course, not everybody can win, and certainly not at the same time. Yet, there are means to make the game more fun for more people. Inevitably, any such attempt requires a game to estimate player skill, and either match equally skilled players, or compensate for skill differences.

Estimating player skill is where it all begins, and that's what rating is all about. You are in for a bit of theory here, so I will give you the bottom line right away: if you are doing a competitive multiplayer game without proper rating, then your game design has a potentially fatal flaw.

DOOM introduced "frag" counting, and with it the notion that all players are equal. This, however, is far from true. Competitive multiplayer gaming has yet to attract the casual gamer, to find success beyond the hardcore gaming crowd. One reason is that frag counting encourages the strong player to prey on the weak. Scores create feedback loops: the winning strategy that "solves" frag counting is to find and kill the weakest player, thusly giving them the kind of initiation that will drive them out quickly, no matter how easy your GUI makes it to get on a server and join a game.

Skill estimates based on proper rating are the first step to change these inherently imbalanced game mechanics. Better yet, it takes only about a dozen lines of code to add a mathematically sound rating calculation to your game. You can even apply it on a per-session basis without introducing persistent player IDs (see Dynamix' Tribes2 for an example of player ID management), and still improve the situation.

Intermission Screen

Definitions: a score is a simple counter triggered by events, e.g. "kills accomplished" or "total number of items picked up so far". Ranking is a measure by direct comparison: A beats B, thus A ranks above B. The ladder sorts all players based on the outcome of their last direct encounter: your place in the ladder is your ranking.

Rating extends ranking by introducing a scale: A is twice as good as B. Rating also differs from a score in that we no longer simply count events - we weigh them. You will sort players by their rating, and derive ranks, but there is a bit more to it: rating attempts to measure your actual ability. Sports and competitive games have used rating mechanisms for decades now. The state of the art pretty much dates back to 1960, when Arpad Elo revised the rating system for chess, and we are following in his footsteps.

From single player game scores, we can learn the properties any successful rating algorithm should have: it must be

  • Reproducible

  • Comprehensible

  • Sufficiently detailed

  • Well balanced

We need reproducible results. Given identical performance, the algorithm should end up with nearly identical scores. Players with comparable ability should end up with comparable rank. Reproducible also implies that the meaning of a given score should not change over time. This might be irrelevant for games with an estimated lifetime of a fiscal quarter, but the shelf life of multiplayer games extends farther and farther.

Our scores also have to be comprehensible. The average player has to instantly and intuitively understand scores. Scores have to match the player's ideas of significance. Common examples like "confirmed kills", "deaths suffered", "shots fired", "hits per shot", "frags per minute" make sense even without comparing them to other players' results.

Such performance indicators are also detailed enough. Detail and balance issues are raised by multiple scoring, a common problem for single player, but also for team multiplayer. Single player DOOM listed time, kills, items, and secret areas revealed - a combination that allowed players to subscribe to different styles like "fast" or "thorough". Multiplayer DOOM had only one score, the now classic "frag", and we will take this as our starting point. For the pure and simple point-and-click multiplayer FPS shooting game - no teams, no flags - the kill or "frag" is the base unit of measurement.

The whole purpose of scoring is compressing minutes if not hours of gameplay into a single number, for ease of comparison and ease of communication (i.e. boasting). For frag counting, a kill is a kill is a kill. Yet, human players are not at all identical. If all frags count equally, we encourage "newbie hunting", a gameflow in which players dominate by preying upon the easiest victims, ignoring their peers - the net result being an utterly imbalanced game. Any capable player who chooses to spare the struggling beginners will be beaten by those who target them specifically.

To change this, we have to assign a value to each single frag, rewarding those that target their peers. In multiplayer deathmatch, the quality of a kill does matter - the skill of your opponents matters. But how do we measure this elusive quality, skill?

Figure 1: Example skill distribution for 50 players

Jacob's Ladder

Nobody really cares about numbers. The only thing the social animal does care about is rank. Your first attempt at a solution could be a ladder. Players join by challenges, from the bottom, or by choosing an opponent. Each time two players fight, the winner gets the rank of the loser, or keeps her own rank, whichever is higher. Draws will be ignored. A ladder lacks all hints on how large the difference in abilities between two players really is, or any other linear scale of reference to compare players, thus making it hard to determine a worthwhile challenge. There is also no straightforward way two players could have equal rank.

The ladder has a few things going for it, though. The history of the Elo system teaches us that simplicity is an important property. Simple solutions are good solutions. The less detailed a ranking scheme, the fewer free parameters, the less argument between players. The process of match-and-swap offers us a first insight into the true nature of ranking: we are actually trying to solve a minimization problem here, namely an error function like

Error = Sum over all players |true_rank - current_rank|^2

We have no means to measure the true rank except by comparison with another player - in other words, by observing a game. Each actual match is a sample, a data point during our quest for the true player hierarchy. Ranks might be time-dependent (if player abilities change), and our results are blurred by noise (because details we know nothing about may have affected the outcome of a match), but this is as close as we get. We sample (let 'em play), and update our estimates according to the result, hoping that, more often than not, we are getting closer to the truth, and that things will sort themselves out as long as people keep playing all over the board.

You can see that this minimization task is combinatorial by nature (an assignment problem): all permutations of players on ranks (without repetitions) are possible solutions. This is a well known, hard problem, and our approach is equally well-known, and usually called iterative improvement by pairwise exchange:

// rank 1 is the top of the ladder: an upset means the loser was ranked above the winner
if ( loser_rank < winner_rank )
    swap( winner_rank, loser_rank );

We gained nothing if the outcome of the fight matches our expectations (we expect a higher-ranked player to win), otherwise all we know is that whatever the real ranks, changing places will at least get us closer. All we care about is who won - it does not matter whether the score was 2:1 frags, 100:99 or 10:1.

It might be necessary to introduce a penalty for ignoring challenges, as well as rematch regulations and random selection of opponents, or other incentives to keep things going. Pairwise exchange means high risk for the high-ranked player, and no risk for a low-ranked player. Restricting challenges to neighboring ranks slows down the dynamics. This means less instability if a game is not too reproducible (random noise affects results), but it will take much longer for the system to reach equilibrium and minimize the error. We will encounter this trade-off again.

Finally, pairwise exchange based on final frag count means that you have to enforce one-on-one gaming. If you try this in a Free-For-All situation, your results might turn out to be useless. Ladders are tools for tournaments, not so much for multiplayer crowds. However, ladders might work for more than two players on a per-game basis if you determine winner and loser per event, per frag. Iterative improvement works best if you swap often, to use all information available, and to avoid accumulated errors. The more players participate, the more adequate the ranking - another important lesson. The sorting obtained on a single server can afterwards be used to re-assign the slots in a global ladder as well.
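As an illustration, here is a minimal sketch of such per-frag pairwise exchange for a single session; the names (slot, OnFrag, MAX_PLAYERS) are made up for this example:

#include <utility>

const int MAX_PLAYERS = 64;   // assumption for this sketch
int slot[MAX_PLAYERS];        // slot[i]: player i's ladder position, 0 is the top

// Call once per frag event: 'winner' just fragged 'loser'.
void OnFrag( int winner, int loser ) {
    // Upset: the loser was ranked above the winner - swap their slots.
    if ( slot[loser] < slot[winner] )
        std::swap( slot[winner], slot[loser] );
    // Expected outcome: no new information, leave the ladder alone.
}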

Figure 2: Example skill distribution for 500 players


Scores To Settle

Let's get back to the value of a frag. Our initial idea was that we want to accurately measure an unknown quality, the player's skill or "strength", her ability to compete against others. For simplicity's sake let's pretend this could be measured by a single number. Of course, in all likelihood this will be about as just and accurate as the infamous IQ, but at least it is some indicator. Let's call our estimate of the strength of a given player a PQ.

If we look at all our players, there will be lots of average players, some really bad ones, and some outstandingly good. This is called a probability distribution in PQ. Figures 1, 2 and 3 show example distributions for small and large player populations.

Here is our link to the outcome of an actual match beyond simply determining the winner. Remember that all our swapping and sorting is an attempt to approximate the players' strength - to solve the problem faster, we either need more or better samples, or we need to make better use of each sample. An algorithm that outperforms our simple pairwise exchange, converging faster to a better estimate of the true player ability, has to make use of more of the information contained in a frag event. If only we knew the difference in PQ between the winner and the loser. But we do not know the PQs, so how do we get them?

Figure 3: Example skill distribution for 50,000 players

Bootstrapping

If we don't have any information, we might as well start anywhere. Everything described in this article rests on making certain assumptions about the players, their real strength, and our ability to measure it. For example, we will pretend that PQ (and thus real strength) is linear, so we know we can put the zero mark wherever we want, as only PQ differences are presumed important. Our algorithm will be:

 

  • Initialize: determine initial PQ values

  • Repeat:

    1. Predict game result based on current PQ's

    2. Sample actual result from the game

    3. Update PQ's based on difference
      between expected and actual result

Just as with a ladder, we are effectively trying to minimize our approximation error by using game results as statistical samples.

As we will be iteratively improving our estimates, we might as well randomly choose a PQ value for each player as a starting point. No matter how bad our initial estimate, eventually the algorithm will converge. We expect the ultimate results to resemble the examples of player populations in Figures 1, 2 and 3 - bell-shaped, with a lot of average players, and a few good and bad ones.

In other words, we expect that the probability to encounter a player of a given strength can be approximated by a Gaussian distribution (thusly trusting in the Central Limit Theorem's assurance that the world gets simpler the more complex it is).

Distributions can be described by writing down a probability density function (PDF). The histograms in Figures 1, 2 and 3 measure frequency by binning (counting within a PQ interval). Dividing the number of players for a given PQ interval by the total number of players normalizes the result. Calling player strength (i.e. our estimate PQ) the argument x, the density function P(x) measures the probability to encounter a player with a PQ of value x. P(x) is a properly normalized PDF if its integral over all possible values of PQ equals 1 (for any width). Figure 4 below shows such a normal distribution for mean zero and unit width (see also Figure 6). The width or standard deviation is the square root of the variance found in the data. The distribution also has a mean value. Many mathematicians prefer to describe P(x) by mean and variance instead of using the width.

#include <math.h>

const double pi = 3.14159265358979;
double mean_value, variance;  // measured from the player database

// InvN normalizes the integral to 1 for a given variance
double P( double x ) {
    const double InvN = 1.0/sqrt(2*pi*variance);
    return InvN*exp( -(x - mean_value)*(x - mean_value)/(2*variance) );
}

In the real world, you will have to measure the variance of values for a given number of players N. The mean value can be set to zero as long as we consider strength (and thus PQ) linear: we can always determine the average over all values and use it as an offset.

// Sample mean and variance over all N player ratings x[i]
double mean_value = 0.0, variance = 0.0;
for ( int i = 0; i < N; i++ ) mean_value += x[i];
mean_value /= N;
for ( int i = 0; i < N; i++ ) variance += (x[i] - mean_value)*(x[i] - mean_value);
variance /= (N - 1);

 

For this distribution, you can expect 68.3% of your players to fall within the [-width, width] interval (with mean_value being zero). Make the interval twice as large, and you will cover 95.4% of the player population. There is little point in trying to cover the fringe, as [-3*width, 3*width] will only bring you up to 99.7%. The distribution is symmetric with respect to the mean value: 50% of the players will be below the average value, 50% above, 34.15% will be in the interval [-width,0] and [0,width] each, and so on.
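If you want to check these percentages yourself, the coverage of the interval [-k*width, k*width] reduces to the error function; a quick sketch using the standard C math library:

#include <math.h>
#include <stdio.h>

// Fraction of a normal population within [-k*width, k*width]:
// the integral of the Gaussian PDF reduces to erf(k/sqrt(2)).
double Coverage( double k ) {
    return erf( k/sqrt(2.0) );
}

int main( void ) {
    printf( "%.3f %.3f %.3f\n", Coverage(1.0), Coverage(2.0), Coverage(3.0) );
    // prints 0.683 0.954 0.997
    return 0;
}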

Figure 4: Probability density function - normal Gaussian distribution with unit variance and zero mean

Decisions and Uncertainty

Our algorithm requires us to make a guess on the expected result of an encounter between players, based on their relative strength, which in turn is approximated by the current PQ's. How do we calculate this? The simplest decision function is a hard threshold:

 


// return expectation based on PQ
if ( pq > opponent.pq ) return 1.0;  // I win
if ( pq < opponent.pq ) return 0.0;  // opponent wins
return 0.5;                          // a draw

Using the threshold function is appropriate if there is no uncertainty in the game. In the real world, we will have to deal with some uncertainty: e.g. better players will occasionally lose against weaker opponents. Our hope is that the average will be meaningful - that, over time, true skill will show itself in the accumulated results despite random noise.

To account for this, we need a softer decision function, one that is sensitive not just to the sign, but the magnitude of the difference in PQ. The expected average result - the probability of a given player beating a given opponent - should be a continuous, differentiable function of the difference in ratings. If the difference becomes arbitrarily large, the probabilities should approach zero (weaker player wins) and one (stronger player wins), respectively. In the case of two equally rated players, the expected average outcome should be 0.5 - not just a draw, but a 50-50 chance for each player.

The kind of function we are looking for should thus have the following properties:

lim x -> -infinity : f(x) = 0
f(0) = 0.5
lim x -> +infinity : f(x) = 1

 

Moreover, it should be symmetric around ( 0, 0.5 ), that is f(-x)-0.5 = -f(x)+0.5, or f(-x) = 1-f(x). Why? If the players trade places in our calculations, it should just swap the probabilities, and the sum of the outcome probabilities has to equal unity. Finally, it should be strictly monotone: probability always increases, never decreases, with increasing difference in the two players' ratings.

As before, we choose a mathematical function well-known for its nice properties, the Sigmoid Function. This ancient beast from the dungeons of mathematics can be encountered in areas as diverse yet related as combinatorial optimization (simulated annealing), neural networks (the logistic or sigmoid transfer function), statistics, and physics (e.g. as the Fermi probability distribution). We borrow from physics a generalization of the Sigmoid function, i.e. use the Fermi function. It scales the argument x by a factor of 1/T. The Fermi function describes an infinite number of functions with sigmoid shape, parametrized by T, which is commonly referred to as the "temperature" parameter.

// requires <math.h> for exp()
double Fermi( double x, double T ) {
    return 1.0/(1.0 + exp(-x/T));
}
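To get a feeling for the numbers, here is what the function returns for small PQ differences (a sketch; with T = 1, a lead of a single PQ unit already translates into a win expectancy of roughly 73%):

double p_equal  = Fermi(  0.0, 1.0 );  // 0.5    - evenly matched
double p_ahead  = Fermi(  1.0, 1.0 );  // ~0.731 - one PQ unit stronger
double p_behind = Fermi( -1.0, 1.0 );  // ~0.269 - one PQ unit weaker
// Note Fermi(x,T) == Fermi(x/T,1): doubling T halves the effective difference.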

 

In Figure 5, Fermi(x) is plotted for different values of T. Parameter values printed are T=0.1, 0.5, 1.0, 2.0, and 5.0, colors ranging from blue (low temperature) to red (high temperature). Each curve has the distinct S-shape, but you can see how lowering T brings us closer to the threshold function. For infinitely high temperature, Fermi(x) will flatten out to 0.5 for finite x. In other words, the higher T, the less a given difference in PQ will matter with respect to the outcome of a game. In a way, T determines the significance of the scale we use for our PQ. But for the moment, these values are arbitrarily chosen. Every positive T, every value in ]0,infinity[, will satisfy our requirements. For the time being, we will just set T = 1.

Figure 5: Fermi function for varying temperature

Estimation 101

We have defined PQ as our estimate of player strength, we have chosen a density function that we expect to describe our player population, and we have chosen a function to predict probable outcomes for a given player-player encounter. All that is left to do is to decide how we adjust PQ estimates when we learn that we are off the mark.

// result: 1 means I won, 0 means opponent won, 0.5 is a draw
const double T = 1.0;
const double max_gain = 1.0;

void ChangeRating( double &my_rating, double &opponent_rating, double result ) {
    double delta_rating = my_rating - opponent_rating;  // my lead in PQ
    double expectancy = Fermi( delta_rating, T );       // my expected result
    double rating_change = max_gain*(result - expectancy);
    my_rating += rating_change;
    opponent_rating -= rating_change;                   // zero sum
}

As an example, take a one-on-one deathmatch situation. For simplicity, we treat each single kill or "frag" as one rated encounter. Adjusting ratings per frag is likely to give the fastest convergence towards a good estimate, it will also work reasonably well for free-for-all deathmatch, and it avoids issues of normalizing game results (e.g. 2:1 frags vs. 20:10).

Beating your equal, you will get 50% of the maximum possible gain (0.5 out of 1). If you lose the match, your loss will be 50% of the maximum possible loss (-0.5 out of -1). Diminishing losses and gains occur for very high ranked players crushing bottom ranks, maximum gains and losses will be applied in the opposite case of a weak player beating a supposedly much stronger one. Take a look at the expression

expected_gain = (expectancy)*(1-expectancy)

This expression resembles the derivative of the Sigmoid function. It calculates my normalized expected gain: the probability to accomplish an improvement in my rating. The first factor is simply my likelihood of winning the encounter, while the second factor is the reward based on that likelihood: the more certain my victory, the smaller the reward. The expected gain will peak for expectancy 0.5. Keeping an eye on both expected risk and expected reward, the player is guided to search out their presumed equals.
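As a minimal illustration (the helper name is made up), the expected gain peaks at even odds and vanishes at the extremes:

// expectancy 0.5 -> 0.25, 0.9 -> 0.09, 0.99 -> ~0.01
double ExpectedGain( double expectancy ) {
    return expectancy*(1.0 - expectancy);
}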

No Pain, No Gain

The "pure" algorithm presented here differs from the Elo chess rating and the IBL Bolo rating in several ways. Most notably, we have set free parameters to 1 (e.g. T) or 0 (e.g. mean) whereever possible. This might not be justified for all such parameters. Take max_gain, the maximum gain per sample/encounter. We are free to choose max_gain=1. However, T and max_gain are not independent, and both combined express another expectation about our game.

How are these values connected? In a sense, for a given value of T and a given player distribution, max_gain depends on the amount of randomness in your game. Let's take two players with equal rank, who just met for the first time, and let's assume player A beat player B. If we are now confident that player A will thusly beat B every time, max_gain should be large enough to make sure that the new ratings for A and B reflect this. In other words, 0.5*max_gain should increase the difference in rating between the two enough to make sure that the new estimate for A beating B will be sufficiently near to unity.

Yet, we do not know what difference in PQ indicates certain victory. We would need the width of the distribution to determine this. For simplicity, set max_gain to 1 for starters and see how it works out.
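A quick simulation, reusing ChangeRating from above with T = 1 and max_gain = 1, shows how fast the ratings separate when the outcome is fully deterministic - a sketch, not a tuning recommendation:

double a = 0.0, b = 0.0;        // two new players, both at the average
for ( int round = 0; round < 10; round++ )
    ChangeRating( a, b, 1.0 );  // player A wins every single time
// After 10 straight wins, a - b is about 3.0, and Fermi(3.0, 1.0) already
// predicts a ~95% win expectancy for A; further wins change less and less.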

The Bell Tolls

It is Day One of our hypothetical Online Gaming Service, and the worldwide tournament has just started. Players have registered, and there have to be initial ratings. Usually, for purely psychological reasons, all players would get a positive, fixed rating to start with. But this is, essentially, just an offset. The bell-shaped distribution (as well as any real world population it could safely approximate) will be centered around the average strength, and whether or not we call that average zero or 10,000 is merely a matter of caressing egos - our algorithm only cares about the difference delta_rating.

In our database we set each player to zero rating on Day One. Over time, winners will increase their rating, while those who lose will end up with smaller (i.e. negative) values. As we set both T and max_gain to 1, the actual rating values (the width of the distribution) will develop based on the skill differences found among your players. If every player can beat every other player occasionally, the distribution will be comparatively narrow - all players will have ratings within the range where the Fermi function changes significantly. If, however, we can find players A, B, and C, where A will always beat B and C, and B will always beat C, then the width of the distribution will be larger - if B is an average player, then A and C will have ratings in the range where the Fermi function is asymptotic - near zero or near unity.

Thus the width of the distribution we found, relative to the temperature parameter T, measures how discriminating the game is - all with respect to a given set of players and their skills. We can only measure player skill relative to other players. The more players participate, the better our model assumptions will fit. In the low and high "tails" of the distribution our model will fail - in the real world, there is a limit on how badly a player can perform, and the better a player, the more likely she will join. There are density functions describing such distributions, but in most cases, there should be little to gain from adding this complexity to your model.

Now, after a couple of months, when every player has played sufficiently often against a sufficient number of opponents, your rating database will approach some kind of equilibrium. If there is not much change in the player population, your rating distribution will give you an increasingly good approximation of the range of player skills you are dealing with. Your model will also aid you in predicting the outcome of matches. The more randomness in your game, the less reliable the prediction for a single match - averaged over a large number of encounters, however, skill difference will eventually show, and the distribution of results for each player pairing can also serve as a measure of how discriminating your game actually is. You could also use this information to adjust max_gain for better convergence.

Figure 6: Probability density functions for normal Gaussian distributions (varying widths)


Closed System

Day One has come and gone, and your online tournament is up and running. Hopefully, more and more people will join the ongoing online battle. When assigning an initial PQ to each latecomer, you will have to keep in mind that our algorithm is designed to be a "zero sum game": one player's gain is always compensated by another player's loss. If the initial PQ value of every player is set to the average of the distribution, player migration into the system will not change the zero mean value.

The chess player association did not account for this. Consequently, when chess player numbers increased, the new participants decreased the ratings for even the top level players, a steady drain that makes comparisons across years unsatisfying guesswork. You could cheat a bit by assigning initial PQ values at random, as long as the average over all initial values is preserved. This might come in handy for setting up Day One - to get things moving, it might be a smart move to scatter ratings a lot initially, randomly assigned in a pre-registration lottery event.
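One way to scatter initial ratings without disturbing the zero mean is to hand out random offsets in symmetric pairs; a sketch (the function name and spread parameter are assumptions):

#include <stdlib.h>

// Assign random offsets in pairs (+x, -x): the sum over all initial
// PQ values - and thus the mean - stays exactly zero.
void ScatterInitialRatings( double pq[], int N, double spread ) {
    for ( int i = 0; i + 1 < N; i += 2 ) {
        double offset = spread*(rand()/(double)RAND_MAX);
        pq[i]   =  offset;
        pq[i+1] = -offset;
    }
    if ( N % 2 ) pq[N-1] = 0.0;  // odd player count: park one at the mean
}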

There is the inverse problem: players that do not play anymore, thus effectively removing themselves (and their share of points) from the system. In a zero sum, zero average rating system, any attempt to introduce a "rating decay" over time has to redistribute the rating points to all other players. Paradoxically, the decay does not decrease the PQ for all players - only for half of them:

const double decay = 0.1;       // per time period
pq -= decay*(pq - mean_value);  // mean_value is zero in our zero sum system

For an above average player that is not competing any longer, PQ will steadily approach the average rating, but never drop below. For a player with below average rating, however, not competing will increase the PQ till it approaches the average.
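Tracing the decay for an inactive above-average player makes this concrete (pq = 2.0, zero mean, decay of 0.1 per period):

double pq = 2.0;
for ( int period = 1; period <= 5; period++ )
    pq -= 0.1*(pq - 0.0);  // 1.8, 1.62, 1.458, 1.312, 1.181 - toward 0, never below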

If this seems unintuitive and unacceptable to you, remember that PQ is our estimate of player ability. If we do not have any data, be it because a player has just joined, or because a player is just not playing, then our best estimate is that the player has average strength - it is the most likely result, after all. Some rating algorithms attempt to avoid this, but there is no "patch" to combine two unrelated properties (the player strength estimate, and our confidence about it based on the amount of data we have accumulated) into a single number. Mixing data from different domains is difficult at best, and usually a bad idea.

Many gaming associations have adopted the Elo rating system, and often, parameters and formula are tweaked to account for various, often mutually exclusive, purposes. At its core, all rating starts with the mechanism outlined in this article. The more complications and patches you add on top of the pure solution, the less comprehensible the results.

To express a confidence in player ratings, it is better to have a second number that describes how frequently a player has participated in matches, and how many encounters are listed in our database. If necessary, rank displays can be filtered by such a "confidence level". Such a mechanism offers an incentive for participating often, and for participating over a long time, without imposing a penalty (e.g. a decay) for temporary absence.
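A minimal sketch of such bookkeeping; the record layout and the threshold of 50 encounters are illustrative assumptions, not requirements:

// Keep the skill estimate and the evidence for it separate.
struct PlayerRecord {
    double pq;          // zero mean skill estimate
    int    encounters;  // rated events sampled so far
};

// Only display a rank once enough data has accumulated.
bool HasConfidentRank( const PlayerRecord &p ) {
    const int min_encounters = 50;  // hypothetical confidence threshold
    return p.encounters >= min_encounters;
}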

It is inevitable that a rating changes its meaning over time. Our PQ is only defined with respect to the player population at a given time. If the community changes, so does the meaning of our PQ. Only if you take for granted that the players today are on average as capable as the players two years from now does comparing their ratings make sense - unless you decide to measure against a bot, an artificial opponent. Needless to say, you can't change the game or the rules either.

Of course, the next problem we are facing is one of user interface: What value do we want to display, when, and how?

Signing Off

Given an ideal Gaussian distribution, you can expect to find 68.3% of the players within the [average-width, average+width] interval - only 15.8% are supposed to be better, another 15.8% not as good as the players in this interval. This gives you a practical limit on how much dispersal you could allow for while keeping the overwhelming majority of players within a certain interval.

Why would you want to do so? Here we leave the realm of mathematics and venture into the area of psychology: the impact of negative numbers. Let's say the average PQ is 1500 points, and you have a width of about 500 points. In this case roughly 2.3% of the players will end up with ratings below 500 points, but some 0.15% might rate below zero, and they will not like it. Mind you, in a pure zero sum game with that bell-shaped distribution, about half of your customers will face negative ratings.

What to do then? Remember that condoms are sold as large, extra, and gigantic, but never tiny? You should not give up the zero sum, zero average PQ used internally, but you can offset the PQ for displayed rating values by some sufficiently large number. You will probably wind up multiplying the internal values with a large scaling factor as well.
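A sketch of such a display-only mapping, using the 1500-point average and 500 points per width from the example above (the internal PQ stays zero mean and zero sum):

// Map the zero mean internal PQ to a friendlier display value.
// One width of skill difference equals 500 display points here.
double DisplayRating( double pq, double width ) {
    const double display_offset = 1500.0;
    const double points_per_width = 500.0;
    return display_offset + points_per_width*(pq/width);
}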

Due to the shape of the normal distribution, any offset, no matter how large, will still cut into the low end "tail". There might always be some poor souls ending up below zero. Your game could simply map any rating below a certain positive threshold to a minimal "display" value (effectively hiding the real values), but there is an elegant alternative: fixed range rating.

Fixed Range Rating

The majority of your players will neither be very bad nor extremely good. Those who do not do well do not really want to know how far behind they are, and those who are outstanding play by different rules anyway. Both groups will have difficulties finding exact equals no matter what we do, and loss of precision will not change this. Thus, we can easily waive accuracy on the fringe (both low and high ratings), as long as our results are accurate and nearly linear where it counts - near the average, for the majority.

Remember that our rating is an estimate of the ability to succeed against another player. Our reference is a fictional "average" player. If we view the outcome for a match of a player with a given rating PQ against the player with average rating, we get the Fermi function again - a result in a limited [0,1] interval!

Voila, this is your fixed range rating. Multiply Fermi(PQ,T) by 10,000, and all your players will have ratings between zero and ten thousand. The average player will have a rating of 5,000. Inexperienced players will find themselves rated close to zero, outstanding players will approach 10,000. At those extremes, players would have to know many decimal places of their rating to obtain meaningful comparisons.
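The forward mapping itself is a one-liner, reusing the two-argument Fermi function from above (a sketch):

// Map the unbounded internal PQ onto the fixed [0, 10000] range.
double FixedRangeRating( double pq, double T ) {
    return 10000.0*Fermi( pq, T );
}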

One benefit of this representation is that the outcome of a game can be calculated without using exponential functions. Given two players with ratings R1 and R2 in [0, 10,000]:

float r1 = R1/10000.0f;
float r2 = R2/10000.0f;
// likelihood that player 1 wins
float p = 1.0f/(1.0f + (r2*(1.0f - r1))/(r1*(1.0f - r2)));

There is one problem with this approach, however. If the width of the distribution exceeds the interval in which the Fermi function changes significantly, then a large number of players will have visible ratings close to 10,000, and an equally large number will have near zero ratings. To fix this, you have to use a different value for T - the greater the width of the distribution, the higher. If you want about 95% of your players mapped to approximately 95% of the entire range (i.e. between 250 and 9,750 points), the appropriate display scale would be about 0.55 times the width, or approximately half the width.

// Width has to be calculated
// from the values in the database.
float width = PlayerDB.getWidth();
// We shoot for [250, 9750]
float max_rating = 10000.0f;
// Symmetry: guesstimate using Fermi(-2*width) == 0.025;
// the exact factor would be about 0.55 (see text)
float display_scale = 0.5f*width;
// This is 1/(1 + exp(-2*x/width))
float display_rating = max_rating*Fermi( internal_rating, display_scale );

Using this approach, the ratings you show to the players will always be between zero and 10,000, with the average rating being 5,000, and the vast majority of your players spread out over an interval of 9,500 different values. About 68% of your players will find themselves somewhere between 1,200 and 8,800, with about 38% between 2,600 and 7,300 - you get the idea. For the majority, the rating will depend nearly linearly on skill - no matter how much dispersal your internally used PQ values exhibit. Just keep in mind that the visible rating, and the scaling used for it, are completely separate from the internally used PQ, and the max_gain and T parameters.

 

Figure 7: Internet Bolo League rating adjustment (with permission by Joseph Lo)


Credit Assignment

To my knowledge, the first multiplayer game that adopted a similar mechanism to calculate post-game ratings was the Internet Bolo League, using a mechanism based on the Elo system. Figure 7 shows the effective gain/loss calculated by the Bolo rating algorithm. The Elo system, in turn, was devised by Arpad E. Elo for the United States Chess Federation in 1960, and is still in use. Details have been published in The Rating of Chessplayers, Past & Present by Arpad E. Elo (ISBN 0-668-04721-6, 1978). Recently, Netgames USA adopted a similar rating scheme.

The approach described above will give you running estimates of relative player skill, with an accuracy that depends on your game, and on the number of samples you obtain. Most of your work still lies ahead, and it touches upon design issues for both user interface and game mechanics. Remember that scores define objectives - your game's rules have just been changed.

If you are heading for a persistent, authoritative player database, you will obviously have a larger community to sample from, and you can obtain data from many servers over a long time period. The problem with this is verification of player identity, and cheat prevention. Persistent player IDs are required for accurate ratings, just as accurate ratings are required to guide server access based on skill. The same problems will affect databases that are maintained locally on a per-server basis. Per-session rating avoids all issues of persistency, but the estimates might not be good, as we start with a blank slate in every session. Inevitably, cheating is always an issue.

Then there is the user interface. Ideally, a game would indicate both the risk and the possible gain of a potential target early enough to aid the player in a decision. Much like in old arcade games, you could display the likely score (gain*risk) along with the target player's name, as soon as somebody sets their sights on him. Post-frag feedback will likely not suffice, yet pre-frag indicators (crosshair color and size, warning sounds) are of little use in dense, close quarter free-for-all games.

Incidentally, such high pressure, high speed games exhibit a distinct randomness. Any kind of performance statistics, be it rating or simple frag counting, faces a Credit Assignment Problem here: determining whose skill was instrumental in accomplishing a certain score. You might have to review your game design in an attempt to find more "atomic" events - e.g. counting applied damage instead of completed frags. Obviously, this will affect player tactics in yet another way. So would direction indicators that help players find a match.

Tournaments consisting of a sequence of one-on-one encounters (of which there could be several executed in parallel on a single server, as long as the map areas are separate from each other) solve the Credit Assignment Problem. Using the current estimates in selecting the next pairings should offer quick convergence, which means that near equal players (if any) will meet quickly. Finally, you can attempt to add handicaps based on skill estimates. Instead of trying to balance handicaps during playtesting, the game can try to determine proper handicaps by itself - guided by skill estimates, you might find ways to cut down trial-and-error.

Rating based skill estimates are not only useful for multiplayer. Single player games traditionally put the burden of estimating their own ability on the player, but we are moving towards adaptive game logic that responds to observed player performance during the game, and adjusts challenges and supplies accordingly. At its core, any such system will have to maintain a rating of player ability.

It is usually not a good idea to combine ratings from different game modes. Teamplay has different objectives, and mixing scores can only lead to confusion. Performance statistics based on simple criteria (hits/shots, damage done vs. damage received, average lifetime) work best, and largely avoid credit assignment issues. Rating groups of players as a team is closer to the information needed to select player team pairings.

At first glance, Elo-like rating might seem much too elaborate for online shooters. Yet, this is one aspect of a game easily done right, and the benefit might be huge. Multiplayer deathmatch games have advanced far into an era of progressive refinement, with the gameplay pretty much boiled down to its essence. The shelf lifetime of future competitive multiplayer games might well depend on their ability to aid the players - both the die hard and the casual gamer.

Bernd Kreimeier is a physicist, writer, and coder, working as senior programmer and project lead at Loki Entertainment. Previous work for Gamasutra and Game Developer Magazine includes Killing Games: A Look At German Videogame Legislation as well as the "Dirty Java" features on Using the Java Native Interface and Optimizing Pure Java. He can be reached at [email protected].
