
Word Games and the myth of ETAOIN SHRDLU

While most people are probably passingly familiar with the ETAOIN SHRDLU letter frequency list, not many know its origin or its flaws for building word games. This post examines some of the alternatives for word game designers.

Steven Stadnicki, Blogger

July 2, 2012


Most of us have probably come across the ‘ETAOIN SHRDLU’ mnemonic for the most common letters in English, and anyone who’s solved many cryptograms will likely be at least passingly familiar with a standard list of letter frequencies. For reference, here's the Wikipedia version of this frequency table:


E : 12.7%      T : 9.06%      A : 8.17%
I : 6.97%      N : 6.75%      S : 6.33%
R : 5.99%      D : 4.25%      L : 4.02%
U : 2.76%      M : 2.41%      W : 2.36%
G : 2.02%      Y : 1.97%      P : 1.93%
V : 0.98%      K : 0.77%      J : 0.15%
Q : 0.095%     Z : 0.074%





But if ETAOIN SHRDLU is so representative of letter frequencies, then why has Scrabble made the H worth 4 points - as many as D, L, and U combined? Why is S only the seventh most frequent letter when every Scrabble player knows how many words offer S-hooks? The short answer is that what's being measured by the usual letter frequency analysis isn’t what matters for most word game designers; for the long answer, we’ll have to dive into the dictionaries - at least a few of them.

What makes ETAOIN SHRDLU the wrong tool for the job of most word games is exactly what makes it the right tool for cryptanalysis: it measures the frequencies of the letters in English text. When you’re looking at an encoded quote this is just what you want, because you know the word ‘UIF’ that shows up in four places is more likely to be ‘THE’ than anything else. But when you’re trying to spell words on a letter grid, you’re not going to spell ‘THE’ a hundred times for every time you spell out ‘SET’, and you aren’t a thousand times more likely to know ‘WHICH’ than to know ‘SLIPS’. The letter ‘H’ gets a particularly disproportionate boost from counting words by their frequency, because ‘THE’ is the most common English word by an immense margin (more than half again as common as its nearest competition, ‘OF’) and because it shows up in so many other common pronouns and ‘glue words’: ‘THAT’, ‘HIS’, ‘HE’, ‘WITH’, ‘HAD’, and ‘HER’ are all near the top 20 words. Players will use these words, but they won’t use them every game, and they’re no more likely to know them than to know any of the other top ten to twenty thousand words. In other words, for at least the first several thousand words, a player’s vocabulary is essentially uniform: each word is about as likely to be known as any other.

So if ETAOIN SHRDLU isn’t the right tool to use for a word game, what is? If its problems are the result of weighting words by how common they are, then a natural first thought would be to go all the way in the other direction: give every word in the dictionary equal weight. Of course, this raises the question of which dictionary to use. Fortunately, there’s a canonical dictionary for word games: the official tournament word list for Scrabble, also known as TWL (the Tournament Word List), with the most recent version being TWL06. Tallying the letter frequencies across the full list of words in TWL06 yields the following table:
 

E : 11.5%      S : 9.48%      I : 8.86%
R : 7.10%      N : 6.74%      T : 6.57%
L : 5.34%      C : 4.05%      D : 3.46%
P : 2.94%      M : 2.83%      G : 2.75%
B : 1.90%      Y : 1.63%      F : 1.26%
K : 0.91%      W : 0.78%      Z : 0.48%
J : 0.17%      Q : 0.16%
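The difference between the two ways of counting can be sketched in a few lines of Python. This is only an illustration, not the script behind the tables above: the sample sentence and the function name `letter_percentages` are made up for the example, and a real run would read a large body of text and the full TWL06 word list instead.

```python
from collections import Counter
import re

def letter_percentages(words, weights=None):
    """Return {letter: percent} for an iterable of words.

    `weights` maps each word to how many times it should count;
    when it is omitted, every word counts exactly once.
    """
    counts = Counter()
    for word in words:
        weight = 1 if weights is None else weights[word]
        for letter in word.upper():
            counts[letter] += weight
    total = sum(counts.values())
    return {letter: 100.0 * n / total for letter, n in counts.items()}

# Hypothetical sample text standing in for a real corpus.
sample = "the cat that sat with the hat had set the slips on the mat"
tokens = re.findall(r"[a-z]+", sample.lower())

# Text-weighted, as in cryptanalysis: every occurrence of a word counts.
occurrences = Counter(tokens)
by_text = letter_percentages(occurrences.keys(), occurrences)

# Word-list-weighted: every distinct word counts once.
by_word_list = letter_percentages(set(tokens))

for letter in "HTS":
    print(f"{letter}: text {by_text[letter]:.1f}%  "
          f"word list {by_word_list[letter]:.1f}%")
```

Even in this toy sentence the repeated THE inflates H and T in the text-weighted tally, while S gains ground once each word counts only once.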

 

 

 

 

For comparison, here are a pair of plots showing the differences in frequency between the two tables:

 

[Figure: ETAOIN SHRDLU vs. TWL06 letter frequencies, sorted by ETAOIN SHRDLU frequency]

[Figure: ETAOIN SHRDLU vs. TWL06 letter frequencies, sorted by TWL06 frequency]


The results bear out our intuition nicely: notice how much less frequent H is in the full word list than it is in the weighted frequency chart, and how much more frequent S is. T and W are also big losers here, with W showing up only a third as often in TWL06 as it does in weighted text when it loses the boost in usage it gets from words like WHO, WHAT, WHY, and WE. The letter I also shows up much more often in TWL, possibly because of the number of verbs that can take an I as part of an ‘ING’ gerund suffix. All these changes lead to the order of commonality being markedly different for TWL than for analyzed text.
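The size of these shifts can be checked directly from the two tables. The snippet below is a small worked example using only the percentages listed above for S, I, T, and W; the helper name `frequency_ratios` is made up for the illustration.

```python
# Percentages copied from the two tables above: the Wikipedia text
# frequencies and the TWL06 word-list tally.
text_freq = {"S": 6.33, "I": 6.97, "T": 9.06, "W": 2.36}
twl06_freq = {"S": 9.48, "I": 8.86, "T": 6.57, "W": 0.78}

def frequency_ratios(new, old):
    """Return {letter: new / old} for letters present in both tables."""
    return {letter: new[letter] / old[letter] for letter in new if letter in old}

ratios = frequency_ratios(twl06_freq, text_freq)

# Print the biggest losers first and the biggest winners last.
for letter, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{letter}: {ratio:.2f}x")
```

W comes out at roughly a third of its text frequency and S at roughly one and a half times, in line with the description above.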

On the other hand, while the new letter frequency list is a substantial improvement over the more traditional one, it has its own shortcomings. In particular, it overcompensates: while the classic method counts WHICH for a thousand times as much as SLIPS, using TWL06 means that STORE counts exactly as much as SCYPHOZOANS; it puts too much weight on the letters that appear more often in longer and more exotic words that aren't likely to come up regularly in play. It would be fantastic if we could find some sort of magic list of ‘words that players are likely to know’, but of course not all players will share the same vocabulary.
 

Fortunately, there are many different lists of words available, sorted by the frequency of those words in the bodies of text (also known as corpora - singular corpus) that they’re derived from. I’ll focus on the two longest lists at http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists : a list of the most common words from the Open Subtitles project and a list of the most common words from Project Gutenberg.

Looking at the words that appear in only one of the two lists makes their different origins obvious: the most common words from the subtitle list that don’t appear in Project Gutenberg include several obscenities along with YEAH, OKAY, MISS, and MOM. By contrast, the most common words from the Gutenberg works that don’t appear on the subtitles list are EXCLAIMED, REMARKED, PAUSED, INQUIRED, and CHIEFLY. Looking at random samples from down the long tail of these lists shows the same thing; looking around the 15,000th word in the Gutenberg list you’ll find LOQUACIOUS, MULTIFARIOUS and SHEPHERDESS, while at roughly the same spot on the subtitles list you’ll find FINGERNAIL, MOTORCYCLES, and NIBBLE.

For purposes of modeling the words that players are likely to have on the tips of their collective tongues (for instance, for a more action-oriented word game where players won’t have too much time to think of individual words) I prefer the Open Subtitles list, but a game that rewards loquaciousness, where players have more time to think over their words, might do better modeling player behavior by the Project Gutenberg corpus.

Cleaning up the list by removing words with non-alphabetic characters (such as apostrophes) or that aren't legal dictionary words (for instance, proper names) and then doing a frequency analysis on the first 20,000 words of the resulting list produces this frequency table:

 

E : 11.4%      A : 8.23%      I : 8.02%
R : 7.38%      N : 7.29%      T : 6.49%
L : 5.24%      C : 4.16%      D : 3.98%
G : 2.99%      M : 2.86%      P : 2.79%
B : 1.88%      Y : 1.73%      F : 1.44%
V : 1.12%      W : 1.06%      J : 0.33%
Z : 0.25%      Q : 0.15%
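The cleanup steps described above might look something like the following sketch. It assumes the frequency list has already been loaded as words in most-common-first order and that the legal words are available as a set of uppercase strings; the function names, the tiny sample inputs, and the choice to count each kept word once are all assumptions made for the illustration, not details taken from the original analysis.

```python
from collections import Counter
from itertools import islice

def clean_frequency_list(ranked_words, legal_words, limit=20000):
    """Return up to `limit` usable words from a most-common-first list.

    Drops entries with non-alphabetic characters (apostrophes, digits,
    hyphens) and entries that are not in `legal_words`.
    """
    def usable():
        seen = set()
        for word in ranked_words:
            candidate = word.upper()
            if not (candidate.isascii() and candidate.isalpha()):
                continue  # e.g. "don't"
            if candidate not in legal_words or candidate in seen:
                continue  # e.g. proper names, or a repeated entry
            seen.add(candidate)
            yield candidate
    return list(islice(usable(), limit))

def letter_percentages(words):
    """Tally letters with every word counted once; returns {letter: percent}."""
    counts = Counter(letter for word in words for letter in word)
    total = sum(counts.values())
    return {letter: 100.0 * n / total for letter, n in counts.most_common()}

# Tiny made-up inputs; a real run would load the full lists from disk.
ranked = ["you", "i", "the", "don't", "john", "yeah", "okay", "Yeah", "mom"]
legal = {"YOU", "THE", "YEAH", "OKAY", "MOM"}

kept = clean_frequency_list(ranked, legal, limit=4)
print(kept)
print(letter_percentages(kept))
```

Filtering before applying the limit matters here: the cutoff is taken over the cleaned list, so discarded entries do not use up any of the 20,000 slots.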





So what does using the Open Subtitles list change? Here’s another chart, this time showing the differences between letter frequencies in TWL06 and in the first 20,000 words of the subtitles list:

 

[Figure: TWL06 vs. Subtitles letter frequencies, sorted by TWL06 frequency]

[Figure: TWL06 vs. Subtitles letter frequencies, sorted by Subtitles frequency]


The subtitle-based list reins in some of the more dramatic changes that using TWL06 brings: both S and I are somewhat less frequent compared to their appearances in TWL06, though both are still more frequent than they were on the traditional cryptographic frequency chart. Going the other way, D is noticeably more frequent on the subtitle frequency list, possibly because of the prevalence of past tenses in conversations. Overall, though, the two lists are clearly close, much closer than either is to ETAOIN.
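One rough way to put a number on "close" is to add up the absolute gaps between two tables, letter by letter. The sketch below does that with the percentages listed in the three tables above, restricted to the letters that appear in all three; the choice of measure and the name `total_difference` are assumptions made for this illustration, not something the charts themselves were built from.

```python
# Percentages from the three tables above, limited to the letters that
# are listed in all three of them.
ETAOIN = {"E": 12.7, "T": 9.06, "I": 6.97, "N": 6.75, "R": 5.99,
          "D": 4.25, "L": 4.02, "M": 2.41, "W": 2.36, "G": 2.02,
          "Y": 1.97, "P": 1.93, "J": 0.15, "Q": 0.095, "Z": 0.074}
TWL06 = {"E": 11.5, "T": 6.57, "I": 8.86, "N": 6.74, "R": 7.10,
         "D": 3.46, "L": 5.34, "M": 2.83, "W": 0.78, "G": 2.75,
         "Y": 1.63, "P": 2.94, "J": 0.17, "Q": 0.16, "Z": 0.48}
SUBTITLES = {"E": 11.4, "T": 6.49, "I": 8.02, "N": 7.29, "R": 7.38,
             "D": 3.98, "L": 5.24, "M": 2.86, "W": 1.06, "G": 2.99,
             "Y": 1.73, "P": 2.79, "J": 0.33, "Q": 0.15, "Z": 0.25}

def total_difference(a, b):
    """Sum of absolute percentage-point gaps over letters in both tables."""
    shared = sorted(a.keys() & b.keys())
    return sum(abs(a[letter] - b[letter]) for letter in shared)

print(f"TWL06 vs. Subtitles:  {total_difference(TWL06, SUBTITLES):.2f}")
print(f"ETAOIN vs. TWL06:     {total_difference(ETAOIN, TWL06):.2f}")
print(f"ETAOIN vs. Subtitles: {total_difference(ETAOIN, SUBTITLES):.2f}")
```

On these letters the gap between TWL06 and the subtitles list is a few percentage points in total, while each of them sits well over ten points away from the text-based table.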

So which list should you use? In the end, there's no one right answer; it depends on what words you expect your users to make. A fast-paced action word game might want to start with the subtitle-based list, while a slower game where users can search for words at their leisure might want to use the TWL06 frequencies. Even the original ETAOIN SHRDLU list might be correct for a game where players are building full sentences out of the letters and not just individual words. A game that limits the size of the words players can use might even want a different frequency list entirely, one built from only words of a specific length. And choosing an initial list isn't the end of the story, either; during playtesting you can record your players' letter frequencies, both the letters they use and the letters they leave behind, and adjust your frequencies based on the results. Still, these lists and the motivations behind them should serve as a solid starting point for anyone building a new word game.
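As one possible way to fold playtest data back in, the sketch below nudges a starting table up for letters that players tend to use and down for letters they tend to leave behind. The adjustment rule, the `strength` parameter, and the function name are all invented for this example; the right rule for a particular game would have to come out of its own playtesting.

```python
def adjust_frequencies(base, used, left_behind, strength=0.5):
    """Nudge base percentages toward the letters players actually use.

    base        -- {letter: percent} starting table
    used        -- {letter: count} of letters players spelled words with
    left_behind -- {letter: count} of letters players left unused
    strength    -- 0 keeps the base table; 1 allows the largest swing
    """
    adjusted = {}
    for letter, percent in base.items():
        u = used.get(letter, 0)
        unused = left_behind.get(letter, 0)
        seen = u + unused
        # Share of this letter's appearances that got used; a neutral
        # 0.5 is assumed when there is no data for the letter.
        rate = u / seen if seen else 0.5
        # A rate of 0.5 leaves the weight alone; higher raises it,
        # lower cuts it, by at most `strength` either way.
        adjusted[letter] = percent * (1 + strength * (2 * rate - 1))
    total = sum(adjusted.values())
    return {letter: 100.0 * w / total for letter, w in adjusted.items()}

# Made-up two-letter example: E is almost always used, Q almost never.
base = {"E": 50.0, "Q": 50.0}
used = {"E": 9, "Q": 1}
left_behind = {"E": 1, "Q": 9}
print(adjust_frequencies(base, used, left_behind))
```

Keeping `strength` small makes each round of playtesting a gentle correction rather than a replacement for the starting list.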
