So, two things:
Last week, Valve announced Steam Labs, a new initiative where Valve pulls back the curtain on various projects they're working on internally but that aren't quite ready to be rolled out publicly just yet.
Given the timing, I decided to go ahead and release a janky half-finished prototype of a little weekend project I had cooked up called Steam Diving Bell. You can play with it here. Just please don't hug my server to death.
So now that my little project is out there, I'd like to talk a bit about it and Steam Labs in general.
Diving Bell is an experiment meant to address discoverability on Steam. It serves a similar purpose to Steam Labs' Interactive Recommender, which is a really neat machine-learning based recommendation engine you can read all about here. I've tried it myself, and it really works -- it's an incredibly neat piece of tech.
So what the heck is Diving Bell, who is it for, and how is it different -- why would anyone want to use it if we already have the fancy interactive recommender?
All great questions.
What is it?
It's a (prototype) web app for quickly discovering interesting games on Steam.
Who is it for?
Anyone who wants better discovery for games on Steam. This means players (who want to find games), but also developers (who want their games to be found). But let's not forget that curators need good tools, too. Human-powered curation stands to benefit from better tools that make it easy to quickly find the games you want to showcase and talk about.
How is it different?
Diving Bell and the Interactive Recommender take entirely opposite approaches:
Interactive Recommender uses your play history to get to know you, and uses smart algorithms to serve up games it thinks you will like. You specify a few parameters, and it shows you a list of recommendations. Interactive Recommender is like a sommelier that uses their expertise to suggest a wine that pairs well with the courses you've already chosen.
Diving Bell has no clue who you are or what you like, and uses dumb algorithms to serve up games similar to a title you specify. From there you can browse around in any direction you want. Diving Bell, like its namesake, is a vessel that lets you safely descend into the murky depths to catch glimpses of weird and interesting
The aesthetic I'm going for is "wikipedia binge." You start with some topic, then you click on links within that topic that seem interesting, and before you know it you find yourself following some totally weird but fascinating bunny trail you never expected you'd go down.
Let's start with a guided tour. You can tell Diving Bell to start with a specific game by adding "?appid=XYZ" at the end of the URL (sans quotes), where XYZ is a specific game's Steam application id. Let's start this plunge with Chrono Trigger:
Into the Depths
(There'll be a brief pause at the beginning while it bootstraps; all subsequent loads should be faster.)
Chrono Trigger is our selected game. Diving Bell serves up 8 games that it thinks are similar. There's an information panel (cropped from the screenshot for space purposes) that tells us more about the game, and includes screenshots, trailer, etc, and then there's some navigation on the bottom of the main panel: "Back", "More", and some mysterious blue buttons.
Clicking "More" will serve up another 8 recommendations, while keeping Chrono Trigger centered. At that point clicking "Back" will take us back a step and show us the previous recommendations. As for the blue buttons, these represent recommendation engines and can be individually toggled on and off. Right now all four are selected, and each corresponds to two of the currently visible recommendation results. I'll explain each of the recommendation engines with illustrations below. First, let's turn all four of them off:
These are "Default matches", and they should feel familiar if you've visited Chrono Trigger's Steam page, because I got them by scraping Steam's "More Like This" section.
For every game on Steam, there is a "More Like This" page, and it has exactly 12 games. The explanation Steam offers for how it makes these matches is:
"The tags customers have most frequently applied to CHRONO TRIGGER® have also been applied to these products"
...but I just treat it as a black box. The matches are solid, but tend to be familiar games that are already popular.
The first iteration of Diving Bell used nothing but "More Like This" matches for each game, because the first issue I was attacking was a UX problem: Let's say you want to browse more games like Chrono Trigger, then browse more games like those games, then visit one of those games' store page.
Here's how you do that currently:
- Visit Chrono Trigger's Steam Page.
- Scroll down way below the fold to "More Like This" and click a tiny button that says "See All"
- The page reloads.
- Find a game you seem interested in (Grandia II?) and click it.
- The page reloads.
- Scroll all the way down to "More Like This" and click "See All"
- The page reloads.
- Find a game you seem interested in (The Legend of Heroes: Trails in the Sky?) and click it.
- The page reloads.
That's 4 clicks, 4 full page reloads, and 2 scrolls (4 scrolls if you click a recommendation in the bottom row on the "More Like This" page). In Diving Bell, this same journey takes 2 clicks, zero page reloads, and zero scrolls. Granted, my prototype web app is a total potato and the async requests take longer than I'd like to fill in, but with a real database and some optimization there's no reason those couldn't be nearly instant.
Just by changing the UX I think we've already improved on the browsing experience of finding more games like Chrono Trigger. But there's a problem: the default "More Like This" recommendations are a bit too good.
"Too good?" What? How could that be a problem?
Obviously I'm using "good" a bit facetiously. What I really mean is they're too on-the-nose for the browsing experience I have in mind.
Take a look at Chrono Trigger's 12 default recommendations:
Now compare those to Grandia II's:
There's a ton of overlap, which means even with the improved UX you'll constantly loop back onto things you've already seen, and never get too far from the original game's center of gravity. Maybe that's what some people want (and it should certainly be an option) but a discovery tool meant for general use can do better.
Maybe we can cull results we've already seen? That could work, but we still only have 12 recommendations for each game, and with this much overlap we'll hit dead ends in no time. We need a way to expand the pool.
Here's an idea -- we have this cool network of game connections from these "More Like This" pages, but what if we reverse the direction of the matches?
We've already established that every game on Steam points to 12 other games in its "More Like This" section. But what if instead of looking for the 12 games pointed to by Chrono Trigger, we crawl every single game on Steam and see how many games themselves point to Chrono Trigger as one of their 12 games? Let's call that a "reverse match."
Now instead of 12 games, we have hundreds or more. Now we have to sort them so we can decide which 8 to show first. I went with a tag similarity heuristic which I'll describe later, but all the results are viewable -- the user can click "more" to see the next 8 until all the reverse matches are exhausted.
Whereas default matches favor genre kings, reverse matches favor niche games. That's because every game in a genre tends to point to the genre kings, but the genre kings don't point back to the niche games. This recommender flips that dynamic.
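As a rough sketch of the idea (not Diving Bell's actual code), reversing the matches amounts to inverting a directed graph. The dictionary layout, the function name `build_reverse_matches`, and the toy game names below are all assumptions made for illustration; real "More Like This" lists hold 12 games each.

```python
from collections import defaultdict


def build_reverse_matches(more_like_this):
    """Invert the graph: for each game, list the games that point TO it."""
    reverse = defaultdict(list)
    for source, targets in more_like_this.items():
        for target in targets:
            if target != source:  # ignore self-links, if any exist
                reverse[target].append(source)
    return dict(reverse)


# Hypothetical toy data: each key points to its "More Like This" games.
more_like_this = {
    "Genre King": ["Other King"],
    "Other King": ["Genre King"],
    "Niche A": ["Genre King", "Other King"],
    "Niche B": ["Genre King", "Niche A"],
}

reverse = build_reverse_matches(more_like_this)
king_reverse = reverse["Genre King"]       # every game that points at the king
niche_reverse = reverse.get("Niche B", [])  # nobody points at Niche B
```

In this toy graph the genre king collects reverse matches from every other game, while "Niche B" gets none, which is the same asymmetry described above.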
We see Grandia II and The Legend of Heroes: Trails in the Sky (themselves somewhat niche cult classics when compared to Chrono Trigger), but we also see some great, well-regarded indie titles like Cthulhu Saves the World, Cosmic Star Heroine, and Epic Battle Fantasy 4. This gives us a much broader network to crawl -- wikipedia binge here we come! Let's click on "Cosmic Star Heroine" and see where that takes us.
Hmm, here's a problem. Cosmic Star Heroine, despite being a great game with a lot of similarities to Chrono Trigger, only has seven reverse matches. This is because most similar games have already spent their 12 slots on genre kings. Diving Bell will fill in the gaps with Default recommendations, but we still need more fodder for general browsing.
This is where LOOSE matches come in.
Loose matches crawl the "more like this" graph for the selected game twice. We get the 12 default matches, and then we grab each of those games' 12 recommendations for a final list of 144 matches. Then we exclude the original default 12 matches from the results as well as any duplicates. This gives us a list of games that are still pretty similar to the selected game, while adding just enough noise to juice the variety a bit.
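The two-hop crawl can be sketched like this. It's a minimal illustration, not the prototype's real code: the function name `loose_matches` and the toy graph are hypothetical, and also excluding the selected game itself from its own results is an assumption the text doesn't spell out.

```python
def loose_matches(game, more_like_this):
    """Collect second-degree matches, minus the default matches and duplicates."""
    defaults = more_like_this.get(game, [])
    # Assumption: the selected game itself is excluded along with its defaults.
    excluded = set(defaults) | {game}
    seen = set()
    result = []
    for first_hop in defaults:
        for second_hop in more_like_this.get(first_hop, []):
            if second_hop in excluded or second_hop in seen:
                continue
            seen.add(second_hop)
            result.append(second_hop)
    return result


# Hypothetical toy graph (real lists hold 12 games each).
graph = {
    "selected": ["a", "b"],
    "a": ["selected", "b", "c"],
    "b": ["a", "c", "d"],
}
loose = loose_matches("selected", graph)

# A tightly self-referential cluster leaves nothing after the exclusions.
clique = {"k1": ["k2", "k3"], "k2": ["k1", "k3"], "k3": ["k1", "k2"]}
clique_loose = loose_matches("k1", clique)
```

The second example hints at what happens when a game's matches mostly point at each other: the second hop keeps landing on games that are already excluded.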
Loose seems more middle-of-the-road than Default and Reverse: it gets a good number of matches, but it doesn't exclusively favor big games, nor does it dig too deep to shine light on niche ones.
The number of games it returns varies too. For indie titles, it returns a lot -- we'll have more than enough titles here for Cosmic Star Heroine. But let's go back to Chrono Trigger for a second:
This is really interesting. Chrono Trigger is only able to give us six unique loose matches! Now there's a good chance this is just a stupid bug, but I also suspect this is at least in part because of how self-referential the "more like this" network is for genre kings. The 12 default matches reference each other to such a strong degree that even after you generate a pool of 144 second-degree matches, you only have 6 unique matches once you've excluded the default 12 and any duplicates. And even if these particular results are just down to a bug, we know from before that there's tons of overlap in big games' loose matches, and therefore fewer results overall.
This underscores the need for a variety of recommender engines. Each one so far has a different natural strength:
- Default: small number of matches no matter what, favors genre kings
- Reverse: big game = many matches, niche game = few matches, favors niche
- Loose: big game = few matches, niche game = many matches, neutral(ish)
These three recommendation engines alone probably provide enough variety, texture, and depth to the network to give us that wikipedia binge feel we're after. But we're not done yet! There's room for more.
At this point we're leaving the "More Like This" results entirely behind and will generate new recommendation systems from scratch.
The first of these is the "Tags" recommender. It returns 8 games that Diving Bell considers to be similar to the selected title based entirely on their tags. This tends to favor niche games over popular and well rated ones because the only thing it looks at is the tags.
Some context: every game on Steam has a series of tags that describe different aspects of a game. There's some tags for genre like "RPG" and "Action" and "Platformer", some things that seem to describe visuals like "2D", "Pixel Graphics", and even "Beautiful", as well as random nouns and adjectives like "Werewolves" and "Psychedelic." You can see a complete list here.
Tags are a pretty messy system and the first iteration of my tag-based recommendation engine returned awful results. After a few tweaks, I settled on a decent approach and made it completely transparent to the user. Just hover over any game matched by tags and you'll see a breakdown of how it calculates the score.
What I did here was to take all the Steam tags and group them into various categories (RPG and Adventure go under "Genre", Sci-fi and Retro go under "Theme", JRPG under "Subgenre" and so forth). Then when matching games I go through each category and count how many tags the second game has in common with the first in that category. Then I multiply that count by the category's weight -- for instance, I consider a subgenre match more important than a genre match, and the viewpoint and visual categories more important than the "misc" category. Then I add up all those scores and divide by a theoretical perfect score (where every category matches perfectly) to get a percentage.
This classification scheme is completely arbitrary and reflects my own subjective biases about what matters, but it seems to do the trick. I suspect that the mere act of breaking things down and applying some weights is more important than the exact set of categories and weights you choose -- just anything to get you away from comparing two naked lists of tags in a naive way.
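To make the category-weighted scoring concrete, here is a minimal sketch. The category assignments for RPG, Adventure, JRPG, Sci-fi, and Retro come from the description above; everything else is an assumption for illustration: the specific weight values, the "visual" assignments, the function names, and the reading of "perfect score" as "every tag of the selected game is matched."

```python
# Hypothetical tag-to-category map; unknown tags fall into "misc".
TAG_CATEGORIES = {
    "RPG": "genre", "Adventure": "genre",
    "JRPG": "subgenre",
    "Sci-fi": "theme", "Retro": "theme",
    "2D": "visual", "Pixel Graphics": "visual",
}
# Hypothetical weights: subgenre > genre, visual > misc, as the text suggests.
CATEGORY_WEIGHTS = {"subgenre": 3.0, "genre": 2.0, "theme": 1.5,
                    "visual": 1.5, "misc": 1.0}


def group_by_category(tags):
    """Bucket a flat list of tags into {category: set_of_tags}."""
    grouped = {}
    for tag in tags:
        grouped.setdefault(TAG_CATEGORIES.get(tag, "misc"), set()).add(tag)
    return grouped


def tag_similarity(selected_tags, candidate_tags):
    """Return (percent match, per-category breakdown of shared/total tags)."""
    selected = group_by_category(selected_tags)
    candidate = group_by_category(candidate_tags)
    score = perfect = 0.0
    breakdown = {}
    for category, tags in selected.items():
        weight = CATEGORY_WEIGHTS[category]
        shared = len(tags & candidate.get(category, set()))
        breakdown[category] = (shared, len(tags))
        score += weight * shared
        # Assumption: a "perfect" candidate shares every tag in every category.
        perfect += weight * len(tags)
    percent = 100.0 * score / perfect if perfect else 0.0
    return percent, breakdown


selected = ["RPG", "JRPG", "Retro", "2D", "Great Soundtrack"]
candidate = ["RPG", "JRPG", "Sci-fi", "2D"]
percent, breakdown = tag_similarity(selected, candidate)
```

The `breakdown` dictionary is the kind of per-category detail a hover tooltip could show, so the user can see which categories contributed to the score.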
This recommender can be hit and miss, dependent as it is on the notoriously mixed quality of the tags placed on any given game. But this recommender is still capable of producing some really solid matches:
I'm torn on whether to actually display the X% match scores on tag results (or just use them internally for ranking), but I think it needs to stay in some form because this matching mode returns a lot of results. It starts by taking a subset of games that have at least one matching tag in a major category and then ranks them all. This can potentially return hundreds or even thousands of results, and after several pages in you're going to get some really weird stuff that's not similar at all. I could just hard cull results below some score threshold, but I prefer to let the user keep exploring and just give them accurate information about how sloppy the current results are. One thing I think I'll change based on feedback is the exact number % I display. Although 68% is a pretty good match score, school has trained us to read this as a "failing" grade, so I might artificially inflate all the scores to compensate.
In summary -- tags don't care about bigness or popularity, they only care about similarity (as defined by tags). Not all games are well-tagged, and the matches can be noisy. But Diving Bell thrives on noisy results, so this is fine!
But there's still room for at least one more recommendation engine.
Hidden Gem Matches
This is my favorite recommendation engine. You might have seen Steam250.com's list of Hidden Gems, or read my article from five years ago proposing such a system. In either case, the idea's the same -- you find games that a) have a low # of total user reviews and b) have a very high user rating. Then, you rank them by a sensible algorithm, adding a penalty to anything with too many user reviews total. What you're left with is a list of extremely well regarded games that haven't gotten much attention -- i.e., "hidden gems."
Diving Bell's "Hidden Gem" recommender is derived from the tag recommender, but instead of starting with a pool of games that is basically everything on Steam, I tell it to only consider the top slice of a "hidden gems list." Then I rank the results by their tag similarity to the selected game.
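The two-stage shape of this recommender (take the top slice of a gems list, then re-rank by tag similarity) can be sketched as follows. This is only an illustration under loud assumptions: the `gem_score` formula is a placeholder and is neither Steam250's nor Diving Bell's ranking, plain Jaccard overlap stands in for the weighted tag score, and the `Game` fields, parameter names, and toy catalog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    name: str
    positive_ratio: float  # fraction of positive user reviews, 0..1
    review_count: int
    tags: frozenset


def gem_score(game, review_scale=500):
    """Placeholder gem ranking: high rating, damped as review count grows."""
    return game.positive_ratio / (1.0 + game.review_count / review_scale)


def jaccard(a, b):
    """Stand-in tag similarity: shared tags divided by all tags."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hidden_gem_matches(selected, catalog, pool_size=2, limit=8):
    """Top slice of the gems list, re-ranked by similarity to `selected`."""
    candidates = [g for g in catalog if g.name != selected.name]
    gem_pool = sorted(candidates, key=gem_score, reverse=True)[:pool_size]
    ranked = sorted(gem_pool, key=lambda g: jaccard(selected.tags, g.tags),
                    reverse=True)
    return [g.name for g in ranked[:limit]]


selected = Game("Selected", 0.95, 20000, frozenset({"RPG", "JRPG", "2D"}))
catalog = [
    Game("Big Hit", 0.97, 50000, frozenset({"RPG", "JRPG", "2D"})),
    Game("Quiet RPG", 0.96, 120, frozenset({"RPG", "JRPG"})),
    Game("Quiet Puzzle", 0.98, 80, frozenset({"Puzzle", "2D"})),
    Game("Meh", 0.55, 60, frozenset({"RPG"})),
]
gems = hidden_gem_matches(selected, catalog)
```

Note that "Big Hit" never appears despite being a perfect tag match: it is filtered out at the gems stage before tag similarity is even considered.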
The results are the least on-the-nose matches of the four recommenders, but often the most surprising and delightful. They're at least vaguely similar to the selected game, usually in the same or adjacent genre, and guaranteed to be well regarded titles most people haven't heard of yet. If you like playing cool obscure stuff that hasn't gone big yet, this is the tool for you.
Because this is a derivative of the tag recommender, it shows the same tooltip and score %, which I think is probably the wrong decision. I think it's okay to show the breakdown, but hidden gems by their very nature are going to get lower tag % scores than pure tag matches. I'll probably either remove the % score heading for gems entirely (but keep the tooltip breakdown), or else give hidden gems a bump in their score based on their hidden gem ranking, so they can compete on the same level as tag-based matches. I dunno, we'll see.
Putting it All Together
Okay, turning all four recommenders back on, this is what we see:
The reverse matches and loose matches give us a mix of niche and well known RPGs, and all Japanese to boot -- just like Chrono Trigger. The tag and gem matches give us a mix of Japanese, American, and European indie titles. Clicking more will let us dive deeper into results from our current position, and clicking any specific game will let us branch out in a new direction. Whether we explore broadly by clicking a new game and going off on a bunny trail, or deeply by clicking "more" to page through the current matches, we're sure to find something interesting.
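One way to picture how a page of results comes together: each enabled engine contributes two results, and default matches fill any gaps. The sketch below is an assumption about how that could work, not a description of the prototype's code; the function name, parameters, and placeholder game ids are hypothetical, and paging via "More" is left out for brevity.

```python
def build_page(ranked_by_engine, default_matches, enabled,
               per_engine=2, page_size=8):
    """Take `per_engine` results from each enabled engine, then pad with defaults."""
    results = []
    for engine in enabled:
        for game in ranked_by_engine.get(engine, [])[:per_engine]:
            if game not in results:
                results.append(game)
    # Fill whatever slots remain with default matches not already shown.
    for game in default_matches:
        if len(results) >= page_size:
            break
        if game not in results:
            results.append(game)
    return results[:page_size]


# Hypothetical ranked results per engine (placeholder ids, not real games).
ranked = {
    "reverse": ["r1", "r2", "r3"],
    "loose": ["l1"],            # this engine came up short
    "tags": ["t1", "t2"],
    "gems": ["g1", "g2"],
}
defaults = ["d1", "d2", "r1", "d3"]

all_on = build_page(ranked, defaults, ["reverse", "loose", "tags", "gems"])
all_off = build_page(ranked, defaults, [])
```

With every engine toggled off, the page falls back to default matches alone, which matches the behavior shown earlier in the tour.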
Okay, our crappy prototype is done! Now, let's consider whether it can be gamed, and evaluate its strengths and weaknesses in comparison to the Interactive Recommender.
Can it be gamed?
Possibly. The developer has no direct control over their "more like this" matches, but they do have control over the initial set of tags they put on their game at launch, which directly affects the "Tags" recommender results and indirectly shapes the "more like this" matches that drive Default, Loose, and Reverse matches. It's possible to pick out some specific super popular game and then give your game the exact same set of tags, so that it shows up as a 100% match. The risk is that if the chosen tags aren't accurate, players who feel misled could refund the game and leave negative reviews. Also, once a game has been out for a while, players will apply their own tags that eventually outweigh the developers'.
If this becomes a problem where everyone pretends to have tags exactly matching Dark Souls, destroying the usefulness of the "Tags" recommender, I'll probably have to add some other heuristic to how I rank tag matches, either throwing in some randomness, or applying a small penalty to games that are on-the-nose matches but have only developer-set tags, similar to how SteamDB applies uncertainty to user rating rankings. Or I could factor in user ratings a bit. But that's another can of worms.
Another way bad actors can try to game the system is by forging user reviews to get on the Hidden Gems list (or any other recommender that cares about user ratings). Steam has put some effort into combatting forged user reviews, but it's a neverending game of cat-and-mouse. Chief among their efforts is the fact that they don't use user reviews as a significant internal signal for surfacing games. In short, even if your game's user rating is super high, it doesn't vault you to the front page the way it might on Amazon or Yelp, where review fraud is rampant. Instead -- and I have this directly from the mouth of Alden Kroll at Valve -- the only value a user rating has in algorithmic discovery is whether a game's rating is positive or not. All positive games get the same lift, all non-positive games don't. That's it. (Incidentally, this is how the current version of Diving Bell uses user ratings for all recommenders except for Hidden Gems: I exclude poorly rated games below a certain threshold from consideration so that it doesn't take forever to generate results).
Now, even without an explicit algorithmic boost there is a concrete benefit to higher ratings, because humans who see "Overwhelmingly positive" vs. merely "Positive" are more likely to click the former. This likely has knock-on effects on other metrics that The Algorithm(TM) does care about. But the point is that the system doesn't directly care about user ratings, avoiding a direct causal financial incentive to game user ratings. Diving Bell would create a direct relationship between higher user ratings and increased visibility, at least in the case of Hidden Gems. So there's a risk that recommender could run afoul of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
The hope is that both user rating and tag fraud could be met with counter-measures, and subject to self-correcting forces. When you get increased visibility you get more sales, but you also attract more feedback in the form of user tags and user ratings. And it won't take much to sink a counterfeit gem or to strip off deliberately misleading tags. And if Valve can prove you committed fraud, you risk losing your developer account (though that must be balanced against the risk of dropping the ban hammer on a false positive).
There's no way to know for sure until something like this is deployed at scale. The good news is there's plenty of room for tweaking not just the recommendation modules, but also ranking heuristics and filters that directly mitigate the efforts of trolls. Additionally, a human content review team could use a special version of Diving Bell with custom filters to quickly find possible cases of abuse, investigate them, and respond accordingly.
With that out of the way, let's assess the more mundane strengths and weaknesses of Diving Bell.
Diving Bell's chief weakness is that it's a dumb, slow prototype. It also probably won't hold up if a lot of people try using it at once. That's not a fundamental flaw, and a more robust implementation could easily address these issues.
On the design side, it needs basic filtering and options to make sure to include X, exclude Y, or avoid stumbling across NSFW content. And it should be easier to specify a particular title as the starting point.
Implementation details aside, Diving Bell is never going to be as good as the Interactive Recommender at immediately serving up a dozen great games you want to play right now. This is more of an exploration tool to find things you didn't even know you were interested in.
Diving Bell is highly reliant on extrinsic metadata like tags and to a lesser extent user ratings, whereas Interactive Recommender only cares about player behavior. Diving Bell needs user ratings to define Hidden Gems, and needs tags for everything else. The "Tags" and "Gems" recommenders consume tags directly, but "Default", "Reverse" and "Loose" need them too, because the "more like this" recommendations on Steam are ultimately derived from tags. This requires developers to optimize their store metadata in order to be detected, and opens a possible vector for abuse as described above.
Diving Bell's chief strength is its dumbness. It's dead simple, transparent, and predictable, but the results are delightful and surprising. You have complete control over the recommendation systems you want to use, and there's no mystery for why you're getting any particular results. The app doesn't try to pigeonhole you based on previous play history or purchasing habits -- given the same inputs, two different users will get the exact same results. You just give it a game and it tells you what games are similar to that one, and off you go.
Diving Bell is also modular by design. If we come up with better recommendation engines (perhaps even a derivative of the Interactive Recommender's results), there's no reason that couldn't be slotted right in alongside the others.
Also, Diving Bell might have a slight edge on the Interactive Recommender when it comes to the "cold start" problem. From Steam's own blog post on that subject:
New games in a system such as this one have a chicken-and-egg effect known as the "cold start" problem. The model can't recommend games that don't have players yet, because it has no data about them. It can react quite quickly, and when re-trained it picks up on new releases with just a few days of data. That said, it can't fill the role played by the Discovery Queue in surfacing brand new content, and so we view this tool to be additive to existing mechanisms rather than a replacement for them.
Let's see how Diving Bell fares with a quick test. Here's a brand new game called "Break the Game" that just released on Steam today -- seems like some indie horror title -- and here's a snapshot of how its "More Like This" page looks today, the very same day it launched, July 15, 2019:
Despite the fact that the game probably launched a few hours ago and has no user ratings yet, the developer was savvy enough to fill out a full list of tags for it:
This means that the game should immediately show up in three out of the five recommendation systems that Diving Bell uses -- Loose, Reverse, and Tags. Or, it will in a final version of the app -- you definitely won't find it in the prototype version of Diving Bell, so don't bother looking. That's because the prototype uses static files scraped from Steam several weeks ago and I don't yet have a script running on a server to update those results every day. But I'm confident even brand new titles like "Break the Game" would show up somewhere in the network if I did.
This game already has 12 "more like this" links, which means a finished Diving Bell app would return "Default", "Loose" and "Reverse" results, and it has a full set of tags, which give it a fair shot of matching all sorts of other games on Steam based purely on those tags. And if it manages to attract a niche audience with a high rating, it might even become a hidden gem.
This does mean that it's on the developer to properly tag their game prior to release, but I think most devs would agree that's a fair price to pay for a shot at improved visibility. Come to think of it, this might even suggest another recommender, or at least some kind of filter for Diving Bell -- show me all new and recent games, and rank them by similarity (or whatever) to some other game I already like. That could do wonders for discovery.
I think a tool like Diving Bell could make a great complement to the Interactive Recommender -- they take completely opposite approaches and are suited to different tasks and moods, but taken together they make bold steps to improve discovery on Steam for players, curators, and developers alike.
So Valve, if you're reading this, I'd be more than happy to make this one of the next experiments in Steam Labs :)