After five years of working on different data teams in the game industry, I recently decided to join a startup in the finance industry. While many data teams were allocated large amounts of resources, it was usually challenging to actually put findings into production and have customer-facing impact. Based on my experience at Twitch, Electronic Arts, and Daybreak Games, I have the following recommendations for leaders of data science teams:
- Set realistic expectations for the Data Scientist role
- Provide a forum for sharing results
- Define a process for productizing models
- Encourage data scientists to submit PRs
- Avoid Death by MBR
- Provide a roadmap for how to level-up as an IC
- Set up tooling for reproducible research
I’ll discuss each of these points in more detail below and highlight examples from my experience.
Set realistic expectations for the Data Scientist role
It’s important to set clear expectations for the types of projects and day-to-day work that a data scientist will be performing on your team. Data science is a broadly used term, so define the role clearly to candidates, ideally during the first phone interview. It’s okay to be a bit forward-looking when talking about upcoming projects, but it’s also necessary to be realistic about the type of work your team performs today. If the majority of your team’s time is spent on tasks such as dashboarding, data engineering, and writing tracking specs, then there may be little time left over for modeling and building data products. If you sell a candidate on interesting modeling work and most of their time ends up spent elsewhere, you may have trouble retaining data scientists.
The best example I saw of this in the game industry was when a candidate was interviewing for a role on the data science team at Electronic Arts, which focuses on central projects rather than partnering with game studios. The candidate was more interested in embedding with a game team and based on this feedback was hired onto one of the embedded analytics teams.
Provide a forum for sharing results
Data scientists should have a platform for sharing their work, both within the company and externally. Ideally, data scientists should be working closely with a product team and providing updates to the leads on the product team on a frequent basis. It’s also useful to have channels for the data science team to broadcast their work to a wider audience.
At Twitch we accomplished this by writing long-form reports and sending them out to a broad email distribution list. This approach worked well, but it required research reports to meet a high bar before being shared, and not all relevant parties were on the distribution list. At Electronic Arts, cross-team sharing of results was usually accomplished through meetings and internal events, such as EA Dev Con. All of the game companies I worked at were open to presenting work externally at conferences, such as GDC and useR! My recommendation is for data science teams to have a platform for sharing results that doesn’t require meetings, and doesn’t take the form of a deck.
Define a process for productizing models
One of the biggest challenges I faced as a data scientist in the game industry was getting predictive models put into production. It may be possible to build a churn prediction model for a game within a relatively short time frame, but it can take months of coordination with multiple teams to make the model customer facing, such as sending emails to users who are likely to churn. I previously blogged about some of the approaches the science team at Twitch applied to productize models, but each of these projects was a one-off, and there was no blueprint for how a new data scientist would go about working with a product team to scale up a model. Productizing models has been problematic for most teams I worked on, because the data science team usually had no ownership of the data pipeline, and often didn’t have access to many of its components.
To improve this process, data science teams should define a clear process for how to hand off or own a model in production, so that new data scientists don’t have to invent their own approach. It’s also helpful for data scientists to work closely with engineering teams and have more involvement in authoring production code. The best experience I had with productizing models was at Daybreak Games, where I was responsible for owning a churn model for DC Universe Online, and helped to design the recommendation system for the marketplace in EverQuest Landmark.
Encourage data scientists to submit PRs
Related to the last point, data scientists should be more hands-on when putting models into production. Instead of handing off R or Python code to an engineering team that needs to translate it to Java or Go, data scientists should do more of the work needed to get models running on production systems. In my current role, I’m responsible for translating my R scripts into code that runs as part of our Dataflow process. I submit pull requests which are reviewed by our engineering team and then deployed to production. Another approach that can be used is having an intermediate model format, such as PMML, which simplifies the process of translating models between different languages.
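As a lightweight illustration of the intermediate-format idea (PMML itself requires dedicated tooling, so this sketch uses plain JSON instead), a model fit offline can be serialized to a language-neutral file, and the scoring function becomes a handful of lines that an engineering team can port to Java or Go against the same file. The model type, feature names, and coefficient values below are hypothetical:

```python
import json
import math

# Hypothetical churn model: logistic regression coefficients fit offline
# in R or Python, exported to a language-neutral JSON file.
model = {
    "model_type": "logistic_regression",
    "intercept": -1.2,
    "coefficients": {"days_since_login": 0.08, "sessions_last_week": -0.25},
}

with open("churn_model.json", "w") as f:
    json.dump(model, f)

def predict_churn(features: dict, path: str = "churn_model.json") -> float:
    """Score a player from the serialized model; missing features
    default to 0.0 so the contract stays simple to reimplement."""
    with open(path) as f:
        m = json.load(f)
    z = m["intercept"] + sum(
        coef * features.get(name, 0.0)
        for name, coef in m["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

score = predict_churn({"days_since_login": 30, "sessions_last_week": 1})
```

The point isn’t the format itself; it’s that the handoff artifact is a data file plus a tiny, well-specified scoring contract, rather than a notebook that someone has to reverse engineer.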
Most data science teams I’ve been a part of have used version control for backing up work and encouraging coordination among data scientists. However, I rarely had access to the engineering team’s repositories. There were a few occasions when data scientists submitted PRs at Twitch, but it wasn’t considered a core function of the role. This recommendation doesn’t apply to all data science roles and teams, but is something I would encourage for teams that are building data products.
Avoid Death by MBR
Another challenge data science teams face is being overwhelmed by monthly business reviews (MBRs), or other regularly scheduled meetings to review metrics. If you don’t have a business intelligence or analytics team, much of this responsibility may fall onto the data science team. It’s usually good to have data scientists involved in defining cross-company business metrics, but it quickly becomes overwhelming if each business unit requires custom metrics for tracking results. Automated reporting helps address this problem, but if the metrics are constantly changing, or data science is expected to write narratives for MBRs, it can become a substantial time sink.
The best approach I’ve seen taken for MBRs was at Daybreak Games. We settled on a handful of metrics to track across our titles, automated the reporting in Tableau, viewed the reports directly rather than copying them over to PowerPoint for meetings, and made game producers responsible for writing narratives about why the metrics moved up or down.
Provide a roadmap for how to level-up as an IC
If you claim to have a dual career ladder, where a data scientist can continue to advance their career as an individual contributor (IC), then you should provide a tangible path for getting promoted. Having high-level ICs on the team is a good indicator that you can continue progressing your career without taking on a management role, but there should be a clear set of criteria that employees can focus on in order to get promoted. One way of making this path possible is to identify ways that individual contributors can grow their influence outside of their direct team.
Electronic Arts was the best example of this that I experienced in the game industry. The analytics team had a job-family matrix that identified the criteria that needed to be met for the different analytics roles and to level-up within a role. EA also hosted several events where data scientists could share work, and build influence with teams outside of their business unit.
Set up tooling for reproducible research
Data science leaders should define a process and a shared set of tools so that work done by a data scientist can be reproduced at a later date, and by other team members. One of the challenges in achieving this goal is that different team members may use different scripting languages and work in different environments. My current approach to this problem is to set up Jupyter notebooks with support for the R kernel, since that covers my current team’s scripting needs. However, this doesn’t directly solve the issues of credential management and storing intermediate data results.
At EA we worked towards this goal by authoring an internal R package that provided utility functions for getting and storing data across our different data warehouses. We also set up shared Zeppelin notebooks for data scientists that wanted to author Scala scripts. My recommendation is to use a notebook environment, author libraries for common data tasks, use tools that provide credential management (e.g. Secret), and provide a staging area for storing intermediate data results from scripts. Also, tools such as Airbnb’s knowledge repo are useful in accomplishing this goal.
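To make the utility-library idea concrete, here is a minimal sketch of the two pieces I’d start with: credentials pulled from the environment (which a secrets manager would populate in practice) so passwords never land in checked-in notebooks, and a shared staging area for intermediate results. The function names, the `DW_PASSWORD` variable, and the use of SQLite as a stand-in for a real warehouse driver are all assumptions for illustration:

```python
import os
import sqlite3  # stand-in for the team's actual warehouse driver

def get_credential(name: str) -> str:
    """Read a secret from the environment instead of notebook code;
    a secrets manager would back these variables in production."""
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"credential {name!r} is not configured")
    return value

def stage_results(table: str, rows: list, db_path: str = "staging.db") -> int:
    """Write intermediate (metric, value) results to a shared staging
    area so a later run, or another data scientist, can pick them up."""
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (metric TEXT, value REAL)"
        )
        con.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        con.commit()
    finally:
        con.close()
    return len(rows)

def load_results(table: str, db_path: str = "staging.db") -> list:
    """Read staged results back, e.g. from a follow-up notebook."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(f"SELECT metric, value FROM {table}").fetchall()
    finally:
        con.close()
```

Even a small library like this pays off quickly: notebooks stop embedding connection boilerplate and secrets, and “rerun last month’s analysis” becomes loading a staged table rather than re-executing an expensive query.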
While most data science teams in the game industry can check off a few of these boxes, I would expect highly effective teams to already follow the majority of these guidelines.
Originally posted on Medium.