Book Launch: Data Science in Production

Recommendations for building data products based on a decade of game analytics.

January 6, 2020

One of the biggest trends in the gaming industry over the past decade was the rise of data and analytics in game development and live operations. While some MMOs were collecting game telemetry prior to 2010 and a few academics were exploring this space, Georg Zoeller’s presentation at GDC 2010 really set the stage for the shift in the industry that would take place over the next 10 years. Many gaming companies are now building games as live services, and analytics is critical to managing these operations.

Towards the end of the decade, data science began to play a larger role at many game publishers. Rather than focusing only on product data science, where the goal is to improve metrics in a specific game, applied data science is becoming a separate discipline, where teams are responsible for building production-grade data products. Many mobile publishers now have automated workflows for user acquisition and ad mediation that are operated by applied scientists and engineering teams. Applied science is at the intersection of machine learning engineering and data science, and this job title is now being used at companies such as Twitch and Microsoft.

Over the past 6 months, I authored and then self-published a book on data science with a focus on helping readers learn how to build production-grade data products, such as the ad platforms being deployed at game companies. While the book is not specific to the gaming industry, it’s based on my decade of experience in this industry and it provides examples for common use cases in mobile games, such as predicting if a user will make a purchase. The intended audience is analytics and data science practitioners and students who are looking to get hands-on experience with cloud technologies.

Another trend in the game industry over the past 10 years has been the continued rise of indie development, where new tools and platforms are opening up game creation to more and more developers. This trend has also been reflected in the book publishing world, where new platforms such as Leanpub and Kindle Direct are enabling authors to self-publish titles rather than working through traditional publishers. Publishing on Leanpub is similar to releasing an early access game: you can publish early drafts to your readers and integrate feedback and corrections from your audience. While the full versions of the text are behind paywalls, chapter excerpts for the book are available on Medium, code listings are available on GitHub, and a book sample is available on Leanpub. I’ll use the Five Ws and How method to describe the book in more detail.

Who

Who is the intended audience of this book?

This book is intended for data science students and practitioners who want to gain experience with tools that scale to large deployments in the cloud. It is meant as a Data Science 201 book that builds upon many of the great texts on the Python ecosystem, including the works on Pandas, scikit-learn, and Keras. It focuses on applied examples for scaling up simple models to distributed, fault-tolerant, and streaming deployments. Given the breadth of topics covered, it provides a taste of different approaches for building pipelines, rather than covering specific tools in depth.

When

When will the book be available?

It is available now! The digital version of the book was published to Leanpub on December 31, 2019 and the paperback version was published to Amazon on January 1, 2020.

Where

Where can I find the book and supporting materials?

The book is available in digital and paperback formats. Please reach out to me if you’re interested in other formats.

Digital (PDF): Leanpub
Color Paperback: Amazon
Kindle Edition: Amazon
Code Examples: GitHub
Sample (PDF): GitHub

The GitHub repository contains both Jupyter notebooks and Python scripts for the examples covered in the book.

Why

Why did I write a technical book on Python?

Over the past decade I’ve interviewed over 100 data science candidates while working at multiple game companies. One of the gaps I’ve seen in data science portfolios is applied experience working with cloud platforms such as AWS and GCP. While companies often restrict access to these platforms for data science and analytics teams, knowledge of these tools is becoming a prerequisite for roles focused on building large-scale data products. At ODSC West 2019, I talked about building a portfolio for applied scientist roles, where you need to demonstrate experience with large-scale data sets, cloud computing environments, and end-to-end pipelines. This book provided me with the opportunity to demonstrate expertise with many of the tools I’ve been advocating for aspiring data scientists to learn. I also wanted to author a book as a way of building my applied science portfolio.

What

What are the contents of the book?

Here are the topics covered in this book, with linked chapter excerpts:

Introduction: This chapter will motivate the use of Python and discuss the discipline of applied data science, present the data sets, models, and cloud environments used throughout the book, and provide an overview of automated feature engineering.
Models as Web Endpoints: This chapter shows how to use web endpoints for consuming data and hosting machine learning models as endpoints using the Flask and Gunicorn libraries. We’ll start with scikit-learn models and also set up a deep learning endpoint with Keras.
Models as Serverless Functions: This chapter will build upon the previous chapter and show how to set up model endpoints as serverless functions using AWS Lambda and GCP Cloud Functions.
Containers for Reproducible Models: This chapter will show how to use containers for deploying models with Docker. We’ll also explore scaling up with ECS and Kubernetes, and building web applications with Plotly Dash.
Workflow Tools for Model Pipelines: This chapter focuses on scheduling automated workflows using Apache Airflow. We’ll set up a model that pulls data from BigQuery, applies a model, and saves the results.
PySpark for Batch Modeling: This chapter will introduce readers to PySpark using the community edition of Databricks. We’ll build a batch model pipeline that pulls data from a data lake, generates features, applies a model, and stores the results to a NoSQL database.
Cloud Dataflow for Batch Modeling: This chapter will introduce the core components of Cloud Dataflow and implement a batch model pipeline for reading data from BigQuery, applying an ML model, and saving the results to Cloud Datastore.
Streaming Model Workflows: This chapter will introduce readers to Kafka and PubSub for streaming messages in a cloud environment. After working through this material, readers will know how to use these message brokers to create streaming model pipelines with PySpark and Dataflow that provide near real-time predictions.
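
To give a feel for the "Models as Web Endpoints" idea, here is a minimal sketch of a prediction endpoint. It is not taken from the book: the book uses Flask and scikit-learn, while this version sticks to the Python standard library and a hand-written logistic function so that it runs without extra dependencies. The weights, the feature names (sessions, level), and the function names are all hypothetical.

```python
import json
import math
from urllib.parse import parse_qs

# Hypothetical weights for a toy purchase-propensity model (not from the book).
WEIGHTS = {"sessions": 0.3, "level": 0.1}
BIAS = -2.0


def predict(features):
    """Return a purchase probability from a logistic function over the features."""
    score = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))


def app(environ, start_response):
    """WSGI application: read features from the query string, reply with JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    headers = [("Content-Type", "application/json")]
    try:
        # Keep only the features the toy model knows about; missing ones default to 0.
        features = {
            name: float(values[0])
            for name, values in params.items()
            if name in WEIGHTS
        }
    except ValueError:
        start_response("400 Bad Request", headers)
        return [json.dumps({"error": "features must be numeric"}).encode("utf-8")]
    body = json.dumps({"purchase_probability": predict(features)})
    start_response("200 OK", headers)
    return [body.encode("utf-8")]
```

Because app follows the WSGI calling convention, a WSGI server such as Gunicorn should be able to host it. Assuming the code is saved as a hypothetical module named endpoint.py, that would look like `gunicorn endpoint:app`, and a request such as `/?sessions=10&level=5` would return a JSON body with a single purchase_probability field.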

How

How did I author and publish the book?

For this book, I used the Leanpub platform to share drafts of the book as I authored the text. I was hoping that using this open approach to writing would help readers provide feedback on the text as it was being developed, but in general this did not provide much direction. This platform is useful for determining the size of your audience as you publish early versions, but I did not reach a scale where the book forum provided useful editorial feedback.

I used the bookdown package to translate the content authored in Markdown into a nicely formatted output. Packages like bookdown are making it much more practical to self-publish books, because much of the typesetting is handled by tools rather than requiring manual effort. When writing a full text with bookdown, you may occasionally need to use raw LaTeX commands, such as specifying margin details, but the capability that this library provides is well worth learning how to handle the edge cases.

From an authoring perspective, I started by fleshing out a detailed table of contents before writing any text. I identified the concepts and tools that I wanted to cover, and then wrote the book in a linear fashion. My general approach was to write a chapter at a time: I started by writing the code segments that I wanted to cover and then authored the chapter content around them.

Thank you for reading to the end. I hope you find the contents of this text useful for building out a data science portfolio! Ben Weber is a distinguished data scientist at Zynga. We are hiring!

About the Author(s)

Ben Weber

Blogger
