Author: Raphael Schönenberger, Data Scientist@Adtrac
In my last two blog posts, Audience Prediction in Digital-out-of-Home (DooH)-marketing and Get your audience data using Google Cloud Platform or how we process more than 10M records/day in a start-up company, I outlined what audience prediction is and how it can help Digital-out-of-Home (DooH) media owners create smart products that meet their customers' needs. I also noted that audience prediction usually requires large amounts of data about past audiences, and that this data needs to be imported first. I therefore shed light on the import pipeline we use at Adtrac and showed that even start-up companies and small businesses can process millions of records per day with relative ease using a cloud infrastructure. At Adtrac, we chose Google Cloud Platform (GCP) for this.
In this blog post, I’d like to focus on providing actual predictions, in our case using machine learning (ML). Building a production-scale ML model places similar demands on the implementation as building import pipelines: both must be scalable and maintainable. Therefore, I’ll describe how we implemented the ML process at Adtrac and which frameworks and tools we use.
Machine Learning in a Nutshell
If you are not familiar with machine learning, don’t panic: this chapter is for you. Let’s jump right into the topic.
In machine learning, especially in supervised machine learning, the goal is to predict a label based on an input signal. This label might be a category like “cat” or just a plain number. To achieve this goal, an algorithm is trained and adjusted with the help of data on which the label is already known. This is called training.
Usually, training a model consists of two parts: Feature Engineering and Training.
In Feature Engineering, the incoming data is processed so that the algorithm can work with it later. This includes selecting features, adding new ones, and removing unnecessary attributes from the data. Moreover, the data is split into two sets, Train and Eval, which later helps detect biases such as overfitting in the trained algorithm.
Now that the data has been transformed and cleaned up, it is passed on to the next step: the Training.
During the Training, a predefined algorithm (e.g., a Neural Network) is automatically adapted by looking at the incoming data and its associated labels. Afterwards the adapted algorithm is stored.
Having a trained algorithm doesn’t mean we are finished. We also want to make use of the model, which is called Serving. In Serving, we take new data for which we want to make a prediction, run it through the same transformation as in Feature Engineering, and then apply the stored algorithm to the transformed data. This hopefully leads to an accurate prediction.
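To make these steps concrete, here is a minimal sketch in plain Python. It is not our production code: the records, the column names hour, weekday and audience, and the simple linear model are all made up for illustration. It only shows how Feature Engineering, the Train/Eval split, Training and Serving fit together.

```python
import random


def engineer_features(record):
    """Feature Engineering: scale the hour and derive a weekend flag."""
    return [record["hour"] / 23.0, 1.0 if record["weekday"] >= 5 else 0.0]


def predict(model, features):
    """Apply a linear model (weights, bias) to one feature vector."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, features)) + bias


def train(rows, labels, epochs=300, lr=0.1):
    """Training: adapt the parameters of a linear model by gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = predict((weights, bias), x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Made-up historical records for which the label (audience) is already known.
records = [
    {"hour": h, "weekday": d, "audience": 20 + 10 * h + (50 if d >= 5 else 0)}
    for h in range(24)
    for d in (2, 6)
]

# Split the data into the two sets, Train and Eval.
random.Random(42).shuffle(records)
cut = int(0.8 * len(records))
train_set, eval_set = records[:cut], records[cut:]

model = train(
    [engineer_features(r) for r in train_set],
    [r["audience"] for r in train_set],
)

# Check the trained model on records it has never seen.
eval_error = sum(
    abs(predict(model, engineer_features(r)) - r["audience"]) for r in eval_set
) / len(eval_set)

# Serving: same transformation as in Feature Engineering, then the stored model.
new_record = {"hour": 12, "weekday": 6}
prediction = predict(model, engineer_features(new_record))
```

Note that engineer_features is called in both training and serving. Sharing this one function is what guarantees that new data is transformed exactly like the training data.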
Training a model
Now that we understand the basics of supervised machine learning, let’s take a closer look at how we implemented this process at Adtrac.
First of all, it’s important to realize that on a production scale we are dealing with hundreds of thousands of past audience records, which need to be transformed first and then trained on. Such large amounts of data require specialized equipment. As with the import pipelines, this can be challenging, especially for start-up companies or small businesses. Using a cloud infrastructure, Google Cloud Platform (GCP) in our case, helps to overcome these difficulties.
Feature Engineering with Dataflow and Tensorflow-Transform
As we saw in the last blog post, our past audience data resides in Google’s data warehouse BigQuery. The data is stored as a table: each record is a row, and the columns represent its original features. A pipeline written with the Apache Beam framework and run on Dataflow, Google’s managed service for Beam pipelines, takes these records and applies the Feature Engineering to each of them.
In the Feature Engineering itself, we remove unnecessary attributes, modify others and create brand new ones. For all these transformations, we make use of Tensorflow-Transform, a framework for preprocessing training data that can be built right into the Dataflow workflow. Its rule set of transformations is then stored in a Cloud Storage bucket. Finally, we split the data into two sets, Train and Eval, and store them in the same bucket.
Since all of this is done with Dataflow, we benefit from the fact that Dataflow scales automatically with the number of records to be processed. Thus, it can deal with large amounts of data.
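The idea of a stored rule set can be sketched in plain Python. This is a stand-in for the concept only, not the Tensorflow-Transform or Apache Beam API: the columns temperature, location and device_serial are hypothetical, and the hash-based Train/Eval split is an assumption about how such a split could be made reproducible, not a description of our pipeline.

```python
import hashlib
import json
import statistics


def analyze(records):
    """Full pass over the data: compute the statistics the transformations need."""
    temperatures = [r["temperature"] for r in records]
    return {
        "temperature_mean": statistics.mean(temperatures),
        "temperature_std": statistics.pstdev(temperatures),
        "location_vocab": sorted({r["location"] for r in records}),
    }


def transform(record, rules):
    """Per-record step: needs only the rule set, never the full data set."""
    scaled = (record["temperature"] - rules["temperature_mean"]) / rules["temperature_std"]
    return {
        "temperature_scaled": scaled,
        "location_id": rules["location_vocab"].index(record["location"]),
        # "device_serial" is deliberately dropped as an unnecessary attribute.
    }


def assign_split(record_id, eval_percent=20):
    """Deterministic Train/Eval assignment based on a hash of the record id."""
    bucket = int(hashlib.md5(record_id.encode("utf-8")).hexdigest(), 16) % 100
    return "eval" if bucket < eval_percent else "train"


# Hypothetical raw records as they might come out of the data warehouse.
records = [
    {"id": "a1", "location": "zurich_hb", "temperature": 10.0, "device_serial": "x1"},
    {"id": "a2", "location": "bern", "temperature": 20.0, "device_serial": "x2"},
    {"id": "a3", "location": "zurich_hb", "temperature": 30.0, "device_serial": "x3"},
]

rules = analyze(records)
rules_json = json.dumps(rules)         # in a real pipeline this goes to a storage bucket
rules_loaded = json.loads(rules_json)  # ... and is read back wherever it is needed

transformed = [transform(r, rules_loaded) for r in records]
splits = [assign_split(r["id"]) for r in records]
```

The important design choice is the separation of analyze and transform: once the rule set is stored, any later job can apply exactly the same transformation without seeing the training data again.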
In general, the Feature Engineering is heavily dependent on domain-specific knowledge.
Run Training on AI-Platform
For the Training, we use GCP’s AI-Platform, on which you can either run Jupyter notebooks or, as we do, submit jobs based on Python files. It also allows you to select the computation resources you need, including GPUs and TPUs.
To provide some more technical insight: we use Tensorflow for creating the model. Tensorflow is one of the most popular machine learning frameworks and integrates well with the AI-Platform. The model itself is packaged into a Tensorflow-Estimator, which can cope with large amounts of data, while the underlying architecture of the neural network is written in Keras.
The model defined in this way takes the transformed data from the Train set and starts adapting its parameters. From time to time, it checks its performance on the Eval set. This helps detect biases such as overfitting in the model.
After finishing thousands of such training cycles, the final state of the model is exported into a Cloud Storage bucket, and the Training is complete.
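The periodic check against the Eval set can be illustrated without Tensorflow. The following is a minimal sketch in plain Python, assuming a one-feature linear model and made-up data. The early stopping via a patience counter is an addition for illustration and not necessarily part of our setup, and all names (train_with_eval, eval_every, patience) are hypothetical.

```python
def mse(params, data):
    """Mean squared error of a linear model (w, b) on a list of (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_with_eval(train_set, eval_set, steps=2000, eval_every=100,
                    lr=0.2, patience=3):
    """Gradient descent on Train, with a periodic performance check on Eval."""
    w, b = 0.0, 0.0
    best_loss, best_params, bad_checks = float("inf"), (w, b), 0
    history = []
    n = len(train_set)
    for step in range(1, steps + 1):
        # One gradient-descent step using the whole Train set.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_set) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_set) / n
        w, b = w - lr * grad_w, b - lr * grad_b

        if step % eval_every == 0:
            eval_loss = mse((w, b), eval_set)
            history.append((step, eval_loss))
            if eval_loss < best_loss:
                best_loss, best_params, bad_checks = eval_loss, (w, b), 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break  # Eval performance no longer improves.
    return best_params, history


# Made-up data: y = 3x + 1 plus a small deterministic disturbance.
points = [
    (i / 20, 3.0 * (i / 20) + 1.0 + 0.02 * ((i % 3) - 1)) for i in range(21)
]
eval_set = points[::4]
train_set = [p for i, p in enumerate(points) if i % 4 != 0]

best_params, history = train_with_eval(train_set, eval_set)
w, b = best_params
```

The function returns the parameters with the best Eval loss rather than the last ones, which corresponds to exporting the model state you actually want to serve.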
Serving a model on a production scale
If an extensive evaluation shows that the trained model is accurate, we can bring the model into production. In many cases, that means embedding the model into an API-like service, which can be triggered with new data. In our case, this approach is not feasible, because we need to predict the audiences for thousands of points in the future at basically the same time. Hence, we need to run a precalculation that predicts all necessary future audiences. So what does that look like?
We store all points in time for which we want a prediction in a fresh BigQuery table. These raw records have the same form as our original training input, just without the label.
A second Dataflow job is defined. This job receives the data, the stored model and the Tensorflow-Transform rule set as input. Equipped with the rule set, Dataflow applies the same transformation as in the Feature Engineering. After that, it uses the model to predict the labels, that is, the audiences. In the end, the predictions are stored in BigQuery, ready to be used further.
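Conceptually, such a precalculation boils down to a few lines. The sketch below is plain Python and only mirrors the flow of the job; it does not use Dataflow, BigQuery or Tensorflow. The rule set, the model parameters and the feature definitions are made up, and the resulting list of rows stands in for the prediction table.

```python
import json
from datetime import datetime, timedelta

# Stand-ins for the artifacts stored during training (values are made up).
RULES_JSON = '{"hour_max": 23.0}'
MODEL_JSON = '{"weights": [230.0, 50.0], "bias": 20.0}'


def future_points(start, hours):
    """Raw records without a label: one per hour we want a prediction for."""
    return [{"timestamp": start + timedelta(hours=i)} for i in range(hours)]


def transform(record, rules):
    """Same transformation as in Feature Engineering, driven by the rule set."""
    ts = record["timestamp"]
    return [ts.hour / rules["hour_max"], 1.0 if ts.weekday() >= 5 else 0.0]


def predict(features, model):
    """Apply the stored (here: linear) model to one transformed record."""
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]


def precalculate(records, rules, model):
    """Transform and predict every record, producing rows ready to be stored."""
    return [
        {
            "timestamp": r["timestamp"].isoformat(),
            "predicted_audience": predict(transform(r, rules), model),
        }
        for r in records
    ]


rules = json.loads(RULES_JSON)
model = json.loads(MODEL_JSON)
rows = precalculate(future_points(datetime(2030, 1, 5, 0, 0), 48), rules, model)
```

Because every record is handled independently of the others, this kind of job parallelizes well, which is what makes a scalable runner a natural fit for it.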
Phew… That was exhausting. So let’s recap what we have covered.
Predicting audiences on a production scale requires transforming hundreds of thousands of records about past audiences. Hence, it comes in handy that we can use Dataflow for the training as well as for the serving. Furthermore, the actual training of our Tensorflow model is performed on Google’s AI-Platform, which can be equipped with the necessary computing power.
Ultimately, we can say that using cloud infrastructure and sharing code between training and serving makes the machine learning part production-ready.
In an upcoming blog post, I will take a closer look at the Feature Engineering process and describe what kind of transformations we use to encode our past audience data. So stay tuned…