Author: Raphael Schönenberger, Data Scientist @ Adtrac


In my last blog post, Audience Prediction in Digital-out-of-Home (DooH) Marketing, I discussed how audience prediction can help DooH media owners sell guaranteed views, which in turn helps their clients, the advertising agencies, address specific target groups. We also saw that forecasting an audience requires data about its past. But such audience data is huge and can easily comprise millions of records per day, especially when numerous digital screens are involved. Collecting this data can therefore be challenging, even more so for a start-up or small business. In this blog post, I will show you how we at Adtrac overcome these challenges using the Google Cloud Platform (GCP).

Why choose a cloud provider?

Choosing a cloud provider such as GCP or Microsoft Azure has two major advantages for start-ups and small businesses over building their own infrastructure.

  • There is no need to hire a team to set up and maintain the infrastructure, which saves money.
  • Scaling the infrastructure based on your current needs is simple.

Cloud solutions therefore offer start-ups and small businesses in particular a good way to handle large amounts of data without spending much money on infrastructure.

However, these benefits come at the price of sharing data with third parties. If you deal with sensitive data, cloud solutions are therefore usually not an option. At Adtrac we don’t have to deal with such concerns, because Advertima, our data provider, supplies us with anonymized audience data (click here for more information). For this reason, we opted for the Google Cloud Platform.

Data Pipeline at Adtrac using GCP

Request data from external infrastructure

In our particular case, the audience data doesn’t need to be refreshed more than once per day, although the setup would support more frequent updates, too.
Hence, it all starts with Cloud Scheduler, which triggers a Cloud Function via an HTTP request once in the middle of the night. The schedule of the trigger is configured with a simple cron expression (for example, 0 2 * * * fires at 2 a.m. every night).
After this trigger, the Cloud Function performs the actual request to our data provider, Advertima. It sets all necessary parameters, such as the time range of interest, and sends them to Advertima in a POST request. Advertima processes the request and, once the data is ready, pushes it back to the Cloud Function. For the transfer, the data is contained in a zipped .csv file, which the Cloud Function must first unzip before storing it in a dedicated bucket on Cloud Storage.
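As a rough sketch, this first Cloud Function could look as follows in Python. The endpoint URL, bucket name and parameters are illustrative placeholders, not Adtrac’s actual code:

```python
import io
import zipfile
from datetime import date, timedelta

PROVIDER_URL = "https://api.example-provider.com/audience/export"  # placeholder
BUCKET_NAME = "audience-raw-data"  # placeholder bucket


def unzip_single_csv(zip_bytes: bytes) -> tuple[str, bytes]:
    """Extract the single .csv file contained in the zipped payload."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        return name, zf.read(name)


def fetch_audience_data(request):
    """HTTP entry point: request yesterday's data and store it unzipped."""
    # Imported lazily so the pure helper above is testable without GCP libs.
    import requests
    from google.cloud import storage

    yesterday = date.today() - timedelta(days=1)
    params = {"from": yesterday.isoformat(), "to": yesterday.isoformat()}
    response = requests.post(PROVIDER_URL, json=params, timeout=600)
    response.raise_for_status()

    # The provider returns a zipped .csv; unzip it before uploading.
    name, csv_bytes = unzip_single_csv(response.content)
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(name).upload_from_string(csv_bytes, content_type="text/csv")
    return f"stored {name}", 200
```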

Load the data into your data warehouse

A second Cloud Function monitors this bucket. As soon as a file lands there, the function triggers a Cloud Composer instance. Luckily, you don’t have to write this part yourself, since the Google Cloud Platform provides a ready-made solution for this trigger (see here).
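A minimal sketch of such a trigger function is shown below. The DAG id and Airflow host are placeholders, and Google’s documented sample additionally handles authentication against the Composer environment, which is omitted here:

```python
def build_dag_conf(event: dict) -> dict:
    """Turn a Cloud Storage 'finalize' event into a DAG-run configuration."""
    return {"bucket": event["bucket"], "name": event["name"]}


def on_file_arrival(event, context):
    """Background Cloud Function: fires when a file lands in the bucket."""
    import requests  # lazy import keeps build_dag_conf testable on its own

    # Placeholder host; a real setup resolves this from the Composer
    # environment and attaches an identity token to the request.
    airflow_url = ("https://example-composer-host/api/v1/"
                   "dags/audience_import/dagRuns")
    requests.post(airflow_url, json={"conf": build_dag_conf(event)}, timeout=60)
```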
Cloud Composer is a GCP-managed Apache Airflow instance on which you can run directed acyclic graphs (DAGs), just as you would on any other Airflow instance. For cost-saving reasons, our Cloud Composer instance can fortunately be kept to a minimum, because it only orchestrates the pipeline and does NOT process the data itself. So what exactly does the DAG do?

The DAG has two main purposes:

  • Start and monitor the main Dataflow job
  • Move the .csv file from its current location into an archive bucket as soon as the job has finished.
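Assuming the Google provider package for Airflow, a hypothetical version of this two-task DAG could look like the following. The task ids, bucket names and the Dataflow template are made up for illustration:

```python
from datetime import datetime


def archive_destination(source_name: str) -> str:
    """Pure helper: where a processed .csv file ends up in the archive."""
    return f"archive/{source_name}"


def build_dag():
    # Imported lazily so the helper above stays testable without Airflow.
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataflow import (
        DataflowTemplatedJobStartOperator,
    )
    from airflow.providers.google.cloud.transfers.gcs_to_gcs import (
        GCSToGCSOperator,
    )

    with DAG("audience_import", start_date=datetime(2020, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        run_dataflow = DataflowTemplatedJobStartOperator(
            task_id="run_dataflow_job",
            template="gs://audience-templates/csv_to_bigquery",  # placeholder
            parameters={"input": "gs://audience-raw-data/*.csv"},
        )
        archive_file = GCSToGCSOperator(
            task_id="archive_csv",
            source_bucket="audience-raw-data",
            source_object="*.csv",
            destination_bucket="audience-archive",
            destination_object=archive_destination(""),
            move_object=True,  # move, not copy: the raw bucket is emptied
        )
        run_dataflow >> archive_file  # archive only after the job succeeds
    return dag
```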

For those of you who are not familiar with Dataflow: Dataflow is a GCP-managed Apache Beam runner, and Apache Beam in turn is a tool for defining and executing data-processing workflows. This is exactly where the custom row transformations applied to the data are defined. In fact, each record in the initial .csv file is loaded, transformed and finally written to a BigQuery table by Dataflow.
Besides the simplicity of defining workflows, one of the beauties of Dataflow/Apache Beam is that it scales with the number of incoming records. Processing more than 10M records per day is therefore easily possible.
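To make this concrete, here is a minimal, hypothetical Beam pipeline. The column names and the BigQuery table are assumptions for illustration, not our actual schema:

```python
def parse_record(line: str) -> dict:
    """Turn one .csv line into a BigQuery row; a real pipeline would add
    validation and proper type handling here."""
    screen_id, timestamp, audience_count = line.split(",")
    return {
        "screen_id": screen_id,
        "timestamp": timestamp,
        "audience_count": int(audience_count),
    }


def run(input_path: str, table: str):
    # Imported lazily so parse_record stays testable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The Dataflow runner autoscales workers with the number of records.
    options = PipelineOptions(runner="DataflowRunner")
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadCsv" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
         | "ParseRows" >> beam.Map(parse_record)
         | "WriteToBQ" >> beam.io.WriteToBigQuery(table))
```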

In the end, all our data has been transferred from a .csv file into a BigQuery table, while the original file has been moved into an archive.
We then use this table for all further applications such as audience prediction or exploratory analysis.

Bring your code to life

Something that is often neglected but is crucial for a production-ready data pipeline is the management of its code. The code running in a data pipeline should be treated the same way as any other code. This includes the use of build and deployment tools.
At Adtrac, we store all our pipeline code in a GitHub repository. Each successful pull request into the main branch triggers Cloud Build, a CI/CD platform that is configured via a .yml file. This configuration contains instructions on how to deploy all the tools involved in the data pipeline, such as the Cloud Functions. One big advantage of this approach is that it keeps the pipeline maintainable, something you should definitely care about even at an early stage of the business.
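For illustration, a minimal Cloud Build configuration deploying one of the Cloud Functions might look like this. The function name, region and source path are placeholders, not our real setup:

```yaml
# Illustrative cloudbuild.yml: redeploy one Cloud Function on every merge
steps:
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    args:
      - gcloud
      - functions
      - deploy
      - fetch_audience_data      # placeholder function name
      - --runtime=python39
      - --trigger-http
      - --region=europe-west6    # placeholder region
      - --source=./functions/fetch
```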


Summary

Audience data often comprises millions of records per day, and processing these records requires specialized infrastructure. At Adtrac we decided on a cloud provider, in our case Google (Google Cloud Platform). With various GCP tools including Cloud Scheduler, Cloud Functions and Cloud Dataflow, we request the data from our data provider, transform each record and finally store it in BigQuery. This setup allows us to process more than 10M records per day.
But processing large amounts of data is not enough for a production-ready pipeline; code management is equally important. For this purpose, Adtrac uses Cloud Build from GCP.

Now that the audience data is in place, my next blog post will focus on the infrastructure required to run the audience predictions. So stay tuned…