Deep Learning in Production for Predicting Consumer Behavior

We are excited to see how user experiences in e-commerce will evolve to personalized encounters.

Matthias Rettenmeier

Delivery Lead – Data Science

Tobias Lang

Research Engineer

At Zalando adtech lab in Hamburg, machine learning drives many of our production systems to build great user experiences. Our most recent product requires precise estimates of future interests of Zalando consumers based on their history of interacting with the fashion platform. For example, we want to predict a consumer's interest in ordering selected fashion articles.

We set ourselves the goal of building a powerful and versatile prediction tool that not only fits the task at hand, but is also ready for future product developments. Deep learning approaches have many advantages over traditional techniques, making them a great fit for our requirements. Recurrent neural networks (RNNs) in particular are a promising candidate to provide the methodological backbone for an e-commerce experience that becomes increasingly personalized.

We have developed a deep learning system based on RNNs and put it into production. Like most new technologies, bringing deep learning into production has its challenges. In the following article, we want to share our experiences and the choices we have made along the way in bringing this product to life.

Deep learning

At Zalando, we are convinced of the potential of deep learning and the value it can add to our products and to the consumer experience. Zalando Research was launched recently; it brings together research scientists who explore novel deep learning solutions.

While not a silver bullet for all scenarios, existing deep learning techniques are already beneficial today. Recurrent neural networks (RNNs) offer several advantages for our product. Most prominently, they operate directly on sequences of data and thus are a natural fit for modeling consumer histories. Time-intensive manual feature engineering is no longer required. Instead, we can focus on building a flexible and versatile model that can be easily extended to new types of input data and applied to a variety of prediction tasks. More generally, learning from raw data helps to avoid the limitations that arise from placing too much confidence in human domain modeling.

Furthermore, demand for explaining the predictions of machine learning models is increasing strongly. RNNs can be helpful in providing explanations as they make it easy to directly relate event sequences to predicted probabilities.

These advantages convinced us to investigate an RNN prototype. Based on our positive experiences, we decided to switch to RNNs altogether, leaving behind the traditional techniques that were previously part of our stack, such as logistic regression and random forests.


Moving to production

Major companies have adopted deep learning, alongside machine learning in general, as a core strategy for product development. The detection of handwritten zip codes was one of the earliest success stories of deep learning. It has been followed in recent years by many popular applications, such as photo search at Google, the Skype translator, and multiple production systems at Facebook. While these big companies use deep learning in production with great success, best practices in the wider community are still evolving. The step from research prototypes to prediction models in production is challenging, not only for RNNs, but for machine learning in general.

In our fashion context, a major challenge is that fashion seasons change, popular brands and articles come and go, and a large number of new articles enter the Zalando platform every day. This necessitates frequent re-training and model deployment. In addition, deep learning approaches bring their own set of challenges when moving to production, such as computing time, GPU usage and robustness of optimization.

Moving deep learning machinery into production requires regular data aggregation, model training and prediction tasks. We decided in favor of a modular approach that keeps these tasks separate, for maximum flexibility and efficiency.


Data Preparation

Before any machine learning is applied, data has to be gathered and organized to fit the input format of the machine learning model. Our raw data consists of tracking data which is collected as an event-stream from the fashion store and saved to AWS S3.

A side remark: Our models are based on the histories of anonymous profiles. (That is, we do not use customer data.) For ease of readability, we speak of consumers instead, but anonymous profiles are what we really refer to.

Months' worth of event-stream data has to be compiled into consumer histories that can be fed directly into our RNNs for training and prediction. We accomplish this with a data processing pipeline based on Apache Spark. The aggregation jobs run daily on AWS EMR and are scheduled using AWS Data Pipeline: once yesterday's raw data is available on S3, a new cluster is spawned to transform the newly available data. The output is again written to S3, where it can be picked up by the succeeding tasks.
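To make the aggregation step more concrete, here is a minimal sketch of turning an event stream into per-profile histories. It is written in plain Python rather than Spark so that it runs anywhere, and it is an illustration only, not the production job. The field names (`profile_id`, `timestamp`, `action`, `article_id`) and the `max_len` cutoff are assumptions made for this example.

```python
from collections import defaultdict


def build_histories(events, max_len=100):
    """Group raw events by profile and order each group by time.

    `events` is an iterable of dicts with the hypothetical keys
    profile_id, timestamp, action and article_id.
    """
    grouped = defaultdict(list)
    for event in events:
        grouped[event["profile_id"]].append(event)

    histories = {}
    for profile_id, profile_events in grouped.items():
        # Chronological order matters because the RNN consumes a sequence.
        profile_events.sort(key=lambda e: e["timestamp"])
        # Keep only the most recent events so that sequences stay bounded.
        recent = profile_events[-max_len:]
        histories[profile_id] = [
            (e["timestamp"], e["action"], e["article_id"]) for e in recent
        ]
    return histories


# Toy event stream, deliberately out of order.
events = [
    {"profile_id": "p1", "timestamp": 3, "action": "cart", "article_id": "a7"},
    {"profile_id": "p2", "timestamp": 1, "action": "view", "article_id": "a2"},
    {"profile_id": "p1", "timestamp": 1, "action": "view", "article_id": "a7"},
    {"profile_id": "p1", "timestamp": 2, "action": "view", "article_id": "a9"},
]
histories = build_histories(events)
```

In a Spark job, the same idea would typically be expressed as a group-by on the profile key followed by a per-group sort, which lets the work be spread across the cluster.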


Training models in production requires efficiency and stability. To limit the number of parameters, we decided to start off with a simple but powerful RNN architecture with a single LSTM layer. We implemented the model in Torch, together with scripts for training and prediction. Data aggregation reduces the input data for the RNNs to several gigabytes. Hence, training can be done on a single machine without the need for a distributed approach. We further improve efficiency by making use of the computational power of GPUs. Torch integrates Nvidia's CUDA framework for GPU computing and provides good support for switching between CPU and GPU computing.
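For readers unfamiliar with the architecture, the following sketch shows the forward pass of a single LSTM layer with a sigmoid readout, in dependency-free Python. It is not the Torch implementation described above: the class name `TinyLSTM`, the layer sizes, the random initialization and the one-hot input encoding are all assumptions chosen for illustration, and training is omitted entirely.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyLSTM:
    """Single LSTM layer followed by a sigmoid readout (forward pass only)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        # Each gate sees the current input concatenated with the last hidden state.
        concat_size = input_size + hidden_size
        self.weights = {g: rand_matrix(hidden_size, concat_size) for g in self.GATES}
        self.biases = {g: [0.0] * hidden_size for g in self.GATES}
        self.readout = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

    def step(self, x, h, c):
        z = list(x) + list(h)
        pre = {g: [a + b for a, b in zip(matvec(self.weights[g], z), self.biases[g])]
               for g in self.GATES}
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        # The cell state mixes the old memory (forget gate) with new content (input gate).
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def predict(self, sequence):
        """Run the whole event sequence and map the final hidden state to a probability."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            h, c = self.step(x, h, c)
        return sigmoid(sum(w * v for w, v in zip(self.readout, h)))


# Two events, one-hot encoded over three hypothetical event types.
model = TinyLSTM(input_size=3, hidden_size=4)
probability = model.predict([[1, 0, 0], [0, 1, 0]])
```

With a single layer, the parameter count grows only with the input and hidden sizes, which is one reason such a model remains practical to re-train frequently.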

We optimized our training code for the GPU setting and achieved several times the performance of training on CPUs. Currently, training a model on several million consumer histories takes about two hours on a single GPU. Further improvements can be achieved by re-training from previous models instead of starting from scratch. At the moment, we use an in-house GPU cluster for training, but we are working on moving to the new P instances available in AWS EC2.

After training, the models are validated on independent test data, using metrics like AUC and data likelihood. Recording these metrics allows us to monitor stability and enables us to prevent uploading models that do not achieve satisfactory validation performance.
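A validation gate of this kind can be sketched as follows. The code computes AUC and the mean log-likelihood for binary targets and only approves a model if both clear a threshold. The function names and the threshold values are assumptions for illustration, and the pairwise AUC is quadratic in the number of examples, so a rank-based implementation would be preferable for large test sets.

```python
import math


def auc(labels, scores):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive example gets the higher score; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_log_likelihood(labels, probs, eps=1e-12):
    """Average log-probability that the model assigns to the observed labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def should_upload(metrics, min_auc=0.7, min_log_likelihood=-0.7):
    """Gate: only models that clear both (hypothetical) thresholds are uploaded."""
    return (metrics["auc"] >= min_auc
            and metrics["log_likelihood"] >= min_log_likelihood)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.7]
metrics = {
    "auc": auc(labels, probs),
    "log_likelihood": mean_log_likelihood(labels, probs),
}
approved = should_upload(metrics)
```

Storing the `metrics` dictionary for every training run gives a time series that can be inspected for drifts in model quality.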


For the current stage of our product, predictions are computed in batches rather than in real time. This simplifies the task, as no real-time prediction system is required: predictions are scheduled and performed at regular intervals. Computing predictions is less demanding than training and can therefore be handled by regular CPUs. Calculations for several million consumer histories take about 20-30 minutes on a single machine.

Similar to the data aggregation tasks, we compute predictions on AWS EC2 instances and use AWS Data Pipeline for scheduling. The models used for prediction, which have been trained on our in-house cluster, are stored in S3. The models are picked up together with the input data by worker machines running dockerized Torch environments. These environments are configured to perform the prediction, validation and other post-processing steps required for our case. Again, the resulting output is stored in S3.

During these processing steps, we closely monitor input data as well as prediction results. Key statistics, like the number of data points and the distribution of variables and targets, help to detect major changes in the incoming data distributions. In addition, we track prediction performance by checking various metrics on validation data. General heuristics like the difference between actual and predicted targets provide sanity checks for model health. Alongside these operational stats, more abstract business metrics are captured. These business metrics allow us to understand how our model supports Zalando in delivering value.
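As a rough sketch of such sanity checks, the function below compares the size of a batch against a reference count and compares the mean predicted probability against the observed target rate. The function name, the alert labels and both tolerance values are assumptions for this example. Real monitoring would cover many more statistics.

```python
def check_batch(predictions, actuals, reference_count,
                count_tolerance=0.5, calibration_tolerance=0.05):
    """Return a list of alert names for one batch of predictions.

    predictions: predicted probabilities for the batch
    actuals: observed binary targets for the same data points
    reference_count: number of data points expected for a typical batch
    """
    count = len(predictions)
    if count == 0 or len(actuals) != count:
        return ["invalid_batch"]

    alerts = []
    # A large swing in volume often points to a problem in upstream data.
    if abs(count - reference_count) > count_tolerance * reference_count:
        alerts.append("count")

    # The mean prediction should be close to the observed target rate.
    mean_predicted = sum(predictions) / count
    mean_actual = sum(actuals) / count
    if abs(mean_predicted - mean_actual) > calibration_tolerance:
        alerts.append("calibration")
    return alerts


healthy = check_batch([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 0], reference_count=4)
suspicious = check_batch([0.9, 0.9], [0, 0], reference_count=10)
```

Here `healthy` comes back empty, while `suspicious` raises both alerts: the batch is far smaller than expected and the model predicts a much higher rate than was observed.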



The system is live, serving consumers on Zalando today. Our first experiences with operating the live system are positive, both with regard to performance and robustness. Overall, the effort required to put deep learning into production has not been much greater than with other machine learning products. The fact that offline predictions are sufficient, however, greatly reduced the complexity of the system. Nevertheless, we are eager to extend our system to real-time scenarios. For now, our next step is to move model training to the cloud for a more stable and scalable solution.

Another ongoing topic is the addition of new data sources as input to our RNNs. With the additional inputs we seek to increase prediction accuracy as well as to extend our product to a wider range of use cases. We are in the midst of creating a data ingestion layer that couples various data sources, like article databases and fashion insights, with our RNN system.

Beyond our use case, we are excited to see how user experiences in e-commerce will evolve into truly personalized encounters. RNN-based production systems are a promising way to enable this fascinating trend.


Lorenz Knies, Gunnar Selke, Matthias Rettenmeier and Tobias Lang are designing and engineering deep learning systems at Zalando adtech lab; Michael Gravert drives the first product at Zalando that makes use of these deep learning systems.