Feature Extraction: Science or Engineering?
It's time for feature engineering to stop being neglected in the machine learning literature.
Every time our customers visit the Zalando Fashion Store, we want to serve them personalised product recommendations that reflect their preferences. Some people like leather jackets, others don't. Some love ankle boots, while others prefer sneakers. Some follow the latest trends, while others prefer a classic style.
In a nutshell, the task of a personalised recommender system involves building user profiles from customers' behavior and predicting which product recommendations will be most relevant to those profiles. Intuitively, the user profile specifies properties such as how much interest the customer has in sportswear, or whether flat heels are preferred over high heels.
Machine learning offers plenty of different algorithms for building personalised recommendation engines, ranging from collaborative filtering to content-based ranking. Virtually any model, especially a content-based one, relies on a representation of the user profile (as well as the candidate recommendations) as a collection of attributes (or “features”) that serve as input to the recommender system. An overwhelming majority of the available machine learning literature assumes that suitable features exist somewhere in a data store and are just waiting to be fed into the designed algorithm.
Yet, coming up with a suitable collection of features usually takes way more time and effort than learning a reasonably accurate model from the data. This task is usually referred to as "feature engineering". In this post, I'll sketch out the solution we came up with for organising the feature extraction jobs that are run behind our recommender systems. I'll also challenge some assumptions that people often make when thinking of feature engineering as preliminary to predictive modeling.
Zalando’s approach to feature engineering
The goal of our solution was to give proper emphasis to a task that is arguably the most crucial part of the whole machine learning pipeline. At the same time, I want to highlight the importance of correctly understanding the impact of what is probably the most belittled (or at least neglected) data modeling task ever.
Let's consider first how our feature extraction jobs are combined into a pipeline, where the final step results in a machine-learned ranking model. Take a look at the following diagram:
Everything starts from user action logs, which record every action performed by customers in the shop. The first job for us is to aggregate all actions on a per-user basis. The difficulty here is that the volume of server logs is massive and distributed over several machines. Therefore, we cannot afford to crunch all of this data every time we need to retrieve actions for a given customer. Pre-aggregation of customer actions allows subsequent jobs to selectively retrieve the relevant information in a much more convenient way.
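The pre-aggregation step can be illustrated with a minimal sketch. It assumes each log record carries a user id, an article id, an action type and a timestamp, and that the logs arrive as several shards. The `Action` record, the `aggregate_by_user` function and the sample data are hypothetical names chosen for illustration, not the production implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Action(NamedTuple):
    """One logged customer action (hypothetical record layout)."""
    user_id: str
    article_id: str
    kind: str        # e.g. "view" or "purchase" (assumed labels)
    timestamp: int   # seconds since some reference point


def aggregate_by_user(log_shards):
    """Merge several log shards into one time-ordered history per user."""
    histories = defaultdict(list)
    for shard in log_shards:
        for action in shard:
            histories[action.user_id].append(action)
    # Sort each history so downstream jobs can rely on chronological order.
    for history in histories.values():
        history.sort(key=lambda a: a.timestamp)
    return dict(histories)


# Two shards standing in for logs spread over several machines.
shards = [
    [Action("u1", "a1", "view", 100), Action("u2", "a2", "view", 110)],
    [Action("u1", "a1", "purchase", 105), Action("u1", "a3", "view", 90)],
]
histories = aggregate_by_user(shards)
```

With the histories stored per user, a later job that needs the actions of one customer only has to read `histories["u1"]` rather than scan every shard again.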
Once the user histories have been aggregated, they are used for two purposes. On the one hand, we can extract a number of dynamic article features (i.e. quantities that change over time), such as the number of visits received by the product detail page over a given time window, the number of purchases, and so on.
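As a rough sketch of what a dynamic article feature looks like, the snippet below counts actions of a given type per article within a time window. The layout of `histories` (user id mapped to a list of `(article_id, kind, timestamp)` tuples), the function name `article_counts` and the action labels are assumptions made for this example.

```python
from collections import Counter

# Hypothetical pre-aggregated histories: user -> [(article_id, kind, timestamp)]
histories = {
    "u1": [("a1", "view", 90), ("a1", "view", 100), ("a1", "purchase", 105)],
    "u2": [("a1", "view", 110), ("a2", "view", 200)],
}


def article_counts(histories, kind, start, end):
    """Count actions of one kind per article with start <= timestamp < end."""
    counts = Counter()
    for history in histories.values():
        for article_id, action_kind, timestamp in history:
            if action_kind == kind and start <= timestamp < end:
                counts[article_id] += 1
    return counts


# Views and purchases per article in the window [95, 150).
views = article_counts(histories, "view", 95, 150)
purchases = article_counts(histories, "purchase", 95, 150)
```

Because the window bounds are parameters, the same job can be rerun for different periods, which is what makes the resulting feature change over time.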
On the other hand, we extract a wide range of user attributes, such as the time elapsed since the user was last active on the shop, or how the user's choices are distributed over different product brands. Optionally, user feature extraction can also exploit already computed article features. For example, if we want to measure how user purchases are affected by product popularity, then for each purchase, we need to check how popular the corresponding product was at the time it was purchased (e.g. in terms of click-through rate).
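The three user attributes mentioned here can be sketched as follows. The example assumes that article popularity is stored as time-stamped click-through-rate snapshots, sorted by time, so that the value valid at the moment of purchase can be looked up. All function names, the snapshot layout and the sample values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def seconds_since_last_active(timestamps, now):
    """Time elapsed between the user's latest action and `now`."""
    return now - max(timestamps)


def brand_distribution(brands):
    """Share of the user's choices that fall on each brand."""
    counts = Counter(brands)
    total = sum(counts.values())
    return {brand: n / total for brand, n in counts.items()}


def popularity_at(snapshots, timestamp):
    """Latest popularity value recorded at or before `timestamp`.

    `snapshots` is a list of (timestamp, ctr) pairs sorted by timestamp.
    Returns None if no snapshot exists yet at that time.
    """
    times = [t for t, _ in snapshots]
    i = bisect_right(times, timestamp)
    return snapshots[i - 1][1] if i else None


def mean_popularity_at_purchase(purchases, ctr_snapshots):
    """Average popularity of purchased articles, each measured at purchase time."""
    values = [popularity_at(ctr_snapshots[a], t) for a, t in purchases]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Made-up sample data for illustration only.
ctr_snapshots = {"a1": [(0, 0.02), (100, 0.10)], "a2": [(0, 0.04)]}
purchases = [("a1", 50), ("a1", 150), ("a2", 120)]

recency = seconds_since_last_active([90, 100, 105], now=200)
brands = brand_distribution(["brand_x", "brand_x", "brand_y", "brand_z"])
popularity_affinity = mean_popularity_at_purchase(purchases, ctr_snapshots)
```

The lookup in `popularity_at` is the reason user feature extraction depends on the article features being computed first: the popularity value has to be the one that held when the purchase happened, not the current one.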
Finally, user and article features are combined in order to estimate a scoring function that can rank user-article pairs in terms of goodness of fit. Ideally, an accurate ranking model will give higher scores to pairs in which the article is more relevant to the user profile than the other available candidates from the product catalog. The learned scoring function is deployed to our live services in order to rank the recommendations we serve, in different contexts, to Zalando customers.
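To make the idea of a scoring function concrete, here is a minimal sketch that uses a linear model over a few combined user-article features. The linear form, the feature names and the weight values are assumptions for illustration only; in practice the weights would be learned from data, and the actual model may look quite different.

```python
def pair_features(user, article):
    """Combine user and article features into one vector for the pair."""
    return [
        user["sportswear_interest"] * article["is_sportswear"],
        user["popularity_affinity"] * article["ctr"],
        article["ctr"],
    ]


def score(weights, features):
    """Linear scoring function: weighted sum of the pair features."""
    return sum(w * x for w, x in zip(weights, features))


def rank(user, candidates, weights):
    """Order candidate articles by descending score for this user."""
    return sorted(
        candidates,
        key=lambda article: score(weights, pair_features(user, article)),
        reverse=True,
    )


# Placeholder weights; a real system would estimate them from training data.
weights = [2.0, 1.0, 0.5]
user = {"sportswear_interest": 0.9, "popularity_affinity": 0.2}
candidates = [
    {"id": "ankle_boot", "is_sportswear": 0.0, "ctr": 0.10},
    {"id": "sneaker", "is_sportswear": 1.0, "ctr": 0.05},
]
ranking = [article["id"] for article in rank(user, candidates, weights)]
```

For this user, the strong interest in sportswear outweighs the ankle boot's higher click-through rate, so the sneaker is ranked first.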
Challenges and opportunities
As is clear from the workflow sketched above, feature extraction poses significant challenges in terms of architectural design and computational burden. Coming up with an optimal design is hard because of the rich interdependencies linking the different jobs, where even a small problem in a single component can have a tremendous downstream impact on the whole system.
The computational burden, in turn, arises from having to deal with way more data than we'll actually use in the final predictive modeling tasks. Whatever information we decide to extract for use in a machine learning model, we have to extract it from a virtually unbounded amount of data. Here, useful signal is buried under massive collections of seemingly chaotic events, which can be very hard to turn into meaningful representations. Yet, the quality of the resulting predictions will depend much more significantly on the quality of the features used than on any other modeling aspect. As Pedro Domingos puts it in this paper, “[a]t the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
This brings us to the core problem I want to emphasise. Because of its sheer technical complexity, it takes a serious engineering effort to design a robust and efficient feature extraction pipeline. Additionally, because of its downstream impact on the entire predictive modeling workflow, it is utterly impossible to build long-lasting infrastructure for feature extraction without deep knowledge of the modeling requirements that will arise afterwards from the machine learning components.
Therefore, outstanding engineering skills and strong machine learning experience are crucial, non-separable requirements of the feature engineering process. Here, my claim is that any attempt to separate concerns for software architecture design on the one hand and machine learning modeling on the other hand is bound to hurt the chances of success. Based on this consideration, it's all the more baffling to observe how little attention this task has received so far in the scientific community.
Feature engineering has typically been neglected in the machine learning literature, as it has always been regarded as “engineering, not science”. However, feature engineering cries out for scientific analysis. In my opinion, the only reason it's yet to be regarded as a science is that its laws have never been investigated, let alone understood, by scientists.