Closing the Data-Quality Loop

Producing high quality validation corpora without the traditional time and cost inefficiencies.

Hunter Kelly

Software Engineer

Posted on Jul 18, 2017

To measure the quality of some of the machine learning models that we have at Zalando, we need “Golden Standard” corpora.  However, creating a “Golden Standard” corpus is often laborious, tedious and time-consuming.  We therefore need a method that produces high-quality validation corpora without the traditional time and cost inefficiencies.


As the Zalando Dublin Fashion Content Platform (FCP) continues to grow, we now have many different types of machine learning models.  As such, we need high-quality labelled data sets that we can use to benchmark model performance and evaluate changes to the models.  Not only do we need such data sets for final validation, but going forward, we also need methods to acquire high-quality labelled data sets for training models.  This is becoming particularly clear as we start working on models for languages other than English.

Creating a “Golden Standard” corpus generally requires a human being to look at something and make some decisions.  This can be quite time consuming, and ultimately quite costly, as it is often the researcher(s) conducting the experiment that end up doing the labelling.  However, the labelling tasks themselves don't always require much prior knowledge, and could be done by anyone reasonably computer literate.  In this era of crowdsourcing platforms such as Amazon's Mechanical Turk and CrowdFlower, it makes sense to leverage these platforms to try to create these high quality data sets at a reasonable cost.


Back when we first created our English language Fashion Classifier, we bootstrapped our labelled data by using the (now defunct) DMOZ, also known as the Open Directory Project.  This was a site where volunteers had been hand-categorizing websites and webpages since 1998.  A web page could live under one or more "Categories".  Using a snapshot of the site, we took any web pages/sites that had a category that contained the word "fashion" anywhere in its name.  This became our “fashion” dataset.  We then also took a number of webpages and sites from categories like "News", "Sports", etc., to create our “non-fashion” dataset.
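
A minimal sketch of this kind of category-based labelling is shown here.  It is an illustration rather than our original code: the (url, categories) record shape, the `label_from_categories` helper and the slash-separated category paths are all assumptions, and the non-fashion list contains only the two example categories named above.

```python
from typing import Iterable, Optional

# Assumed top-level categories treated as "non-fashion"; only "News" and
# "Sports" are named in the text, the real list was longer.
NON_FASHION_TOP_LEVEL = {"News", "Sports"}


def label_from_categories(categories: Iterable[str]) -> Optional[str]:
    """Return 'fashion', 'non-fashion', or None for a page's categories."""
    cats = list(categories)
    # Any category containing the word "fashion" anywhere in its name wins.
    if any("fashion" in c.lower() for c in cats):
        return "fashion"
    # Otherwise accept pages filed under one of the chosen top-level categories.
    if any(c.split("/")[0] in NON_FASHION_TOP_LEVEL for c in cats):
        return "non-fashion"
    return None  # page is not used in either data set


# Hypothetical records, for illustration only.
pages = [
    ("http://example.com/a", ["Shopping/Clothing/Fashion_Designers"]),
    ("http://example.com/b", ["News/Newspapers"]),
    ("http://example.com/c", ["Science/Physics"]),
]
labelled = {url: label_from_categories(cats) for url, cats in pages}
```

Note that a substring match of this kind is exactly what makes the resulting labels noisy: a category name can mention fashion without the page itself being about fashion.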

Taking these two sets of links, and with the assumption that they would be noisy, but "good enough", we generated our data sets and went about building our classifier.  And from all appearances, the data was "good enough".   We were able to build a classifier that performed well on the validation and test sets, as well as on some small, hand-crafted sanity test sets.  But now, as we circle around, creating classifiers in multiple languages and for different purposes, we want to know:

  • What is our data processing quality, assessed against real data?
  • When we train a new model, is this new model better?  In what ways is it better?
  • How accurate were our assumptions regarding "noisy but good enough"?
  • Do we need to revisit our data acquisition strategy, to reduce the noise?

And of course, the perennial question for any machine learning practitioner:

  • How can I get more data??!?


Given that Zalando already had a trial account with CrowdFlower, it was the natural choice of crowdsourcing platform to go with.  With some help from our colleagues, we were able to get set up and understand the basics of how to use the platform.

Side Note: Crowdsourcing is an adversarial system

Rather than bog down the main explanation of the approach with too many side notes, it is worth mentioning up-front that crowdsourcing should be viewed as an adversarial system.

CrowdFlower "jobs" work on the idea of "questions", and the reviewer is presented with a number of questions per page.  On each page there will be one "test question", which you must supply.  As such, the test questions are viewed as ground truth and are used to ensure that the reviewers are maintaining a high enough accuracy (configurable) on their answers.
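
To make the idea of an accuracy threshold concrete, here is a rough sketch of how reviewers could be scored against test questions.  It is not CrowdFlower's implementation, which we did not see; the `trusted_reviewers` function, the tuple layout and the 0.7 default are assumptions made for the example.

```python
from collections import defaultdict


def trusted_reviewers(answers, ground_truth, min_accuracy=0.7):
    """Return the reviewers whose test-question accuracy meets the threshold.

    answers: iterable of (reviewer_id, question_id, label)
    ground_truth: dict mapping test question_id -> correct label
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for reviewer, question, label in answers:
        if question in ground_truth:  # only test questions count towards accuracy
            seen[reviewer] += 1
            correct[reviewer] += int(label == ground_truth[question])
    return {r for r in seen if correct[r] / seen[r] >= min_accuracy}


# Hypothetical answers: "t1" and "t2" are test questions, "q9" is a real one.
truth = {"t1": "fashion", "t2": "non-fashion"}
answers = [
    ("alice", "t1", "fashion"),
    ("alice", "t2", "non-fashion"),
    ("bob", "t1", "non-fashion"),
    ("bob", "t2", "non-fashion"),
    ("bob", "q9", "fashion"),
]
trusted = trusted_reviewers(answers, truth)
```

In this toy data, one reviewer answers both test questions correctly and the other only one of two, so only the first clears a 70% threshold.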

Always remember, though, that a reviewer wants to answer as many questions as quickly as possible to maximize their earnings.  They will likely only skim the instructions, if they look at them at all.  It is important to consider accuracy thresholds and to design your jobs such that they cannot be easily gamed.  One step that we took, for example, was to put all the links through a URL shortener, so that the reviewer could not simply look at the URL and make a guess; they actually had to open up the page to make a decision.
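
The point of the shortener is simply that the link shown to the reviewer carries no readable hint.  As a local stand-in for a real URL-shortening service, the idea can be sketched as follows; the `opaque_token` and `anonymize` helpers and the `short.example` domain are hypothetical, and a real shortener would also have to serve the redirects.

```python
import hashlib


def opaque_token(url: str, length: int = 10) -> str:
    """Derive a short hex token that reveals nothing readable about the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]


def anonymize(urls, base="https://short.example/"):
    """Map each opaque short link back to its original URL."""
    return {base + opaque_token(u): u for u in urls}


mapping = anonymize([
    "http://example.com/fashion/dresses",
    "http://example.com/sports/news",
])
```

Keeping the mapping from short link to original URL is what lets the reviewers' answers be joined back to the source data afterwards.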

Initial Experiments

We created a very simple job that contained 10 panels, each with a link and a dropdown.


We had a data set of hand-picked links to use as our ground-truth test questions: approximately 90 fashion links and 45 non-fashion links.  We then also picked some of the links we had from our DMOZ data set, and used those to run some experiments on.  Since this was solely about learning how to use the platform, we didn't agonize over this data set; we just picked 100 nominally fashion links and 100 nominally non-fashion links, and uploaded those as the data to use for the questions.

We ran two initial experiments.  In the first, we tried to use some of the more exotic, interesting "Quality Control" settings that CrowdFlower makes available, but we found that the number of "Untrusted Judgements" was far too high compared to "Trusted Judgements".  We simply stopped the job, copied it and launched another.

The second of the initial experiments proved quite promising: we got 200 links classified, with 3 judgements per link (so 600 trusted judgements in total).  The classifications from the reviewers matched the DMOZ labels pretty closely.  We examined all the links where the DMOZ label and the CrowdFlower reviewers disagreed; there was one borderline case that was understandable, and the rest were indicative of the noise we expected to see in the DMOZ labels.
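
One straightforward way to turn three judgements per link into a single label, and to pull out the links worth examining by hand, is a majority vote.  The sketch below assumes that approach; the `aggregate` and `disagreements` helpers and the input shapes are made up for the example and are not part of the CrowdFlower platform.

```python
from collections import Counter


def aggregate(judgements):
    """Collapse per-link judgements into (majority_label, unanimous) pairs.

    judgements: dict mapping link -> list of labels from trusted reviewers
    """
    result = {}
    for link, labels in judgements.items():
        label, votes = Counter(labels).most_common(1)[0]
        result[link] = (label, votes == len(labels))
    return result


def disagreements(aggregated, source_labels):
    """Links whose crowd majority label differs from the original label."""
    return sorted(
        link
        for link, (label, _) in aggregated.items()
        if source_labels.get(link) != label
    )


# Hypothetical judgements for two links, three reviewers each.
judgements = {
    "a": ["fashion", "fashion", "fashion"],
    "b": ["fashion", "non-fashion", "non-fashion"],
}
dmoz = {"a": "fashion", "b": "fashion"}
agg = aggregate(judgements)
to_review = disagreements(agg, dmoz)
```

With an odd number of judgements and two classes there are no ties, which is one practical reason to ask for 3 judgements per link.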

Key learnings from initial experiments:

  • Interestingly, we really overpaid on the first job.  Dial down the costs until after you've run a few experiments.  If the “Contributor Satisfaction” panel on the main monitoring page has a “good” (green) rating, you’re probably paying too much.
  • Start simple.  While it is tempting to play with the advanced features right from the get-go, don't.  They can cause problems with your job running smoothly; only add them in if/when they are needed.
  • You can upload your ground truth questions directly rather than using the UI; see the CrowdFlower docs for more information.
  • You can have extra fields in the data you upload that aren't shown to the reviewer at all; we were then able to use the CrowdFlower UI to quickly create pivot tables and compare the DMOZ labels against the generated labels.
  • You can get pretty reasonable results even with minimal instructions.
  • Design your job such that "bad apples" can't game the system.
  • It's fast!  You can get quite a few results in just an hour or two.
  • It's cheap!  You can run some initial experiments and get a feeling for what the quality is like for very little.  Even with our "massive" overspend on the first job, we still spent less than $10 total on our experiments.

Data Collection

Given the promising results from the initial experiments, we decided to proceed and collect a "Golden Standard" corpus of links, with approximately 5000 examples from each class (fashion and non-fashion).  Here is a brief overview of the data collection process:

  • Combine our original DMOZ link seed set with our current seed set
  • Use this new seed set to search the most recent CommonCrawl index to generate candidate links
  • Filter out any links that had been used in the training or evaluation of our existing classifiers
  • Sample approximately 10k links from each class: we intentionally sampled more than the target number to account for inevitable loss
  • Run the sampled links through a URL shortener to anonymize the URLs
  • Prepare the data for upload to CrowdFlower
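
The filtering and sampling steps in the list above can be sketched roughly as follows.  The seed-set merge and the CommonCrawl index search are not shown, and the `sample_candidates` function, its parameters and the fixed seed are assumptions made for this illustration.

```python
import random


def sample_candidates(candidates, already_used, per_class=10_000, seed=0):
    """Drop links already used elsewhere, then sample up to per_class per class.

    candidates: dict mapping class name -> iterable of candidate URLs
    already_used: set of URLs used to train or evaluate existing classifiers
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for cls, urls in candidates.items():
        fresh = sorted(set(urls) - already_used)  # de-duplicate and filter
        sampled[cls] = rng.sample(fresh, min(per_class, len(fresh)))
    return sampled


# Hypothetical candidate pools, for illustration only.
candidates = {
    "fashion": [f"f{i}" for i in range(50)],
    "non-fashion": [f"n{i}" for i in range(50)],
}
used = {"f0", "n1"}
sampled = sample_candidates(candidates, used, per_class=20)
```

Filtering before sampling matters here: removing already-used links after sampling would shrink the sample by an unpredictable amount.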

Final Runs

With data in hand, we wanted to make some final tweaks to the job before running it.  We fleshed out the instructions (not shown) with examples and more thorough definitions, even though we realized they would not be read by many.  We upped the minimum accuracy from 70% to 85% (as suggested by CrowdFlower).  Finally, we adjusted the text in the actual panels to explain what to do in borderline or error cases.


We ran a final experiment against the same 200 links as in the previous experiments.  The results were very similar to, if not marginally better than, those of the previous experiment, so we felt confident that the changes hadn't made anything worse.  We then incorporated the classified links as new ground truth test questions (where appropriate) into the final job.

We launched the job, asking for 15k links from a pool of roughly 20k.  Why 15k?  We wanted 5k links from each class; we were estimating about 20% noise on the DMOZ labels.  We also wanted a high level of agreement, so links that had 3/3 reviewers agreeing.  From the previous experiments, we were getting unanimous agreement on about 80% of the links seen.  So 10k + noise + agreement + fudge factor + human predilection for nice round numbers = 15k.
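
One way to read that back-of-the-envelope sum is sketched below.  It assumes the 20% noise estimate applies to both classes and that noise and disagreement are independent, neither of which is strictly true, so treat it as a rough check rather than the exact calculation behind the 15k figure.

```python
import math

target_per_class = 5_000
classes = 2
label_noise = 0.20     # estimated share of DMOZ labels that are wrong
unanimous_rate = 0.80  # share of links with 3/3 agreement in earlier runs

# Links to request so that, after losing mislabelled and non-unanimous links,
# roughly target_per_class remain in each class.
usable_fraction = (1 - label_noise) * unanimous_rate
needed = math.ceil(target_per_class * classes / usable_fraction)
print(needed)  # 15625
```

Under those assumptions the estimate lands a little above 15k, which is consistent with rounding to a nice round number.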

We launched the job in the afternoon; it completed overnight and the results were ready for analysis the next morning, which leads to...


How does the DMOZ data compare to the CrowdFlower data?  How good was "good enough"?


We can see two things, right away:

  1. The things in DMOZ that we assumed were mostly not fashion, were, in fact, mostly not fashion.  1.5% noise is pretty acceptable.

  2. Roughly 22% of all our DMOZ "Fashion" links are not fashion.  This is pretty noisy, and indicates that it was worth all the effort of building this properly labelled "Golden Standard" corpus in the first place!  There is definitely room for improvement in our data acquisition strategy.

Now, those percentages change if we only take into account the links where all the reviewers were in agreement; the noise in the fashion set drops down to 15%.  That's still pretty noisy.
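
For reference, noise figures of this kind can be computed along the following lines.  The `noise_rate` function and the row layout are assumptions, and the rows are toy data chosen only to show the mechanics; they are not our results.

```python
def noise_rate(rows, source_label, unanimous_only=False):
    """Share of links with the given source label whose crowd label disagrees.

    rows: iterable of (source_label, crowd_label, unanimous) tuples
    """
    relevant = [
        (src, crowd)
        for src, crowd, unanimous in rows
        if src == source_label and (unanimous or not unanimous_only)
    ]
    if not relevant:
        return 0.0
    wrong = sum(1 for src, crowd in relevant if crowd != src)
    return wrong / len(relevant)


# Toy rows: (DMOZ label, crowd majority label, all reviewers agreed?)
rows = [
    ("fashion", "fashion", True),
    ("fashion", "fashion", True),
    ("fashion", "non-fashion", True),
    ("fashion", "non-fashion", False),
    ("non-fashion", "non-fashion", True),
]
noise_all = noise_rate(rows, "fashion")
noise_unanimous = noise_rate(rows, "fashion", unanimous_only=True)
```

Restricting to unanimous links removes the contested cases from both the numerator and the denominator, which is why the two figures can differ noticeably.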

So what did we end up with, for use in the final classifier evaluations?  Note that the total numbers don't add up to 15k because we simply skipped links that produced errors on fetching, 404s, etc.


This shows us that, as in the initial experiments, we had unanimous agreement roughly 80% of the time.

Aside: It's interesting to note that both the DMOZ noise and the number of links where opinions were split work out to about 20%.  Does this point to some deeper truth about human contentiousness?  Who knows!

So what should we use to do our final evaluation?  It's tempting to use the clean set of data, where everyone is in agreement.  But on the other hand, we don't want to unintentionally add bias to our classifiers by only evaluating them on clean data.  So why not both?  Below are the results of running our old baseline classifier, as well as our new slimmer classifier, against both the "Unanimous" and "All" data sets.
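
Evaluating one classifier against both sets can be sketched like this.  The metrics chosen (accuracy, precision and recall), the `evaluate` and `evaluate_both` helpers and the toy keyword classifier are all assumptions for illustration; they stand in for our real classifiers and evaluation code.

```python
def evaluate(classifier, examples, positive="fashion"):
    """Accuracy, precision and recall of classifier over (item, label) pairs."""
    tp = fp = fn = correct = 0
    for item, label in examples:
        predicted = classifier(item)
        correct += predicted == label
        tp += predicted == positive and label == positive
        fp += predicted == positive and label != positive
        fn += predicted != positive and label == positive
    total = len(examples)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def evaluate_both(classifier, rows):
    """rows: list of (item, crowd_label, unanimous) tuples."""
    all_set = [(item, label) for item, label, _ in rows]
    unanimous_set = [(item, label) for item, label, agreed in rows if agreed]
    return {
        "All": evaluate(classifier, all_set),
        "Unanimous": evaluate(classifier, unanimous_set),
    }


def toy_classifier(text):
    """Stand-in classifier: calls anything mentioning 'dress' fashion."""
    return "fashion" if "dress" in text else "non-fashion"


# Toy rows: (page text, crowd label, all reviewers agreed?)
rows = [
    ("red dress sale", "fashion", True),
    ("football scores", "non-fashion", True),
    ("dress code at work", "non-fashion", False),
    ("new sneakers", "fashion", False),
]
report = evaluate_both(toy_classifier, rows)
```

In the toy data the classifier is perfect on the unanimous rows and wrong on both contested ones, which illustrates how evaluating only on clean data can flatter a model.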


Taking a look at our seeds and comparing that to the returned links, we find that 4,023 of the 15,000 are links in the seed set, with the following breakdown when we compare against nominal DMOZ labels:


Key Takeaways

  • Overall, the assumption that the DMOZ was "good enough" for our initial data acquisition was pretty valid.  It allowed us to move our project forward without a lot of time agonizing over labelled data.
  • The DMOZ data was quite noisy, however, and could lead to misunderstandings about the actual quality of our models if used as a "Golden Standard".
  • Crowdsourcing, and CrowdFlower, in particular, can be a viable way to accrue labelled data quickly and for a reasonable price.
  • We now have a "Golden Standard" corpus for our English Fashion Classifier against which we can measure changes.
  • We now have a methodology for creating not only "Golden Standard" corpora for measuring our current data processing quality, but a method that can be extended to create larger data sets that can be used for training and validation.
  • There may be room to improve the quality of our classifier by using a different type of classifier, one that is more robust in the face of noise in the training data (since we've established that our original training data was quite noisy).
  • There may be room to improve the quality of the classifier by creating a less noisy training and validation set.


Machine Learning can be a great toolkit to use to solve tricky problems, but the quality of data is paramount, not just for training but also for evaluation.  Not only here in Dublin, but all across Zalando, we’re beginning to reap the benefits of affordable, high quality datasets that can be used for training and evaluation.  We’ve just scratched the surface, and we’re looking forward to seeing what’s next in the pipeline.

If you're interested in the intersection of microservices, stream data processing and machine learning, we're hiring.  Questions or comments?  You can find me on Twitter at @retnuH.
