Learnings from Distributed XGBoost on Amazon SageMaker

What I learned from distributed training with XGBoost on Amazon SageMaker.

Scott Joseph Small

Machine Learning Engineer

Posted on Jun 22, 2020

Overview

XGBoost is a popular Python library for gradient boosted decision trees. The implementation allows practitioners to distribute training across multiple compute instances (or workers), which is especially useful for large training sets.

One tool used at Zalando for deploying production machine learning models is the managed service from Amazon called SageMaker. XGBoost is already included in SageMaker as a built-in algorithm, meaning that a prebuilt docker container is available. This container also supports distributed training, making it easy to scale training jobs across many instances.

Despite SageMaker handling the infrastructure side of things, I found that distributed training with XGBoost and SageMaker is not as easy as simply increasing the number of instances. I discovered a few small "gotchas!" when attempting a few simple trainings. This post will step through my failed attempts, and end with a genuine distributed training with XGBoost in Amazon SageMaker.

Experiment Setup

I wanted to get an intuitive idea of how well the training time with XGBoost scales as the number of instances grows, as well as how the training time changes when the data size increases. I am not especially interested in producing the "best" model for this problem, per se, but there is a natural trade-off between training time and model accuracy that should be considered.

For a data set, I used the Fashion MNIST by Zalando Research. The problem itself is to classify small images (28x28 pixels) of clothing as being from 1 of 10 different classes (t-shirts, trousers, pullovers, etc). The data set has 60,000 images for a training set and 10,000 images for a validation set.

To increase the training size, I duplicate the training data, which lets me measure how model training time scales as the computational resources change. The number of times the training data is duplicated is referred to as the "replication factor". For a typical ML project, you probably don't want to duplicate the training set outright: although doing so improves our training and validation accuracies here, it is likely less efficient than changing hyperparameters (you might instead create new images with added noise to improve regularization). For reference, the size on disk of the training data for different replication factors is provided below, followed by a short sketch of the duplication itself.

  • Replication factor 1: 0.63 GB, 60,000 images
  • Replication factor 2: 1.24 GB, 120,000 images
  • Replication factor 4: 2.48 GB, 240,000 images
  • Replication factor 8: 4.95 GB, 480,000 images
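
A minimal sketch of the duplication, assuming the training data has been loaded into a pandas DataFrame called train_df (an illustration rather than my exact code):

import pandas as pd

def replicate(train_df, factor):
    # Concatenate the training set with itself `factor` times.
    return pd.concat([train_df] * factor, ignore_index=True)

# Example: build the replication-factor-4 training set (240,000 rows)
train_rf4 = replicate(train_df, 4)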

I wanted to use hyperparameters that would give reasonably good accuracy, so I ran a hyperparameter tuning job in SageMaker, with one instance per training. I tuned all of the tunable hyperparameters except "num_round", which was fixed to 100. This hyperparameter sets the number of boosting rounds (i.e. the number of decision trees); both training time and accuracy increase as its value increases. My hyperparameters were as follows:

hps = {'alpha': 0.0,
       'colsample_bylevel': 0.4083530569296091,
       'colsample_bytree': 0.8040025839325579,
       'eta': 0.11764087266272522,
       'gamma': 0.43319156621549954,
       'lambda': 37.547406128070286,
       'max_delta_step': 10,
       'max_depth': 6,
       'min_child_weight': 5.076838893848415,
       'num_round': 100,  # Not tuned: kept fixed
       'subsample': 0.8915771964367318,
       'num_class': 10,  # Not tuned: defined by Fashion MNIST
       'objective': 'multi:softmax'  # Not tuned: defined by Fashion MNIST
      }
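
For reference, a tuning job like the one described above can be set up with the SageMaker Python SDK roughly as follows. This is a minimal sketch: the ranges and job counts are illustrative rather than the exact ones I used, and xgb, s3_input_train, and s3_input_validation are the estimator and input objects defined later in this post.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=xgb,                              # XGBoost estimator, defined below
    objective_metric_name='validation:merror',  # minimize the multi-class error
    objective_type='Minimize',
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.5),
        'max_depth': IntegerParameter(3, 10),
        'subsample': ContinuousParameter(0.5, 1.0),
    },
    max_jobs=20,
    max_parallel_jobs=2)
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})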

There are additional hyperparameters beyond those listed above which are not tunable. I kept those at their default values (which, as you will see, can cause some unexpected results). The full list of hyperparameters offered by XGBoost differs from the list offered by the SageMaker container, as SageMaker adds a few extra hyperparameters which do not control model performance. The objective "multi:softmax" produces a metric called merror, defined as #(wrong cases)/#(all cases).

Lastly, the tools. I wrote all of the code for my experiments in Python 3.7 using the Amazon SageMaker Python SDK. I used the SageMaker docker container version 0.90-1 for XGBoost, the URI of which can be found by using the SageMaker Python SDK:

from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(region, 'xgboost', repo_version='0.90-1')

For each of the SageMaker training jobs, I used ml.m5.xlarge instances.
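
Putting the container, instance type, and hyperparameters together, the estimator can be created roughly like this. This is a sketch using the v1 SageMaker Python SDK; role and bucket are placeholders you would supply yourself.

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
xgb = Estimator(container,                        # XGBoost container URI from above
                role=role,                        # your SageMaker execution role (placeholder)
                train_instance_count=2,           # number of workers to distribute training over
                train_instance_type='ml.m5.xlarge',
                output_path='s3://{}/output'.format(bucket),  # bucket name is a placeholder
                sagemaker_session=session)
xgb.set_hyperparameters(**hps)                    # the dictionary from the previous section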

Failed Attempt: Naive Distributed Computing

My first attempt was to check how the training time scaled as the number of instances increases. I expected to see a roughly linear improvement: if the number of instances doubles, then the training time should be cut in half.

I used a naive approach: other than the settings mentioned in the "Experiment Setup" section, I used default values, including a replication factor of 1. What I found was very different from my expectations:

[Plot: training time vs. number of instances, replication factor 1]

There are two things to note here. Going from 1 to 2 instances increases the training time, though I expected to see the training time cut in half. Going beyond 2 instances, the training time is relatively flat.

Going from 1 to 2 instances triggers an internal switch from non-distributed to distributed training in XGBoost. There is a hyperparameter called tree_method which sets the algorithm used for computing splits at a node in a decision tree. The default for tree_method is "auto". For one instance, the greedy algorithm called "exact" is used. For more than one instance, an algorithm called "approx" is used, which approximates the greedy algorithm. The logic behind "auto" (as well as the other algorithm choices) is explained in the XGBoost documentation, and the implementations of "exact" and "approx" are described in the XGBoost paper by Chen and Guestrin.

The second thing to note is that after two instances, the training time remains flat as more instances are added. This is because each instance is training on the same data. In SageMaker, unless otherwise specified, the entire training data set is sent to every instance. This setting is called "FullyReplicated". However, XGBoost expects each instance to receive a subset of the full data set. Another way to think of this is that the training data is replicated once per instance, and each instance works on its own complete copy.

The data distribution can be corrected by sharding the training data (dividing it into different files, one for each instance), and defining an s3_input object such as

import sagemaker
s3_input_train = sagemaker.s3_input(s3_data=s3_location,
                                    content_type='csv',
                                    distribution='ShardedByS3Key')

and then starting a training job by passing the s3_input objects for training (and a similar one for validation):

xgb.fit(inputs={'train': s3_input_train, 'validation': s3_input_validation})
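
The shards themselves have to exist in S3 before the training job starts. Here is a minimal sketch of how they could be produced, assuming the training data is in a pandas DataFrame called train_df with the label in the first column; the bucket and prefix names are placeholders, and this is an illustration rather than my exact preprocessing code.

import boto3
import numpy as np

def shard_and_upload(train_df, n_shards, bucket, prefix):
    # Shuffle the training set, split it into n_shards CSV files, and upload
    # them under one S3 prefix. With distribution='ShardedByS3Key', SageMaker
    # sends a different subset of these files to each training instance.
    s3 = boto3.client('s3')
    shards = np.array_split(train_df.sample(frac=1.0), n_shards)
    for i, shard in enumerate(shards):
        filename = 'train_shard_{:02d}.csv'.format(i)
        # The built-in XGBoost container expects CSV input with the label in
        # the first column and no header row.
        shard.to_csv(filename, header=False, index=False)
        s3.upload_file(filename, bucket, '{}/{}'.format(prefix, filename))

# Example: one shard per instance for a 4-instance training job
shard_and_upload(train_df, n_shards=4, bucket='my-bucket', prefix='fashion-mnist/train')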

Take-Aways

  • XGBoost takes different default actions for the hyperparameter tree_method when moving to distributed training from non-distributed training. We should be mindful of this when estimating training times and when tuning hyperparameters.
  • XGBoost expects data to be split for each of the instances, but SageMaker by default sends the entire data set to each instance. We need to set the data distribution to "ShardedByS3Key" in SageMaker to match the expectations of XGBoost.

Failed Attempt: Using the Greedy Algorithm

To correct my previous failed attempt at distributed training, I made two changes to my experiment (both are sketched in code just after this list):

  • I set the value of the hyperparameter tree_method to "exact", so that each training job uses the same value of tree_method.
  • I set the data distribution for SageMaker to "ShardedByS3Key", and divided my training set randomly so that each instance gets a different piece of the training set.
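
Concretely, the first change is one more entry in the hyperparameter dictionary; the second is the ShardedByS3Key input shown in the previous section. A sketch, reusing the hps dictionary and xgb estimator from above:

hps['tree_method'] = 'exact'    # force the greedy split-finding algorithm on every instance
xgb.set_hyperparameters(**hps)
xgb.fit(inputs={'train': s3_input_train, 'validation': s3_input_validation})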

In addition to checking the expected scaling with the number of instances (i.e. doubling the number of instances should cut the training time in half), I also increased the replication factor to get a sense of how the training time scales with the size of the training set. I expected something similar: if the size of the training data doubles, then the training time should double.

[Plots: training time vs. number of instances for each replication factor, in linear and log scale]

The first plot shows the training times for each of the 4 replication factors. The second plot is the same as the first, but in log scale. The dotted lines indicate my expected training times (i.e. doubling the number of instances should halve the training time).

This actually looks pretty good! The training times match well with what one might expect. The trainings with higher replication factors take longer because there is more data to process; in fact, the training time increases by roughly a factor of 2 when the training data size is doubled. It's also worth pointing out that more training data results in better scalability: with lower replication factors, the training time plateaus (and even increases a little) at larger numbers of instances, which suggests that the overhead costs are eating the benefits of distributing the workload.

At first glance, everything seems to be ok: more training data implies longer training times, more compute resources implies shorter run times. But a check of the training and validation errors shows that something is not right:

[Plots: training and validation error vs. number of instances]

As the number of instances increases, the error for training and validation increases. This is an artifact of the hyperparameter tree_method. For distributed training, XGBoost does NOT implement the "exact" algorithm. However, SageMaker has no problem letting us select this value in the distributed trainings. In this situation, the training data is divided among the instances, and then each instance calculates its own XGBoost model, ignoring all other instances. Once each instance is finished, the model from the first instance is saved, and the others are discarded.

The timing and error graphs reflect this behavior: as the number of instances increases, the training data on any given instance is smaller, resulting in faster trainings but worse error. A cheaper way to replicate this experiment is to throw away a percentage of the training data and then train with only one instance.

Take-Aways

  • Don't use "exact" as the value of tree_method with distributed XGBoost, because it's not actually implemented for distributed training on the XGBoost side. Use "approx" or "hist" instead.

Successful Attempt: Distributed XGBoost with SageMaker

After the lessons from the previous attempt, I repeated the experiment, this time using "approx" for tree_method. This does introduce a new hyperparameter, called sketch_eps, for which I used the default value.
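
The only code-level change from the previous attempt is again the tree_method entry. A sketch, reusing hps and xgb from earlier; the commented-out line shows where sketch_eps would go, though I simply kept the default.

hps['tree_method'] = 'approx'   # distributed-friendly approximate split finding
# hps['sketch_eps'] = 0.03      # only used by 'approx'; left at the default here
xgb.set_hyperparameters(**hps)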

[Plots: training time vs. number of instances for each replication factor with tree_method "approx", in linear and log scale]

The scaling looks good here, similar to the previous attempt, albeit with longer training times. A check of the training and validation errors is more satisfying:

[Plots: training and validation error vs. number of instances with tree_method "approx"]

The training and validation errors do show some noise. Note that there is randomness involved: the piece of the training set given to each instance was selected randomly for each training, and node splitting in a decision tree has its own randomness (see hyperparameters like subsample or colsample_bylevel).

Take-Aways

  • Using many instances with a "low" amount of training data is a waste of computational resources. For example, using a replication factor of 1, the training time of using 10 instances is not much better than using 3 or 4.
  • When the training data is sufficiently large, doubling the number of instances approximately halves the training time.
  • The scaling in training data size is about what we expect: doubling the training data approximately doubles the training time.

Conclusion

Amazon SageMaker makes it easy to scale XGBoost algorithms across many instances. But with so many "knobs" to play with, it's easy to create an inefficient machine learning project.


