Building a dynamic inventory optimisation system: A deep dive
This technical blog outlines how we built a scalable inventory optimisation system to help partners maintain a profitable inventory.
In e-commerce, optimising replenishments is a crucial inventory problem. This involves solving three sub-tasks: What articles should be in stock? When should they be replenished? Where should the inventory be optimally allocated in the network of warehouses?
Moreover, most e-commerce supply chains involve complex environments:
- Vast catalogue: up to millions of articles
- Multi-echelon network: dozens of warehouses spread across several countries
- Diverse and rotating catalogue: seasonal goods rotating in and out within pre-defined, specific windows of sale
- High uncertainty on key decision factors: fluctuating demand patterns and fluctuating shipment or supplier lead times
At ZEOS, we recognise that our partners share these challenges. To empower them, we're developing AI-driven replenishment recommendations.
The scale and complexity of this inventory problem make it a unique combined Applied Science and ML Engineering problem. How can a system that continuously updates decisions account for these constantly changing and uncertain factors? The answer lies in building a dynamic inventory optimisation system.
The article will cover:
- Brief overview of the inventory optimisation framework
- Deep dive into how we scale demand forecasting and accelerate research in our demand forecasting pipelines
- Deep dive into how we run optimisation at scale in our policy optimisation pipelines
Optimisation framework
We frame replenishment decisions as a cost-optimisation exercise, with the end goal of minimising inventory costs:
\(Min\ Costs(\theta) = C_{storage}(\theta) + C_{lost\ sales}(\theta) + C_{overstock}(\theta) + C_{operations}(\theta) + C_{inbound}(\theta)\)
In simpler words, we want to find optimal decisions \(\theta^*\), that:
- Reduce stockouts to avoid the cost of lost sales
- Limit inventory in warehouses at any point in time to reduce stock-holding costs
- Balance the long-term cost of overstock with the short-term cost of lost sales
- Satisfy the operational constraints/logistics setup (lead times, desired review frequency, …)
- Capture the stochastic nature of the decision-making process
To do so, we rely on a 2-step flow:
- Step 1: We generate or gather the required inputs, such as probabilistic demand forecasts, returns lead-time forecasts, shipment lead times, user/item economics, the latest known stock state, and stock in transit.
- Step 2: All inputs are fed into a recommendation engine that leverages Monte Carlo simulations and black-box gradient-free optimisers for optimisation under uncertainty.
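To make this concrete, here is a minimal, self-contained sketch of optimisation under uncertainty via Monte Carlo simulation and a gradient-free search. The single-SKU policy, the cost parameters, and the normal demand distribution are illustrative assumptions, not our production model:

```python
import random

def simulate_cost(reorder_point, demand_samples,
                  storage_cost=0.1, lost_sale_cost=2.0):
    """Average inventory cost over Monte Carlo demand samples
    for a toy single-echelon, single-period policy."""
    total = 0.0
    for demand in demand_samples:
        leftover = max(reorder_point - demand, 0)  # stock left over -> storage cost
        shortage = max(demand - reorder_point, 0)  # unmet demand -> lost sales cost
        total += storage_cost * leftover + lost_sale_cost * shortage
    return total / len(demand_samples)

random.seed(42)
# Hypothetical probabilistic demand forecast: samples from a predictive distribution.
demand_samples = [random.gauss(100, 20) for _ in range(5000)]

# Gradient-free search: evaluate candidate policies and keep the cheapest.
candidates = range(60, 181, 5)
best = min(candidates, key=lambda q: simulate_cost(q, demand_samples))
print(best)
```

In production the decision vector \(\theta\) spans many SKUs and warehouses, the cost function includes all five terms above, and the search uses a dedicated black-box optimiser rather than a coarse grid; the mechanics, however, are the same.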
The demand forecasting and replenishment optimisation systems are the core components, both in terms of impact and engineering complexity, and each gets a deep dive later in the article.
Overarching building blocks and design philosophy
We break the inventory optimisation problem into two isolated but connected building blocks: Demand Forecast and Inventory Optimisation. The Demand Forecast pipeline is a batch prediction pipeline that produces probabilistic forecasts for articles at a weekly cadence. The Inventory Optimisation pipeline offers daily batch predictions, as well as real-time inference endpoints that let our B2B partners interactively plan inventory settings. This service is surfaced via the partner portal, which provides a holistic picture of inventory health along with other metrics and KPIs.
Both pipelines are implemented using zFlow, an internal machine learning ecosystem that offers seamless integration and abstractions for AWS and Databricks infrastructure. This enables us to focus on the machine learning application code without the overhead of building and maintaining complex infrastructure code. zFlow provides out-of-the-box security through in-transit and at-rest encryption for all artefacts, and enables orchestration via AWS Step Functions.
Scalable demand forecasts for millions of articles
To effectively manage our supply chain, we must accurately forecast demand for a vast number of products (SKUs) on a weekly basis. This requires a scalable and efficient forecasting system.
The following steps describe the flow of orchestration from right to left.
1. Feature Engineering: Data Pre-processing and Data Transformation Layers
We start by extracting features from curated data products including sales and availability information for all articles across warehouses and sales channels. Numerous (data) engineering teams across Zalando build and maintain these curated data products on a centrally governed data lakehouse, ensuring compliance with relevant access control protocols. For scalability, efficiency and interpretability, we split feature engineering into two complementary stages: data pre-processing and data transformation. The following table summarizes the design rationale for these stages.
| Criteria | Data Pre-Processing | Data Transformation |
| --- | --- | --- |
| Primary Objective | Model upstream data products to represent the business problem in a human-understandable structure, enabling easier validation and analysis. | Engineer features from pre-processed data to maximize predictive signal for model training. |
| Example Transformations | Joins, filters, aggregations, etc. | Encoding, normalization, etc. |
| Libraries and Frameworks | PySpark, Spark SQL | Pandas, Scikit-learn, NumPy, Numba |
| Architectural Advantage | Distributed processing with PySpark efficiently transforms large volumes of upstream data. | Significantly improved efficiency, since features are extracted from already pre-processed data. |
| Scalability | PySpark scales horizontally by adding worker nodes as data volume grows. | The dependent libraries lack native distribution support, so we rely on vertical scaling to handle growing data volumes. |
1.1 Data Pre-processing Layer
The goal of this stage is to construct a time-series representation for all articles’ sales and availability over a configurable timeline. In our case, we use a 2.5-year timeframe to enable the model to capture seasonal patterns without overemphasising older historical performance. Although this stage handles large data volumes, it avoids complex statistical or vectorised feature engineering. We exploit this property to implement a fast, distributed processing pipeline using PySpark and Delta Lake running on transient job clusters in Databricks.
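As a simplified illustration of what this layer produces, the sketch below densifies sparse weekly sales rows into a gap-free time series in plain Python; the real pipeline does the equivalent with PySpark and Delta Lake at far larger scale, and the data shapes here are hypothetical:

```python
from datetime import date, timedelta

def to_dense_weekly(observations, start, end):
    """Fill a sparse {week_start: units_sold} mapping into a dense
    weekly series, inserting zeros for weeks without sales rows.
    A pure-Python stand-in for the distributed pre-processing job."""
    series = []
    week = start
    while week <= end:
        series.append((week, observations.get(week, 0)))
        week += timedelta(days=7)
    return series

# Hypothetical sparse sales rows for one (article, warehouse) combination.
raw = {date(2024, 1, 1): 12, date(2024, 1, 15): 7}
dense = to_dense_weekly(raw, date(2024, 1, 1), date(2024, 1, 29))
print(dense)
```

A dense, regular grid is what makes the downstream lag and rolling-window features well-defined: a missing week must mean "zero sales", not "no row".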
1.2 Data Transformation Layer
The transformation layer, implemented as a SageMaker Processing job, handles all feature engineering tasks on the time-series dataset generated in the previous step. Key transformations include:
- deriving historical demand from sales and stock/availability data
- pricing information: initial and discounted prices at a weekly level
- article metadata (category, colour, material, etc.)
- unique identifier per time-series: we treat each combination of (article_id, merchant_id) as a unique entity.
Forecasting-specific features such as target lags/transformations, exogenous feature lags/transformations, and other temporal features are handled later by Nixtla’s MLForecast. This allows us to leverage Nixtla’s optimised transformations (with Numba under the hood).
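The first bullet above, deriving demand from sales and availability, deserves a note: observed sales undercount true demand whenever an article was out of stock. A common, simple correction (an illustrative assumption here, not necessarily our exact method) scales sales by the share of the week the article was available:

```python
def estimate_demand(units_sold, availability_share, min_share=0.1):
    """Estimate weekly demand from observed sales and the share of the
    week the article was in stock. When an article was unavailable for
    part of the week, observed sales undercount true demand, so we
    scale up. The clamp avoids exploding estimates for weeks in which
    the article was barely available."""
    share = max(availability_share, min_share)
    return units_sold / share

print(estimate_demand(8, 1.0))  # fully available: demand equals sales
print(estimate_demand(8, 0.5))  # in stock half the week: estimate doubles
```

Training the forecaster on this uncensored demand signal, rather than raw sales, prevents it from learning that out-of-stock weeks imply low demand.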
2. Model Training and Predictions
After extensive experimentation with deep learning models like TFT and other machine learning approaches, we selected LightGBM integrated with Nixtla’s MLForecast interface as the foundation of our demand forecasting pipeline. This stack offers significant advantages: high-level abstractions for time-series-specific feature generation with optimised performance, rapid prototyping through shorter feedback loops, and access to a robust, well-maintained open-source ecosystem. Because the model has a lightweight training footprint, we can avoid complexity such as checkpointing or separate inference infrastructure: model training and inference are executed in a single pipeline using AWS SageMaker Training Jobs. This approach reduces complexity, lowers infrastructure costs, and accelerates the pipeline. The final output of this stage is a 12-week probabilistic demand forecast for each (article_id, merchant_id, week) combination.
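To illustrate how a point model can yield the probabilistic output described above, here is a minimal conformal-style sketch: widen each point forecast by an empirical quantile of holdout residuals. MLForecast ships conformal intervals out of the box; this stand-alone version only conveys the idea, and the residual values are made up:

```python
def conformal_interval(point_forecast, holdout_residuals, coverage=0.8):
    """Widen a point forecast into a prediction interval using the
    empirical quantile of absolute holdout residuals: the core idea
    behind conformal prediction intervals."""
    errs = sorted(abs(r) for r in holdout_residuals)
    # Index of the desired empirical quantile (simple, conservative rounding).
    k = min(len(errs) - 1, int(coverage * len(errs)))
    width = errs[k]
    return point_forecast - width, point_forecast, point_forecast + width

residuals = [-3, 1, -2, 4, 2, -1, 3, -4, 1, 2]  # hypothetical holdout errors
lo, mid, hi = conformal_interval(100.0, residuals, coverage=0.8)
print(lo, mid, hi)
```

The resulting quantile bands are exactly what the downstream Monte Carlo optimiser consumes: a distribution over demand rather than a single number.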
3. Post Processing
Finally, we process the demand predictions to ensure a time series representation suitable for downstream optimisation algorithms. This stage also includes a statistical analysis of model performance and the computation of key business metrics. These metrics are seamlessly integrated into our monitoring and alerting ecosystem, facilitating proactive detection of model drift. The post-processing is implemented using AWS SageMaker Processing Jobs, while the monitoring and alerting system utilises AWS CloudWatch alarms and AWS Lambda functions to deliver alerts to relevant channels.
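One standard metric for monitoring the quality of probabilistic forecasts is the pinball (quantile) loss, which we sketch here; the example numbers are hypothetical:

```python
def pinball_loss(y_true, y_pred, quantile):
    """Pinball (quantile) loss: the standard score for a single quantile
    prediction of a probabilistic forecast; lower is better. It penalises
    under- and over-forecasting asymmetrically according to the quantile."""
    diff = y_true - y_pred
    return quantile * diff if diff >= 0 else (quantile - 1) * diff

# For a P90 forecast, actuals landing above it are penalised 9x more
# heavily than actuals landing the same distance below it.
print(pinball_loss(120, 100, 0.9))
print(pinball_loss(80, 100, 0.9))
```

Tracking this loss per quantile over time is a natural input to drift alarms: a drifting model degrades the score before business KPIs visibly suffer.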
Our weekly forecasting pipeline processes 3 years of historical data for 5 million SKUs (size and colour) using a sliding window approach, and completes in under 2 hours. This high-performance pipeline is enabled by a deliberate focus on data-model design and I/O efficiency. By leveraging zFlow and AWS-native services, we maintain a low total cost of ownership while ensuring reliability and scalability.
Translating Demand Forecasts into Actionable Inventory Strategies
With demand forecasts in hand, the next crucial step is determining how to effectively utilise this information. How can we extract value from these stochastic predictions and apply them to real-world inventory management decisions?
Our inventory optimisation service provides both real-time and batch recommendations for optimal stock levels across all article SKUs for each partner. The real-time system allows partners to interactively adjust recommendations based on their specific inventory and stock parameters. Once these settings are established, we proactively cache both the settings and the resulting recommendations on a daily basis. This ensures that our offline batch process consistently delivers up-to-date, dynamic recommendations, taking into account the latest inputs, forecasts, and stock states.
The following diagram illustrates both the real-time and batch prediction processes, flowing from right to left.
1. Feature Generation
Similar to the demand forecaster approach, feature generation is divided into two components. Transformations that can be fully implemented in PySpark are handled within Databricks, while operations that require the SciPy or NumPy ecosystem are performed in a SageMaker Processing job. The final output of the feature generation process is a detailed feature vector for each SKU. This vector includes historical outbound data, inventory states, inbound volumes, pricing information, article metadata, cost factors, return lead time weights, and probabilistic demand forecasts for the next 12 weeks.
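For illustration, the per-SKU feature vector could be modelled as below; the field names and types are assumptions made for this sketch, not the actual Feature Store schema:

```python
from dataclasses import dataclass

@dataclass
class SkuFeatureVector:
    """Illustrative shape of the per-SKU feature vector consumed by the
    optimiser; the real schema lives in the SageMaker Feature Store."""
    article_id: str
    merchant_id: str
    historical_outbound: list       # weekly outbound units
    on_hand_stock: int              # latest known stock state
    in_transit: int                 # stock already inbound
    weekly_prices: list             # initial / discounted prices
    cost_factors: dict              # e.g. storage, inbound, operations
    return_lead_time_weights: list  # distribution of return delays
    demand_forecast_quantiles: dict # e.g. {"p50": [...12 weeks...], "p90": [...]}

vec = SkuFeatureVector(
    article_id="art-1", merchant_id="m-1",
    historical_outbound=[5, 7, 6], on_hand_stock=40, in_transit=12,
    weekly_prices=[19.99, 17.99], cost_factors={"storage": 0.1},
    return_lead_time_weights=[0.6, 0.3, 0.1],
    demand_forecast_quantiles={"p50": [6] * 12, "p90": [9] * 12},
)
print(vec.article_id)
```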
2. Feature Store
The input feature vector generated in the previous step is persisted in the SageMaker Feature Store, which offers both online and offline storage. The offline store, backed by Amazon S3, is designed for cold-storage use cases such as batch pipelines, archiving, and debugging, operating in append mode. It stores daily datapoints and the updated feature vectors resulting from inventory settings changes, ensuring long-term data retention.
While the offline feature store is optimised for cost-efficient, high-throughput data I/O with latency on the order of minutes, the online store is optimised for low-latency, low-throughput access, providing lookups of only the latest valid feature vectors: either the daily generated vectors or the most recent user-triggered updates. It guarantees a latency of 10–20 ms per SKU for both read and write operations, enabling fast interaction for both batch input-generation pipelines and online serving systems.
3. Optimisation
Optimisation here refers to optimising the stock replenishment predictions based on predicted demand and other user inputs about inventory settings. As discussed above, we provide online as well as offline optimisation recommendations for our partners. It’s important to note that the inventory optimisation algorithm and input features are synchronised between the two subsystems (online and offline), ensuring consistency across both engines. The following subsections provide an algorithmic overview of our optimisation approach, followed by the online and offline delivery mechanism for the algorithm.
3.1 Offline Delivery Mechanism
The offline (batch) engine generates daily recommendation reports using finalised inputs from the offline feature store. We execute the optimisation algorithm for the latest inventory setting for all merchants and articles using SageMaker batch transform jobs, followed by a post-processing layer implemented as a SageMaker Processing job. As with the demand forecaster, the post-processing job evaluates our optimisation performance, enabling proactive monitoring of model performance and drift. Once recommendations are computed, they are stored in S3, and a "report generated" notification is published to the respective event stream.
3.2 Online Delivery Mechanism
The online optimisation engine enables partners to interactively optimise predictions based on inventory settings. When partners update their inventory settings, we trigger an orchestrated workflow that queues each update request on AWS SQS. We then use AWS Lambda to poll the queue for updates and serve each update request asynchronously. For each inventory update, we fetch the feature vector for the relevant SKUs from the online feature store and execute the optimisation algorithm with multi-threading parallelism. Once optimal predictions have been calculated, we store the results in S3 and alert the backend systems via a notification in the event stream. Lastly, in addition to serving the online request, we also persist the inventory settings update to the offline feature store, keeping future offline predictions consistent.
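A skeletal version of this asynchronous flow is sketched below, with the feature-store fetch and the optimisation algorithm replaced by a stub; the handler shape follows the standard SQS-to-Lambda event format, while the message fields and stub logic are assumptions:

```python
import json

def optimise_for_skus(skus, settings):
    """Stub for the optimisation algorithm. The real system reads feature
    vectors from the online feature store and writes results to S3; the
    'target_weeks' heuristic here is purely illustrative."""
    return {sku: {"recommended_stock": settings.get("target_weeks", 4) * 25}
            for sku in skus}

def handler(event, context=None):
    """AWS-Lambda-style handler: each SQS record carries one inventory
    settings update, which we process asynchronously."""
    results = []
    for record in event["Records"]:
        update = json.loads(record["body"])
        recs = optimise_for_skus(update["skus"], update["settings"])
        results.append({"merchant_id": update["merchant_id"],
                        "recommendations": recs})
    return results

# Simulated SQS event containing one settings-update message.
event = {"Records": [{"body": json.dumps({
    "merchant_id": "m-1",
    "skus": ["sku-1", "sku-2"],
    "settings": {"target_weeks": 6},
})}]}
print(handler(event))
```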
Key scalability takeaways
Our approach prioritises scalability in three key areas:
- Robust Pipelines: We leverage a robust infrastructure combining Databricks and AWS SageMaker for data transformation/processing and model training/inference. Every run spins up dedicated Databricks job clusters and SageMaker processing/training jobs. This ensures robust, independent runs and resources: a failure of one execution in a Databricks job cluster does not impact a parallel execution.
- Fast data and vectorised transformations: For data and vector transformations, we rely on PySpark, Numba and Joblib multi-core parallelisation. Whenever possible, we vectorise operations and rely on Numba, which often offers a 2–3× speedup over NumPy.
- Light models: We leverage Nixtla’s MLForecast with conformal inference and LightGBM for probabilistic forecasts. Beyond the speed and scalability of LightGBM, we want to emphasise the benefits of a library like Nixtla, which automates many of the time-series features and processes required just before training.
We're hiring! Do you like working in an ever-evolving organisation such as Zalando? Consider joining our teams as an Applied Scientist!