
Causal Forecasting For Pricing

This paper proposes a novel causal forecasting method for demand prediction in pricing contexts. The method combines double machine learning for causal inference with transformer models for forecasting. Experiments on synthetic and real-world retail data show the method outperforms standard forecasting when prices change, while performing similarly otherwise.

Causal Forecasting for Pricing

Douglas Schultz¹, Johannes Stephan¹, Julian Sieber¹, Trudie Yeh¹, Manuel Kunz¹, Patrick Doupe¹, and Tim Januschowski¹
¹ Zalando SE
{douglas.schultz, johannes.stephan, julian.sieber, trudie.yeh, manuel.kunz, patrick.doupe, tim.januschowski}@zalando.de
January 24, 2024

arXiv:2312.15282v2 [stat.ML] 23 Jan 2024

Abstract

This paper proposes a novel method for demand forecasting in a pricing context. Here, modeling the causal relationship between price as an input variable to demand is crucial because retailers aim to set prices in a (profit) optimal manner in a downstream decision making problem. Our methods bring together the Double Machine Learning methodology for causal inference and state-of-the-art transformer-based forecasting models. In extensive empirical experiments, we show on the one hand that our method estimates the causal effect better in a fully controlled setting via synthetic, yet realistic data. On the other hand, we demonstrate on real-world data that our method outperforms forecasting methods in off-policy settings (i.e., when there's a change in the pricing policy) while only slightly trailing in the on-policy setting.

1 Introduction

Time series forecasting in practical applications commonly feeds into decision problems in multiple domains (Petropoulos et al. 2021). We consider the special case of an online fashion retailer, where demand forecasts play a key role in setting optimal prices for a large collection of articles (Li et al. 2021). Our task consists of predicting demand subject to different price levels or discounts which the online retailer controls at least partially.¹

¹ There is also a competitive component in pricing that we ignore in the context of this work.

Our use case requires two types of estimates to make pricing decisions. First, we need to predict demand at different prices for multiple weeks in the future. In online retail contexts the focus is mainly on predicting demand levels (Seeger, Salinas, and Flunkert 2016; Wen et al. 2017; Kunz et al. 2023a). Second, we need to understand the causal effect of price changes on demand to choose among price levels. The price elasticity of demand is the percentage change in demand for a percentage change in price. An elasticity is useful in setting prices as in simple cases an elasticity can be used with marginal costs to set optimal prices (Phillips 2021). Our use case is more complex. We also need the forecasted level of demand at different prices for multiple weeks in the future. So we need to combine forecasts with causal inference to make good pricing decisions.

Our paper bridges the gap between forecasting and causal inference in the context of demand forecasting for pricing. We take an opinionated approach in the sense that predictive accuracy is what we focus on, but the model we present here heavily leans on causal inference machinery, in particular the Double Machine Learning (DML) framework (Chernozhukov et al. 2017). Our contributions are as follows:
• We present a novel forecasting modeling framework using the classic DML split into an outcome model, a treatment model and an effect model. For each model, we use state-of-the-art transformer based models.
• We design & provide synthetic, but realistic data for empirical evaluations in a fully-controlled environment on the one hand, and show on the other hand, how real-world data can be used in counterfactual scenarios for effective evaluation via commonly occurring natural experiments or how to mimic them effectively.

In empirical evaluations, we show that our model performs roughly on par with state-of-the-art forecasting models in a standard, on-policy setting, but has a clear advantage in off-policy settings where the forecast horizon contains price policies that haven't been observed in the training set.

Our paper is structured as follows. We formalize the problem setting in Section 2. We present our model in Section 3 and evaluate it in Section 4 on both synthetic, open source data sets and a real-world, closed source data set. We discuss related work in Section 5 and conclude in Section 6.

2 Problem Setting and Background

For any time series $x$, $x_{0:T}$ is short-hand for $[x_0, x_1, \ldots, x_T]$. The observational time series data of an article $i$ at time $t$ starting at 0 is given by $\{q_{i,0:t}, d_{i,0:t}, z_{i,0:t}\}$, where $q$ denotes the demand, $d$ corresponds to the discount, which is the percentage of price reduction relative to the article's recommended retailer price; and $z$ a set of article specific covariates. These can include past demand in particular, but also time-independent variables such as catalog information. The object of interest is

$P(q_{i,t+1:t+h} \mid do(d_{t+1:t+h}), q_{0:t}, d_{0:t}, z_{0:t+h}; \theta)$,   (1)

that is, the probability distribution of demand in the forecast horizon $t+1 : t+h$ conditioned on covariates and discounts in the forecast horizon on which we can intervene, hence $do(d_{t+1:t+h})$. A standard approach is to simplify (1) to a conditional expectation that is estimated via some time series model, without explicitly modeling the effect of interventions.

$E[q_{i,t+1:t+h} \mid d_{t+1:t+h}, q_{0:t}, d_{0:t}, z_{0:t+h}]$.   (2)
To model these interventions we often assume conditional ignorability, positivity and consistency (Hernán and Robins 2010; Chernozhukov et al. 2017; Cunningham 2021). In this work we do not assume these as we're interested in improving our forecasts, not estimating treatment effects. For instance, given the dynamic patterns in the data we might not adjust fully for all confounders and not meet conditional ignorability. Meeting these assumptions will result in unbiased treatment effect estimates and improve estimates.

A standard approach to estimate the effect of an intervention is via DML, which we introduce briefly. While DML is typically used to estimate binary or discrete treatment effects (Chernozhukov et al. 2017), we take ideas from DML for estimating the effect of a continuous treatment variable: weekly average discount, with an outcome of demand. As in Chernozhukov et al. (2017), we introduce DML using a partial linear model:

$q = d\theta + g(z) + u, \quad E[u \mid z, d] = 0$   (3)
$d = m(z) + v, \quad E[v \mid z] = 0$

Here our target $q$ (demand) depends on the control input $d$ (discounts), effects of the environment $z$ and independent noise $u$. $\theta$ is the linear effect of $d$ on our target $q$, and thus the causal parameter of interest. The effect of $z$ on $q$ is passed through the function $g$ that can adopt any shape. Furthermore, the treatment $d$ is affected by our environment $z$ via $m$ as well as some independent random component $v$.

DML undergoes two stages: the nuisance stage and the effect stage. The nuisance stage includes two nuisance models which predict treatment (discount) and outcome (demand), whereas the latter is computed without using future discount as input. The ground truth treatment and outcome are then residualized using the predictions of these nuisance models and passed on to the effect model in order to compute a treatment effect. The final output is then the output of the effect model taken together with the output of the outcome model and the desired treatment. Typically, all three of these models are trained separately with separate losses. Note that the training of the effect model uses the output of the nuisance models and therefore requires a special treatment.

The benefit of orthogonalization is that we account for regularization bias (Chernozhukov et al. 2017), which affects S-Learners² like (2). In the standard practice, discounts are treated like any other independent variable and thus regularized/shrunk in order to improve predictions. This regularization biases estimates of the causal effect between discounts and demand (Chernozhukov et al. 2017).

² S for Single learner. We can calculate treatment effects by augmenting the treatment feature and subtracting. E.g. $E[q \mid d = 0.5] - E[q \mid d = 0.4]$. This is otherwise known as G-computation in epidemiology and other fields.

3 DML Forecaster

Our approach for a causal forecaster follows the DML approach and it hence consists of three submodels that we introduce in the following. Fig. 1 depicts the high-level architecture.

Figure 1: The architecture of the DML Forecaster.

The Nuisance Models. Each of the two nuisance³ models provide estimates $\tilde{q}$ and $\tilde{d}$ of $q$ and $d$ given $z$ respectively. We call the model that provides $\tilde{q}$ the outcome model and the model that provides $\tilde{d}$ the treatment model. Here, we choose standard transformer-based forecasting models (Vaswani et al. 2017) for their robustness and proven performance in an online retail setting (Eisenach, Patel, and Madeka 2020; Rasul et al. 2021; Zhou et al. 2021a).

³ We use this term to remain consistent with the causal inference literature; however, the outcome nuisance model is of primary interest for our use case.

Each of the outcome and treatment prediction models have the same architecture and only differ in target and final activation. We use softplus as the final activation function in the outcome model to enforce positivity. For our treatment model, we do not pass the (linear) combination learned by the last layer through a (non-linear) activation function. As we will see from the functional form of the effect model head, this is helpful given the multiplicative nature of the effect model. Around each attention step, there is a residual connection, and after each attention step there is a position-wise feed forward network with layer normalization and dropout. We use an L1 loss for training our nuisance models on real-world data and an L2 loss when fitting our synthetic data set. Other choices of losses are possible and our approach readily extends to these, in particular for probabilistic scenarios (Gneiting, Balabdaoui, and Raftery 2007).

The Effect Model. The effect model combines the treatment and outcome models to provide the final estimate of demand $q$ in our DML Forecaster which we denote as $\hat{q}$. We now show how our model estimates the price elasticity of demand

$\epsilon := \frac{\Delta q}{q} \cdot \frac{p}{\Delta p}$,   (4)

where $q$ is demand of an article and $p$ is the price. If we assume mild integrability conditions, then basic integration gives us

$q_1 = q_0 \left( \frac{p_1}{p_0} \right)^{\epsilon}$,   (5)

where $q_i$ is the demand at price $p_i$ (see Appendix B for details).
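The step from the elasticity definition in Eq. (4) to the power law in Eq. (5) can be sketched as follows. This is only an outline under our own reading of the "mild integrability conditions": we assume that $\epsilon$ is constant along the price path from $p_0$ to $p_1$ and that $q$ and $p$ are positive and differentiable. Appendix B remains the reference for the authors' argument.

```latex
\begin{align*}
\epsilon = \frac{dq}{q}\cdot\frac{p}{dp}
  \;&\Longrightarrow\; \frac{dq}{q} = \epsilon\,\frac{dp}{p}
  && \text{(infinitesimal form of Eq. (4))} \\
\int_{q_0}^{q_1}\frac{dq}{q} = \epsilon\int_{p_0}^{p_1}\frac{dp}{p}
  \;&\Longrightarrow\; \ln\frac{q_1}{q_0} = \epsilon\,\ln\frac{p_1}{p_0}
  && \text{(constant } \epsilon\text{)} \\
  \;&\Longrightarrow\; q_1 = q_0\left(\frac{p_1}{p_0}\right)^{\epsilon}
  && \text{(Eq. (5))}
\end{align*}
```

As an illustration with made-up numbers: for $\epsilon = -2$ and a 10% price cut ($p_1/p_0 = 0.9$), the relation gives $q_1/q_0 = 0.9^{-2} \approx 1.23$, i.e. roughly 23% more demand.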
Our idea is to parameterize $\epsilon$ by a neural network. Given retail price $x$, we can write the discounted price as $x \cdot (1 - d_t)$ where $d_t$ is the discount at time $t$. Furthermore, we assume that the forecast of the outcome model $\tilde{q}_t$ is an estimate of the sales at the price level predicted by the discount model, $x \cdot (1 - \tilde{d}_t)$. Substituting these into Eq. (5), we can compute our final demand estimate $\hat{q}$ at time $t$ as

$\hat{q}_t = \tilde{q}_t \left( \frac{1 - d_t}{1 - \tilde{d}_t} \right)^{\psi(z)}, \quad 1 \le t \le s$,   (6)

where $\psi$ is a transformer model whose output is the elasticity $\epsilon$ in (5) and $s$ is the length of the forecast horizon. Note that while $\epsilon$ is assumed to be constant here, it still is parameterized over $z$ so it can vary by features used in estimation. Our model to parameterize $\epsilon$ is similar to the nuisance models and only lacking the decoder self-attention as we expect elasticity to be relatively constant within the forecast horizon. The outcome model accounts for the auto-regressive part of each time series. We use a negative softplus as final activation as we expect elasticity to be negative (Varian 2014, Chapter 15) and an L1 loss for training.

For training the nuisance and effect models, we deploy a two-stage training process, where we fit the nuisance models in the first stage and the effect model in the second stage. The first stage nuisance models generate estimates for the second stage effect models.

To avoid overfitting, we deploy two-fold cross-fitting during training in a similar manner to (Chernozhukov et al. 2017, Section 3). We have an even and odd copy of each nuisance model, each of which is trained on one half of the data set. We use nuisance models trained on odd data to infer outcomes for even data, and vice-versa. This data is used to train a single effect model.

The splitting of the data set into even and odd parts is done according to the index of the item $i$. In the particular instance of demand forecasting, we can derive an index from article information such that articles of the same size are guaranteed to be either even or odd indexed while still having a (close to) random split between different articles.

Inference with the DML Forecaster. Once the model is trained, we need to infer future outcomes for different discount levels. We combine two methods here, one influenced by the above cross fitting procedure and one influenced by standard forecasting methods. We ensemble these two methods with a geometric mean, where cf indicates cross fitting and f indicates forecasting:

$\hat{q}_t = \sqrt{ \tilde{q}_t^{cf} \left( \frac{1 - d_t}{1 - \tilde{d}_t^{cf}} \right)^{\psi(z)} \cdot \tilde{q}_t^{f} \left( \frac{1 - d_t}{1 - \tilde{d}_t^{f}} \right)^{\psi(z)} }$   (7)

In the cross fitting procedure we pass the odd (even) batches to the even (odd) nuisance models, and then receive an inference from the effect model. We do this to account for potentially overfit models.

The standard forecasting practice is to use the model trained on old data to infer future outcomes. To implement this we pass even (odd) batches to the even (odd) nuisance models, and then pass the output to the effect model. This has the advantage of forecasting ahead on items that the nuisance models have seen during training.

Discussion: Departures from the DML Literature

We're interested in forecasting demand levels for different discount rates. The DML literature is interested in estimating changes in demand levels for changes in discount rates. Although we use cross fitting to train our model, at inference we depart from this, for improved forecast performance. Second, we use a single effect model, instead of separate, averaged treatment effect estimations on each half of the dataset. Third, we use an outcome model that reflects our understanding of the problem space, and not one justified for treatment effect estimation.⁴

⁴ We could have assumed an outcome function which depends neither non-linearly nor log-linearly on treatment, and used a learned weighted sum of the raw output of the effect transformer as the final output, with the small modification of also providing the nuisance outputs and true discounts to the effect transformer. Such a model showed similar metrics in preliminary experiments.

4 Experiments

In this section, we present experimental results of the DML Forecaster in a fully controlled setting with synthetic data and on real-world data. We start by discussing practical details around the DML Forecaster.

Baseline Models and Accuracy Metrics

We compare the DML Forecaster to the following models:
• Naïvely-causal Transformer (TF): A time-series transformer architecture with a special output head that models price elasticity more generally than (5) via a piecewise linear, monotone function (Kunz et al. 2023b).
• SARIMAX: A vanilla seasonal ARIMA model with exogenous covariates. In cases where the training length was less than 30, or the model fitting process failed, we use the previous week's value as a fallback. For our experiments, we use Darts 0.21.0 (Herzen et al. 2022); covariates such as stock and discount variables from previous time steps were included, and preprocessing involved log transformation and forward filling of missing values in demand, stock (in $z$), and discounts.
• TWFE elasticities: A standard econometric baseline via a causally informed, elasticity-based forecast using a two-way fixed-effect Poisson regression model (Bergé 2018). Appendix C contains more details.
• sDML: As part of our ablation study, this model implements the DML Forecaster (see Section 3) without the nuisance model for predicting the treatment. Instead the treatment is provided directly to the effect model without residualization.
• No Cross Fitting: Cross fitting is applied to the DML Forecaster as described in Section 3. For our ablation study, we create variants of sDML and DML models without cross fitting (sDML-no cf and DML-no cf).

We have chosen these models to represent the variety of approaches typically deployed for such problems (Januschowski et al. 2020): (i) local forecasting models
(SARIMAX), (ii) econometric approaches (TWFE) and (iii) global, transformer-based forecasting methods.

For the accuracy metrics, we use standard metrics mean absolute error (MAE) and mean squared error (MSE) (Hyndman and Athanasopoulos 2017), and the so-called demand error, a metric that captures the down-stream pricing dependency (see Kunz et al. (2023a)):

$D_{T,h} = \sqrt{ \frac{ \sum_i \sum_{T=t+1}^{t+h} b_i (\hat{q}_{i,T} - q_{i,T})^2 }{ \sum_i \sum_{T=t+1}^{t+h} b_i \, q_{i,T}^2 } }$.   (8)

Here, $t$ is the last timepoint in the training set, $h$ is the forecast horizon, $\hat{q}_{i,T}$ is the prediction for article $i$ at timepoint $T$, $q_{i,T}$ is the corresponding true demand and $b_i$ is the recommended retail price of article $i$.

Hyperparameter Tuning

The following provides an overview on how we select the hyper-parameters. More details are in Appendix D.

Synthetic Dataset. We use Bayesian optimization (Akiba et al. 2019) to tune key hyperparameters of the DML Forecaster and TF. To mimic a realistic tuning, we use the data of the first 50 weeks of our simulated data whereas we keep weeks 46-50 as a hold out set to select the best hyperparameters, and thus use the first 45 weeks for training. In the case of DML-no cf, we reuse the same hyperparameters found for the DML Forecaster. For sDML and sDML-no cf, we only need to re-tune the effect model, as the nuisance outcome model is used the same way as in our DML Forecaster.

Real-World Data. Both nuisance models have an input dimension of 66, with multiple attention layers in encoder and decoder, and 22 attention heads. The batch size for all nuisance models is 1200 time series windows, and each had a learning rate scheduler of the form $lr_n \mapsto \exp(\alpha) \cdot lr_n := lr_{n+1}$, where $lr_n$ is the learning rate in the $n$th training step. For the effect model we use twice the batch size as for the nuisance models (2400) which is due to the cross-fitting procedure (see Section 3). Moreover, we use a simple learning rate scheduler of the form

$lr \mapsto \frac{lr}{\sqrt{n+1}} := lr_n$

where $lr_n$ is the learning rate after the $n$th training step and $lr$ is the initial learning rate.

Experiments on Synthetic Data

We start by providing a high-level overview on the construction of synthetic data to evaluate our approach in a controlled setting (see Appendix E for further details on the data generating process).

We simulate entire life cycles (100 weeks, typical in the online fashion industry) of around 4500 stock keeping units. Demand $q_{it}$ in a given week $t$ of article $i$ is a linear function of price $p_{it}$ and an article specific factor $e_i$ (treatment effect) as well as a base demand $q^{(b)}_{it}$, i.e.

$q_{it} = q^{(b)}_{it} + p_{it} e_i$   (9)

Here treatment effects $e_i$ are article dependent, but constant over time. Note however that elasticity will not be constant over time: $\epsilon_{i,t} = \frac{e_i \, p_{it}}{q^{(b)}_{it} + p_{it} e_i}$.

The base demand $q^{(b)}_{it}$ is the product of two time dependent components: a noisy trend $\tau_{it}$ that either leads to a linear increase/decrease of demand over the course of the article life cycle, and a seasonality term $s_{it}$:

$q^{(b)}_{it} = (\tau_{it} \cdot s_{it} + 1) \cdot (c_i \lambda_{it} + \eta_{it})$.   (10)

The seasonality has a period of 30 weeks with an article-dependent phase shift in order to simulate different season types. In addition, we scale our time-dependent component with an article-specific factor $c_i$ as well as independent additive and multiplicative noise ($\eta_{it}$ and $\lambda_{it}$ respectively). Note, because of the product form in (10), our simulated noise is scale dependent on the base demand. Given our recipe to generate demand, we initialize the simulation for each article $i$ at week $t = 0$ by setting an initial stock and price $p_{i1}$. Our goal is to clear the given stock at $t = 99$, the season end. We therefore simulate a pricing policy that, at any given week $t > 3$, computes the average demand over the past four weeks ($t-1, t-2, \ldots, t-4$). We then use this estimate to predict the week number at which the given article $i$ will run out of stock by mere linear extrapolation. If we estimate to clear our stock after $t = 99$, we decrease our price by 10% w.r.t. our base price $p_{i0}$ in order to set $p_{it}$. Conversely, if we expect to clear stock before season end, we increase $p_{it}$ by 10%.⁵

⁵ We will open-source the data and data generation process (implemented in (Alexandrov et al. 2020)) as part of the publication.

[Figure 2 plot: two panels, "no-discount scenario" (left) and "full-discount scenario" (right); x-axis: week; left y-axis: demand; right y-axis: discount in %; legend: demand, demand (off policy), TF, DML Forecaster, discount.]
Figure 2: Synthetic demand time series (black), the associated realized discount (green) and off-policy forecasts for DML Forecaster (blue) as well as TF (cyan).

Importantly, using such a pricing strategy, treatment is confounded by the long-term seasonal pattern of simulated demand (see example time series in Fig. 2). This leads to higher article discounts when the seasonal component of the simulation is low (Fig. 2, left panel) and lower discounts when seasonal demand is high (Fig. 2, right panel). We chose a total of four different periods for training: weeks 20-65, 30-75, 40-85, as well as weeks 50-95 and evaluate alternative methods on the five weeks that follow each training interval (weeks 66-70, 76-80, 86-90 and weeks 96-100 respectively). The evaluation consists of two parts: on-policy evaluation, where we predict demand under the pricing policy used in the simulation, as well as off-policy evaluation,
where we predict demand under five alternative discount levels that range from 0-50% discount (w.r.t. our initial price $p_{i0}$). We repeat training and inference of all models five times to compute empirical standard deviations. In Fig. 2 we show off-policy predictions of the DML Forecaster and TF when applying 0% discount to weeks 65-70 (left panel) and the full discount (50%) respectively (right panel).

In addition to computing the standard metrics MAE and MSE on on- and off-policy ground truth, we also report how accurately our methods predict the treatment effect (MAE effect and MSE effect), as this parameter is directly modeled and inferred by each of the alternative models.

Model adaptations for this simulation study. Because we deviate from the real-world constant elasticity assumption, we adapt the head effect model as introduced in Eq. (6) accordingly, i.e. our final output is computed as

$\hat{q}_t = \tilde{q}_t + \psi(z) \cdot (d_t - \tilde{d}_t)$   (11)

where $\tilde{q}_t$, $d_t$, and $\tilde{d}_t$ are defined as in Eq. (6).

We change TF accordingly, i.e. the head is computed as in Eq. (11), but we set $\tilde{d}_t = 0$.

Results. We find that our DML Forecaster (DML) consistently outperforms TF when it comes to predicting demand under off-policy price changes (see Table 1). On policy, the difference between both models is not significant⁶. Furthermore, we find that the advantage of using DML over TF is increasing with the size of the treatment effect (Fig. 3).

⁶ W.r.t. computed empirical standard deviations.

              MAE                       MSE
Model type    Off policy   On policy    Off policy    On policy     MAE effect   MSE effect
TF            16.3±0.5     11.5±0.4     745.7±38.6    490.6±19.4    45.8±1.0     3350.4±164.6
DML           12.4±0.7     10.0±0.7     658.6±40.6    472.9±33.9    25.0±1.7     1743.9±187.7
DML-no cf     12.4±0.7     10.1±0.7     663.2±49.0    473.6±33.4    22.9±2.7     1458.2±212.9
sDML          20.5±0.5     11.0±0.7     922.3±34.7    501.8±36.0    89.1±0.7     10356.5±251.9
sDML-no cf    20.5±0.6     11.0±0.7     919.4±37.2    499.8±35.7    89.5±1.1     10424.0±219.8

Table 1: Error metrics predicting out-of-sample demand in study of 4500 simulated articles. See text for further details.

[Figure 3 plot: x-axis: MAE(DML Forecaster) − MAE(TF); y-axis: treatment effect.]
Figure 3: The improvement of the DML Forecaster (the more negative on the x-axis the more improvement) over TF increases with more elastic articles.

Table 1 further contains an ablation study which shows the results of two-stage methods that only learn a nuisance model for predicting demand (sDML and sDML-no cf) and find that they generally perform inferior in off-policy settings and in terms of estimating the effect of price changes.

With this simulation setup, we cannot confirm the benefit of using cross-fitting as the performance of DML and DML-no cf as well as (sDML and sDML-no cf) does not differ significantly across all error metrics we report here (Table 1). We have three explanations for this. First, that the ensembling in (7) removes the benefit of cross fitting. Second, that some residual confounding may be large enough to obscure the benefits of cross fitting. Last, cross fitting is implemented to improve efficiency and statistical power. We may have enough data to fit the model.

Cyberweek: Off-policy Discount Increase

One way to test the price response of the models considers certain time periods where the discount policy follows a shifted distribution. In particular, cyber week is such a yearly event when many articles have discounts that are much higher than normally seen during the year. For example, in Fig. 4 we look at the difference in discounts in cyber week 2022 versus two weeks prior at the article level and we see a general right-ward shift, which indicates the general increase in discounting on this special week. Naturally, similar discount ranges occur during the same week in years prior, so it would not be an interventional test if each model saw these discount-time distributions in training. In order to test our hypothesis, we therefore discard cyber week, cyber week −1, and cyber week +1 from our training data and replace them with a set of 3 consecutive weeks that are resampled from the same article. We refer to data sets with discarded and replaced weeks as off-policy and to data sets without this replacement as on-policy.

We validate each model on 2021 cyber week, both on- and off-policy, and test each model on 2022, 2020, and 2019 cyber weeks on- and off-policy. Each experiment has a forecast horizon of cyber week and cyber week +1, while training on two years of article histories up until cyber week −1. The number of articles at inference time was 410,500 for 2022, 208,212 for 2020, and 144,980 for 2019. We give an in depth qualitative description of our data in Appendix F.
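The construction of the off-policy training data for the cyber week experiment (discard cyber week and its two neighbouring weeks, then fill the gap with 3 consecutive weeks resampled from the same article) can be sketched as below. This is a minimal illustration and not the authors' pipeline: the function name `make_off_policy`, the uniform choice of the donor window, and the rule that the donor window must not overlap the discarded weeks are all assumptions made here for the example.

```python
import numpy as np


def make_off_policy(series, cw, rng):
    """Replace weeks cw-1, cw, cw+1 of one article's weekly series with
    3 consecutive weeks drawn from elsewhere in the same series.

    Assumptions (not from the paper): the donor window is sampled uniformly
    and may not overlap the discarded weeks. `series` is indexed by week on
    axis 0, so a 2-D array (weeks x features) is handled the same way.
    """
    series = np.asarray(series)
    n = len(series)
    if not 1 <= cw <= n - 2:
        raise ValueError("cw must leave room for cw-1 and cw+1")
    blocked = {cw - 1, cw, cw + 1}
    # Candidate start indices whose 3-week window avoids the discarded weeks.
    starts = [s for s in range(n - 2) if not blocked & {s, s + 1, s + 2}]
    if not starts:
        raise ValueError("series too short to resample a replacement window")
    start = int(rng.choice(starts))
    out = series.copy()
    out[cw - 1:cw + 2] = series[start:start + 3]
    return out, start


# Toy usage: the value of each entry equals its week index.
rng = np.random.default_rng(0)
weeks = np.arange(10)
off_policy, start = make_off_policy(weeks, cw=5, rng=rng)
```

In this toy call the discarded weeks are 4, 5 and 6, so the only admissible donor windows start at week 0, 1 or 7. The on-policy data set corresponds to leaving `weeks` untouched.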
                            Off policy                          On policy
Target Date   Metric        DML Forecaster   TF      SARIMAX    DML Forecaster   TF      SARIMAX
21-11-2022    Demand Error  61.48            80.03   81.33      60.00            54.73   81.08
23-11-2020    Demand Error  65.94            88.05   78.98      62.50            57.37   78.75
25-11-2019    Demand Error  63.61            63.18   73.03      61.85            57.29   69.59
21-11-2022    MAE           7.739            10.14   9.99       7.606            6.931   9.96
23-11-2020    MAE           12.92            17.39   14.82      12.39            11.34   14.78
25-11-2019    MAE           12.68            12.52   14.31      12.19            11.40   13.94
21-11-2022    MSE           2047             2540    2225       1903             1940    2196
23-11-2020    MSE           5018             7361    4891       5092             5032    5062
25-11-2019    MSE           5075             5630    4348       5446             5448    4472

Table 2: Table of metrics for experiment dates considering TF and DML Forecaster for both off and on policy evaluation for cyberweek. All models were trained with an L1 loss function. Metrics read from the test epoch output.

Figure 4: A scatter plot of discounts for articles on cyber week 2022 vs. two weeks prior. Each point represents a single article, and the units on the axes are the ratio of discount, with 0 being no discount and 1 being full discount.

Results. As shown in Section 4, we find that the TF has a slight advantage when it comes to predicting demand on-policy, whereas the DML Forecaster yields better results in the off-policy setting, particularly on the MSE. The SARIMAX model is consistently outperformed by both Transformer-based methods.

In addition, we evaluate our methods on control dates not affected by cyber-week sales events (Table 6), and we show how DML improves over TF w.r.t. the degree of discount change (Fig. 5).

5 Related Work

The wider areas of forecasting and causal inference, especially in a pricing context, are well established fields, but typically studied in isolation.

Forecasting with transformers has received considerable attention in the literature in both academic and industrial research (Zhou et al. 2021b; Kunz et al. 2023a; Eisenach, Patel, and Madeka 2020; Lim et al. 2021) and they are generally acknowledged to work well for real-world data sets such as the ones we consider in Section 4. Our approach is generic in the sense that it would work with other transformer-based methods (or indeed, other forecasting methods as long as they allow for the incorporation of covariates). We chose a specific architecture for the ease of implementation and customization to the pricing use case (Kunz et al. 2023a).

In econometrics, demand estimation via price elasticities is of central interest (Deaton and Muellbauer 1980; Fogarty 2010; Hughes, Knittel, and Sperling 2008; DeFusco and Paciorek 2017). Often however, forecasting methods are ignored as the focus is understanding how demand changes when prices or policies change. Recent work has shown how using forecasting algorithms to complement existing econometric techniques can improve causal inference (Goldin, Nyarko, and Young 2022). We do the opposite, using causal inference methods to improve forecast estimates.

Causal forecasting, that is, the intersection of causal inference and forecasting, is typically only mentioned briefly in standard forecasting textbooks (see e.g., (Hyndman and Athanasopoulos 2017)). Similarly, research in causal forecasting is limited to the best of our knowledge. There are some notable exceptions, including the above example using forecasting algorithms to improve causal inference estimates (Goldin, Nyarko, and Young 2022). For example, Vankadara et al. (2022) provide a theoretical framework for differentiating the causal from the statistical risk in forecasting. Our work is more pragmatically oriented in the sense that we do not make assumptions on causal sufficiency (but also do not obtain theoretical guarantees) and rather focus on the empirical validity and evaluation of our approach.

The wider area of estimating counterfactuals is well-studied especially in a medical setting with discrete treatments. Melnychuk, Frauen, and Feuerriegel (2022) recently provide a transformer-based approach for estimating counterfactual outcomes in a discrete treatment setting with medical data. They provide an end-to-end training procedure of three sub-models instead of the multi-stage approach presented here. While a multi-stage approach introduces inefficiencies, the joint loss in (Melnychuk, Frauen, and Feuerriegel 2022) relies on adversarial learning of multiple objectives not in the same domain or scale which is notoriously hard to tune. Our work provides the estimation of a causal parameter of interest, the price elasticity, which (Melnychuk, Frauen, and Feuerriegel 2022) doesn't yield. Similarly, Johansson, Shalit, and Sontag (2016) predict patient outcomes over simulated data using a deep neural network approach that corrects for the treatment bias by a given patient's medical history. Bica et al. (2020) extend this work to a longitudinal setting, estimating counterfactual patient outcome timeseries while accounting for time-varying confounders. A notable exception is (Pawlowski, Coelho de Castro, and Glocker 2020), where factual information of the period of interest is used in the model in order to compute counterfactuals. Approximating counterfactuals by interventional distributions (Johansson, Shalit, and Sontag 2016; Bica et al. 2020; Melnychuk, Frauen, and Feuerriegel 2022) has the advantage that the resulting methods are by design applicable to interventional settings like ours. Conversely, our proposed approach may also be used to estimate counterfactual outcomes.

ased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1): C1–C68.
Cunningham, S. 2021. Causal inference: The mixtape. Yale university press.
Deaton, A.; and Muellbauer, J. 1980. Economics and consumer behavior. Cambridge university press.
DeFusco, A. A.; and Paciorek, A. 2017. The interest rate elasticity of mortgage demand: Evidence from bunching at the conforming loan limit. American Economic Journal: Economic Policy, 9(1): 210–240.
Eisenach, C.; Patel, Y.; and Madeka, D. 2020. MQ-Transformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention. arXiv preprint arXiv:2009.14799.
Fogarty, J. 2010. The demand for beer, wine and spirits: a survey of the literature. Journal of Economic Surveys, 24(3): 428–478.
Gneiting, T.; Balabdaoui, F.; and Raftery, A. E. 2007. Probabilistic forecasts, calibration and sharpness. Journal of the
Royal Statistical Society: Series B (Statistical Methodology),
6 Conclusion and Future Work 69(2): 243–268.
We presented a causal forecasting method in a pricing Goldin, J.; Nyarko, J.; and Young, J. 2022. Forecasting
context via DML. Our model relies on state-of-the-art Algorithms for Causal Inference with Panel Data. arXiv
transformer-based forecasting models and, by incorporating preprint arXiv:2208.03489.
DML, allows for off-policy estimations and a better causal
Hernán, M. A.; and Robins, J. M. 2010. Causal inference.
effect estimation than purpose-built, but causally unaware
forecasting methods. The evaluation of such forecasts is no- Herzen, J.; et al. 2022. Darts: User-Friendly Modern Ma-
toriously difficult and we provide synthetic data as well as chine Learning for Time Series. Journal of Machine Learn-
natural experiment data for such evaluations. ing Research, 23(124): 1–6.
Future work should include a probabilistic treatment and Hughes, J.; Knittel, C. R.; and Sperling, D. 2008. Evidence
the incorporation of inverse propensity scores (Lim, Alaa, of a shift in the short-run price elasticity of gasoline demand.
and Schaar 2018), a more flexible outcome model as well as The Energy Journal, 29(1).
the inclusion of multi-variate forecasting models. Hyndman, R. J.; and Athanasopoulos, G. 2017. Forecasting:
Principles and Practice. OTexts; 2014. www. otexts. org/fpp.,
References 987507109.
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. Januschowski, T.; et al. 2020. Criteria for classifying fore-
2019. Optuna: A next-generation hyperparameter optimiza- casting methods. International Journal of Forecasting,
tion framework. In Proceedings of the 25th ACM SIGKDD 36(1): 167–177. M4 Competition.
international conference on knowledge discovery & data Johansson, F.; Shalit, U.; and Sontag, D. 2016. Learning rep-
mining, 2623–2631. resentations for counterfactual inference. In International
Alexandrov, A.; et al. 2020. GluonTS: Probabilistic and conference on machine learning, 3020–3029. PMLR.
Neural Time Series Modeling in Python. Journal of Ma- Kunz, M.; Birr, S.; Raslan, M.; Ma, L.; Li, Z.; Gouttes,
chine Learning Research, 21(116): 1–6. A.; Koren, M.; Naghibi, T.; Stephan, J.; Bulycheva, M.;
Bergé, L. 2018. Efficient estimation of maximum likelihood Grzeschik, M.; Kekić, A.; Narodovitch, M.; Rasul, K.;
models with multiple fixed-effects: the R package FENmlm. Sieber, J.; and Januschowski, T. 2023a. Deep Learning
CREA Discussion Papers, (13). based Forecasting: a case study from the online fashion in-
Bica, I.; Alaa, A. M.; Jordon, J.; and van der Schaar, M. dustry. arXiv:2305.14406.
2020. Estimating counterfactual treatment outcomes over Kunz, M.; et al. 2023b. Deep Learning based Fore-
time through adversarially balanced representations. arXiv casting: a case study from the online fashion industry.
preprint arXiv:2002.04083. arXiv:2305.14406.
Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Li, H.; et al. 2021. Large-scale price optimization for an
Hansen, C.; and Newey, W. 2017. Double/Debiased/Ney- online fashion retailer. In Innovative Technology at the In-
man Machine Learning of Treatment Effects. American terface of Finance and Operations: Volume II, 191–224.
Economic Review, 107(5): 261–65. Springer.
Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Lim, B.; Alaa, A.; and Schaar, M. v. d. 2018. Forecasting
Hansen, C.; Newey, W.; and Robins, J. 2018. Double/debi- Treatment Responses over Time Using Recurrent Marginal
Structural Networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18.
Lim, B.; Arif, S.; Loeff, N.; and Pfister, T. 2021. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4): 1748–1764.
Melnychuk, V.; Frauen, D.; and Feuerriegel, S. 2022. Causal Transformer for Estimating Counterfactual Outcomes. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvari, C.; Niu, G.; and Sabato, S., eds., Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, 15293–15329. PMLR.
Pawlowski, N.; Coelho de Castro, D.; and Glocker, B. 2020. Deep structural causal models for tractable counterfactual inference. Advances in Neural Information Processing Systems, 33: 857–869.
Petropoulos, F.; et al. 2021. Forecasting: theory and practice. arXiv:2012.03854.
Phillips, R. L. 2021. Pricing and revenue optimization. Stanford university press.
Rasul, K.; Sheikh, A.-S.; Schuster, I.; Bergmann, U.; and Vollgraf, R. 2021. Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows. arXiv:2002.06103.
Seeger, M. W.; Salinas, D.; and Flunkert, V. 2016. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, 4646–4654.
Vankadara, L. C.; Faller, P. M.; Hardt, M.; Minorics, L.; Ghoshdastidar, D.; and Janzing, D. 2022. Causal Forecasting: Generalization Bounds for Autoregressive Models. arXiv:2111.09831.
Varian, H. R. 2014. Intermediate microeconomics with calculus: a modern approach. WW norton & company.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems (NIPS 2017), 30.
Wen, R.; Torkkola, K.; Narayanaswamy, B.; and Madeka, D. 2017. A multi-horizon quantile recurrent forecaster. In NIPS 2017 Time Series Workshop.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021a. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. arXiv:2012.07436.
Zhou, H.; et al. 2021b. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press.

Causal Forecasting for Pricing: supplemental material
Causal Forecasting for Pricing: supplemental material
Douglas Schultz1, Johannes Stephan1, Julian Sieber1, Trudie Yeh1, Manuel Kunz1, Patrick Doupe1, and Tim Januschowski1
1 Zalando SE
{douglas.schultz, johannes.stephan, julian.sieber, trudie.yeh, manuel.kunz, patrick.doupe, tim.januschowski}@zalando.de

A Differences Between Cross-fitting and Sample-splitting


Cross-fitting, as deployed in the DML Forecaster, has a simpler variant called sample-splitting, a reduced version of the cross-fitting DML estimation described in the literature (Chernozhukov et al. 2017). With sample-splitting, we split the training data into two randomized subsets (which we refer to as even and odd), as in cross-fitting. However, there is only one outcome model and one treatment model. The nuisance models are trained on the even part of the dataset, while the effect model is trained on the odd part, with the partial ground truth in (6) provided by the nuisance models’ inference on the odd part. At inference time we use the nuisance and effect models to forecast on the full assortment.
An issue with sample-splitting is that each model is trained on only half of the dataset. The method is therefore less efficient than cross-fitting, since it does not use all of the data for training the nuisance and effect models.
In our use case there is no shortage of data, so it is an open question whether this efficiency is really needed. In addition, cross-fitting increases training and inference time compared with sample-splitting. To understand the trade-off between the two methods, we designed a small experiment that tests the sample-splitting DML model, as described in the first paragraph of this appendix, against the cross-fitting DML model.
We tested the sample-splitting DML model on the three cyber week start dates, both in and out of sample, and on the control
dates in sample in order to compare with the cross-fitting DML model. The hyperparameters for each sub-model in the sample-
splitting DML model are taken from the tuning study where cross-fitting was in place. We also measured training time from the
tensorboard logs.
The cross-fitting DML Forecaster performed slightly better than the sample-splitting DML Forecaster off-policy. For on-
policy, the sample-splitting model performed slightly better than the cross-fitting model on two of the cyber week dates and one
of the control dates. On the remaining dates it performed slightly worse. The training time for cross-fitting was significantly
longer than that for sample-splitting. See Table 7 for the results.
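
The difference between the two schemes can be illustrated on a toy partially linear model. The sketch below is only an illustration under strong assumptions: the nuisance models are plain least-squares lines rather than the transformer sub-models used in the paper, the data-generating process and all names (`fit_line`, `sample_splitting`, `cross_fitting`, `THETA`) are hypothetical, and the effect is a single scalar.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def residuals(train, test):
    """Fit both nuisance models on `train`, return (treatment, outcome) residuals on `test`."""
    outcome = fit_line([r[0] for r in train], [r[2] for r in train])    # E[y | x]
    treatment = fit_line([r[0] for r in train], [r[1] for r in train])  # E[d | x]
    return [(d - treatment(x), y - outcome(x)) for x, d, y in test]

def effect(res):
    """Residual-on-residual regression through the origin."""
    return sum(rd * ry for rd, ry in res) / sum(rd * rd for rd, _ in res)

def sample_splitting(even, odd):
    # One set of nuisance models (trained on even), effect estimated on odd only.
    return effect(residuals(even, odd))

def cross_fitting(even, odd):
    # Two sets of nuisance models; residuals from both halves are pooled.
    return effect(residuals(even, odd) + residuals(odd, even))

# Hypothetical data: y = THETA * d + 1.5 * x + noise, with d confounded by x.
rng = random.Random(0)
THETA = -2.0
rows = []
for _ in range(4000):
    x = rng.gauss(0, 1)
    d = 0.5 * x + rng.gauss(0, 1)
    y = THETA * d + 1.5 * x + rng.gauss(0, 1)
    rows.append((x, d, y))
rng.shuffle(rows)
even, odd = rows[0::2], rows[1::2]

theta_ss = sample_splitting(even, odd)
theta_cf = cross_fitting(even, odd)
print(f"sample-splitting: {theta_ss:.3f}, cross-fitting: {theta_cf:.3f}")
```

In this toy setting both estimates land close to the assumed effect; cross-fitting uses every row for the effect estimate at the cost of fitting the nuisance models twice, which mirrors the efficiency versus training-time trade-off discussed above.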

B Derivation of Eq. (5)


We assume that demand decreases monotonically with price for a specific article at a specific time, with all
other factors being held constant. Further, we assume that elasticity is constant with regard to price, demand, and time. We can
then treat Eq. (5) as an equality of differential forms on the positive real line
\[ \frac{dq}{q} = \epsilon \frac{dp}{p}. \]
If we integrate both sides over some interval $[p_0, p_1]$,
\[ \int_{p_0}^{p_1} \frac{d(q(p))}{q(p)} = \epsilon \int_{p_0}^{p_1} \frac{dp}{p}. \]
We can make the substitution $q(p) = q$ in the left-hand side, and get
\[ \int_{q_0}^{q_1} \frac{dq}{q} = \epsilon \int_{p_0}^{p_1} \frac{dp}{p}, \]
where $q_i = q(p_i)$ for $i = 0, 1$. Thus, we get
\[ \log(q_1) - \log(q_0) = \epsilon \left( \log(p_1) - \log(p_0) \right). \]
Exponentiating, we arrive at the result:
\[ q_1 = q_0 \left( \frac{p_1}{p_0} \right)^{\epsilon}. \]
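
As a sanity check of the derivation, the closed form can be compared with a direct numerical integration of the differential relation dq/q = ε dp/p. This is a minimal sketch; the function names and the example numbers (a price cut from 50 to 40 with elasticity −2) are illustrative assumptions, not values from the paper.

```python
def demand_closed_form(q0, p0, p1, elasticity):
    """Constant-elasticity result: q1 = q0 * (p1 / p0) ** elasticity."""
    return q0 * (p1 / p0) ** elasticity

def demand_by_integration(q0, p0, p1, elasticity, steps=100_000):
    """Forward-Euler integration of dq = elasticity * q * dp / p from p0 to p1."""
    dp = (p1 - p0) / steps
    q, p = q0, p0
    for _ in range(steps):
        q += elasticity * q * dp / p
        p += dp
    return q

# Illustrative numbers: a 20% price cut with elasticity -2.
q_exact = demand_closed_form(100.0, 50.0, 40.0, -2.0)
q_numeric = demand_by_integration(100.0, 50.0, 40.0, -2.0)
print(q_exact, q_numeric)
```

With these inputs the closed form gives 100 · 0.8⁻² = 156.25, and the numerical integration agrees up to discretization error.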

C Elasticity-based Forecasts
A standard approach to compute demand functions uses estimated elasticities, ε (see for instance Varian (2014), Deaton and
Muellbauer (1980) or Phillips (2021)).
Here, we use estimated elasticities for different groups of articles and past observed demand in order to predict demand for a given week $t$, i.e.
\[ \hat{q}_{i,t} = q_{i,t-1} \left( \frac{1 - d_{i,t}}{1 - d_{i,t-1}} \right)^{\varepsilon}. \]
Due to hidden confounding (e.g., seasonality and advertisement campaigns), naïve regression of $\log(q_{i,\cdot})$ onto $\log(1 - d_{i,\cdot})$ generally results in biased estimates of the elasticities. To account for this, we use a two-way fixed-effects Poisson regression model that is defined as follows:
\[ \log E[q_{i,t}] = \varepsilon \log(1 - d_{i,t}) + u_i + c_t. \tag{12} \]
Here, the parameter ui is the article-specific effect and ct is the week-specific effect. This model is fitted with the standard
within estimator using the R package fixest (Bergé 2018).
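
The effect of the fixed effects can be seen in a small noise-free example. The sketch below is not the Poisson model fitted with fixest; it uses a log-linear simplification with a balanced panel, where the two-way within transformation (subtracting article and week means) removes $u_i$ and $c_t$ exactly. The elasticity `EPS`, the discount pattern, and the sizes of the effects are all hypothetical.

```python
import math

EPS = -1.5                      # hypothetical true elasticity
N_ARTICLES, N_WEEKS = 5, 10

def discount(i, t):
    # Discounts drift upward over the weeks (confounded with the week effect)
    # and additionally vary across article/week combinations.
    return 0.04 * t + 0.03 * ((i + t) % 3)

# x = log(1 - d_it); y = log demand with article effect u_i = 0.3 * i and week effect c_t = 0.2 * t.
x = [[math.log(1 - discount(i, t)) for t in range(N_WEEKS)] for i in range(N_ARTICLES)]
y = [[EPS * x[i][t] + 0.3 * i + 0.2 * t for t in range(N_WEEKS)] for i in range(N_ARTICLES)]

def flatten(m):
    return [v for row in m for v in row]

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression with intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)

def demean_two_way(m):
    """Within transformation for a balanced panel: subtract row and column means, add grand mean."""
    rows, cols = len(m), len(m[0])
    row_mean = [sum(r) / cols for r in m]
    col_mean = [sum(m[i][t] for i in range(rows)) / rows for t in range(cols)]
    grand = sum(row_mean) / rows
    return [[m[i][t] - row_mean[i] - col_mean[t] + grand for t in range(cols)]
            for i in range(rows)]

eps_naive = ols_slope(flatten(x), flatten(y))
eps_within = ols_slope(flatten(demean_two_way(x)), flatten(demean_two_way(y)))
print(f"naive: {eps_naive:.3f}, within: {eps_within:.3f}, true: {EPS}")
```

Because discounts and the week effect both grow over time in this construction, the pooled regression attributes part of the seasonal demand to the discount, while the within estimator recovers the assumed elasticity.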

D Hyperparameters and Hyperparameter Tuning


Hyperparameter tuning was carried out using the Bayesian search algorithm provided with the python package optuna (Akiba
et al. 2019). We provide a full overview of the selected hyperparameters for the TF model in Table 9 and for the DML Forecaster
in Table 10. A notable difference between TF and DML Forecaster is that we used RAdam in each DML sub-model and
AdamW for the TF model. For tuning the effect model, we trained the nuisance models with optimal hyperparameters and
saved a checkpoint of their weights, and then tuned the effect model for its set of parameters without further training of the
nuisance models.

E Simulation Study
In the following, we give more details about the generation of our synthetic data set. Note that, in the main manuscript, we keep
our presentation concise and on a high level by omitting the weighting we use for individual components and using a different
(but equivalent) parameterization of the noise model.
We simulate a total of 4467 stock keeping units over a period of 100 weeks (i.e. $t \in \{0, 1, \dots, 99\}$). Demand $q_{it}$ of article $i$ in a given week $t$ is a linear function of price $p_{it}$ and an article-specific factor $e_i$ (treatment effect) as well as a base demand $q^{(b)}_{it}$, i.e.
\[ q_{it} = q^{(b)}_{it} + p_{it} e_i. \tag{13} \]
The base demand $q^{(b)}_{it}$ is the product of an article-specific contribution $c_{it}$ and a factor built from two time-dependent components: a noisy trend $\tau_{it}$ that leads to either a linear increase or a linear decrease of demand over the course of the article life cycle, and a seasonality term $s_{it}$:
\[ q^{(b)}_{it} = (0.15 \cdot \tau_{it} + 0.25 \cdot s_{it} + 1) \cdot c_{it}. \tag{14} \]
$c_{it}$ is the article-specific contribution that consists of two sub-components $a_{it}$ and $b_{it}$:
\[ c_{it} = 0.05 \cdot a_{it}^2 + 0.25 \cdot a_{it} + 0.5 \cdot b_{it}, \tag{15} \]
where
\[ a_{it} = \alpha_{d(i)} + \epsilon_{it} \tag{16} \]
and
\[ \alpha_d \sim \mathcal{N}(10, 3^2) \tag{17} \]
is sampled once for each category $d \in \{1, 2, \dots, 45\}$. Furthermore, $\epsilon_{it}$ is independent noise drawn from $\mathcal{N}(0, 1)$ and $d(i)$ is a (random) mapping that assigns article $i$ to one of a total of 45 categories.
The contribution of $b_{it}$ is computed analogously to $a_{it}$ using a different setting of hyperparameters and a total of 15 categories:
\[ b_{it} = \beta_{k(i)} + \psi_{it}, \qquad \psi_{it} \sim \mathcal{N}(0, 5^2), \tag{18} \]
\[ \beta_k \sim \mathcal{N}(300, 50^2), \qquad k \in \{1, 2, \dots, 15\}. \tag{19} \]
The treatment effect $e_i$ in Eq. (13) depends on a random component as well as an article-specific component:
\[ e_i = e^{(b)}_i \cdot 0.15 \cdot \bar{a}_i, \qquad e^{(b)}_i \sim \max(1.3, \mathcal{LN}(0.75, 0.125^2)), \tag{20} \]
where $\bar{a}_i = \frac{1}{100} \sum_{t=0}^{99} a_{it}$.
Furthermore, we choose our initial price $p_{i0}$ such that we avoid $q_{it} < 0$ at any week $t$:
\[ p_{i0} \sim \mathcal{N}\left( \frac{\bar{q}_i}{3}, \left( \frac{\bar{q}_i}{1.5} \right)^2 \right), \tag{21} \]
where $\bar{q}_i = \frac{1}{100} \sum_{t=0}^{99} q^{(b)}_{it}$. We apply additional filtering steps to exclude articles that have negative demand from our synthetic data set. The seasonal component $s_{it}$ in Eq. (14) is a sine function with a period of 30 (weeks) and article-dependent shifts (season types) that are tied to our categorical variable $k$. In particular, we subdivide the values of $k$ evenly into six subgroups and sample season shifts for each group uniformly over all integers in the interval $[-15, 15]$.
Lastly, our trend component $\tau_{it}$ follows a (noisy) linear function, i.e.
\[ \tau_{it} \sim \mathcal{N}(t \cdot \gamma_i, \sigma_{\tau_i}^2), \tag{22} \]
where
\[ \gamma_i \sim \mathcal{U}([-0.02, 0.02]) \tag{23} \]
and
\[ \sigma_{\tau_i} \sim \mathcal{U}([0, 0.15]). \tag{24} \]
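
The base-demand recipe in Eqs. (14) to (19) and (22) to (24) can be sketched for a single article as follows. This is a rough illustration, not the released simulation code: the function name is hypothetical, the seasonality is assumed to be a unit-amplitude sine sin(2π(t + shift)/30) since the text only states the period and the shift, and the category draws are made once for the one article shown.

```python
import math
import random

def simulate_base_demand(alpha_d, beta_k, shift, rng, n_weeks=100):
    """Base demand q^(b)_it for one article, following Eqs. (14)-(16), (18), (22)-(24)."""
    gamma = rng.uniform(-0.02, 0.02)      # trend slope, Eq. (23)
    sigma_tau = rng.uniform(0.0, 0.15)    # trend noise scale, Eq. (24)
    demand = []
    for t in range(n_weeks):
        a = alpha_d + rng.gauss(0, 1)                   # Eq. (16)
        b = beta_k + rng.gauss(0, 5)                    # Eq. (18)
        c = 0.05 * a ** 2 + 0.25 * a + 0.5 * b          # Eq. (15)
        tau = rng.gauss(t * gamma, sigma_tau)           # Eq. (22)
        s = math.sin(2 * math.pi * (t + shift) / 30)    # assumed form of the seasonality
        demand.append((0.15 * tau + 0.25 * s + 1) * c)  # Eq. (14)
    return demand

rng = random.Random(0)
alpha = rng.gauss(10, 3)        # category-level draw, Eq. (17)
beta = rng.gauss(300, 50)       # category-level draw, Eq. (19)
season_shift = rng.randint(-15, 15)
base = simulate_base_demand(alpha, beta, season_shift, rng)
print(len(base), round(sum(base) / len(base), 1))
```

In the full simulation, the category-level draws `alpha` and `beta` would be shared by all articles mapped to the same category, and the season shift by all articles in the same subgroup of $k$.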
With this demand model in place we set our initial stock $z_0$ such that we clear our simulated inventory in week $t = 99$ at an average discount rate of 14%, i.e.
\[ z_0 = \frac{1}{100} \sum_{t=0}^{99} q^{(b)}_{it} \cdot (1 - 0.14) p_{i0} e_i. \tag{25} \]
Next, we compute demand for the first four weeks, i.e. $q_{it}$ for $t \in \{0, 1, 2, 3\}$, keeping the price constant ($p_{i0} = p_{i1} = p_{i2} = p_{i3}$). For all weeks $t$ we update our stock accordingly:
\[ z_{t+1} = z_t - q_{it}. \tag{26} \]
At any given week $t \in \{4, 5, \dots, 99\}$, we compute the expected number of weeks until we run out of stock ($m_t$) via a basic linear extrapolation:
\[ m_t = \frac{4 z_t}{\sum_{t_j = t-3}^{t} q_{i t_j}}, \tag{27} \]
which we can use to compute the so-called stock coverage $w_t$:
\[ w_t = \frac{m_t}{100 - t}. \tag{28} \]
A value of $w_t > 1$ implies that demand is too low to clear stock by season end, which would leave leftover stock after our period of 100 weeks. Conversely, a value of $w_t < 1$ implies that stock would run out before the end of the period.
Our pricing policy is set up in order to steer $w_t$ toward 1 for all $t \in \{4, 5, \dots, 99\}$. In particular, we define a total of six discount steps $d(j_t) = j_t \cdot 0.1$ for $j_t \in \{0, 1, \dots, 5\}$ for a given week $t$ and adjust our discount step according to the following probabilistic rule:
\[ j_t = \begin{cases} j_{t-1} + 1 & : w_t > 1,\ \lambda_{ti} > \frac{1}{w_t} \\ j_{t-1} - 1 & : w_t < 1,\ \lambda_{ti} > w_t \\ j_{t-1} & : \text{otherwise,} \end{cases} \tag{29} \]
where $\lambda_{ti} \sim \mathcal{U}([0, 1])$. We then update our price $p_{it}$ in order to compute demand $q_{it}$ via Eq. (10) as follows:
\[ p_{it} = p_{i0} \cdot (1 - d(j_t)). \tag{30} \]
We give an overview of the synthetic data set and how we derive features from it in Table 3.
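
The stock-coverage computation and the probabilistic discount rule can be sketched as below. The sketch reads the first case of Eq. (29) as λ > 1/w, and it clips the discount step to the range 0 to 5; the clipping is an assumption, since the text does not say how the boundaries are handled. All function names are hypothetical.

```python
import random

N_WEEKS = 100
MAX_STEP = 5  # six discount steps: 0%, 10%, ..., 50%

def stock_coverage(stock, last_four_demands, t):
    """Eqs. (27)-(28): expected weeks until sell-out divided by remaining weeks."""
    weeks_to_sell_out = 4 * stock / sum(last_four_demands)
    return weeks_to_sell_out / (N_WEEKS - t)

def next_discount_step(prev_step, coverage, rng):
    """Eq. (29): raise the discount when coverage is high, lower it when coverage is low."""
    lam = rng.random()
    if coverage > 1 and lam > 1 / coverage:
        step = prev_step + 1
    elif coverage < 1 and lam > coverage:
        step = prev_step - 1
    else:
        step = prev_step
    return min(MAX_STEP, max(0, step))  # boundary handling is an assumption

def discounted_price(initial_price, step):
    """Eq. (30) with d(j) = 0.1 * j."""
    return initial_price * (1 - 0.1 * step)

rng = random.Random(0)
w = stock_coverage(stock=400, last_four_demands=[10, 10, 10, 10], t=60)
step = next_discount_step(prev_step=2, coverage=w, rng=rng)
print(w, step, discounted_price(100.0, step))
```

With a coverage of exactly 1 the step is left unchanged; the further the coverage is from 1, the more likely the rule is to move the discount in the corresponding direction.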

F Details on Experiments for Real World Data


Qualitative description of our data
Our data consists of sales and other recorded properties of fashion articles that were sold at some point in the past via the retailer's online shop. We refer to a single article as a Stock Keeping Unit (SKU), and in the following, we consider the so-called config Stock Keeping Units (cSKUs) that group the same articles of different sizes. Thus, all data presented here is agnostic of article size, and we ignore effects that are the result of a cSKU being available in a limited number of sizes at some point in its life cycle.
Each cSKU comes with its associated history of weekly-aggregated observations and features. Depending on the context,
we use the shorthand cSKU to also refer to a given article’s history. We give more detail on the recorded history and derived
features available for each cSKU in Table 4.
We use one-hot encoding to compute a high-dimensional numeric vector for each categorical feature that we pass through an
embedding layer to obtain a low-dimensional representation. Similarly, we use embeddings for our ordinal features (Isoweek
number, Days from January first, and Days from Easter). Note that all embedding layers are an integral part of the neural networks we present here. Thus, during training, we update the parameters of the embedding layers as part of the same gradient update that we use to optimize the remaining weights of each model.

| Feature | Data type | Notes |
|---|---|---|
| Dynamic Features | | |
| Demand | Integer ≥ 0 | simulated demand |
| Discount | Float | range between 0 and 0.5 |
| Stock | Integer ≥ 0 | available stock |
| Week number | Integer ≥ 0 | week number (embedded) |
| Positional Encoding | Float (x17) | positional encoding dimensions |
| Static Features | | |
| d | Categorical | embedded via a learned embedding |
| k | Categorical | embedded via a learned embedding |
| Promotion | Binary | noise: not having an effect on demand |
| p0 | Integer | undiscounted price of the article |

Table 3: Overview of synthetic dataset and its usage with the DML Forecaster and TF

| Feature | Data type | Notes |
|---|---|---|
| Dynamic Features | | |
| Sold Items | Integer ≥ 0 | sold items before return |
| Discount | Float | range between 0 and 0.7 |
| Stock | Integer ≥ 0 | available stock for a given cSKU |
| Week number | Integer ≥ 0 | ISO calendar week number (embedded) |
| Day in year | Integer ≥ 0 | number of days from January 1st (embedded) |
| Days from Easter | Integer ≥ 0 | number of days from Easter (embedded) |
| Positional Encoding | Float (x17) | positional encoding dimensions |
| Static Features | | |
| Brand | Categorical | embedded via a learned embedding |
| Commodity group | Categorical (x5) | hierarchical category groups (embedded) |
| Season type | Categorical | season type of article (embedded) |
| Black price | Integer | undiscounted price of the article |

Table 4: Overview of real-world dataset and its usage with the DML Forecaster and TF

| Characteristic | Synthetic | Cyber week 2019 | Cyber week 2020 | Cyber week 2022 |
|---|---|---|---|---|
| No. time series | 4,467 | 144,980 | 208,212 | 410,500 |
| Time granularity | weekly | weekly | weekly | weekly |
| Avg. length of time series | 100 | 104 | 100 | 75 |
| Forecast horizon length | 5 | 2 | 2 | 2 |

Table 5: High-level characteristics of data sets.

Figure 5: A sample of the difference in forecasting error for the TF vs. DML Forecaster on cyber week 2022, measured on the off-policy experiment.

We treat discount as a continuous variable, even though inventory managers typically reduce prices by increments of five percent relative to some baseline (black price). In practice, we need a higher resolution as discounts are recorded as weekly averages. Depending on the time a discount is updated, this can result in rather arbitrary decimals. For instance, if the discount for a given fashion item is increased from 20% to 25% in the middle of a given week, we would record an aggregated discount level of 22.5%.
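
The weekly aggregation in this example amounts to a time-weighted average. A minimal sketch, assuming discount changes are given as (start day, discount) pairs within one week; the function name and input format are hypothetical:

```python
def weekly_average_discount(changes, days_in_week=7.0):
    """Time-weighted average discount over one week.

    `changes` is a list of (start_day, discount) pairs sorted by start_day,
    with the first entry starting at day 0.
    """
    total = 0.0
    for idx, (start, disc) in enumerate(changes):
        # Each discount level holds until the next change or the end of the week.
        end = changes[idx + 1][0] if idx + 1 < len(changes) else days_in_week
        total += (end - start) * disc
    return total / days_in_week

# Discount raised from 20% to 25% in the middle of the week.
avg = weekly_average_discount([(0.0, 0.20), (3.5, 0.25)])
print(round(avg, 4))  # 0.225
```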

Further Results
The main cyber-week results are given in Section 4 with the comparison to the control dates given in Table 6. On the 2022 and
2020 cyber weeks, the DML Forecaster performed substantially better across all metrics when compared to the TF model. As
we expected, the TF performed mildly better than the DML Forecaster on policy for the cyber weeks. On the 2019 cyber week,
the DML Forecaster and TF models performed similarly in the off policy test, and the TF was once again better on policy.
To understand the magnitude of the results with regard to the experiment design, we look at Table 6. Indeed, the degree of
change in the error metrics for the TF is not drastic when moving from on policy to off policy for each specific date, when
compared to the 2022 and 2020 cyber week dates. The degree of change in the error metrics for the DML Forecaster when
considering the on-to-off policy shift is similar to the cyber week dates.

Further Ablation Study on Real World Data


There are two key elements of the DML Forecaster which are important in avoiding bias (see the Introduction of Chernozhukov et al. (2018) for a deeper discussion). The cross-fitting method allows $\sqrt{n}$-consistency in the linear effect case and prevents the effect model from overfitting. Thus, we can ask if this cross-fitting is necessary in our case: We run the DML Forecaster without cross-fitting and compare to the original, on the start date of 21-11-2022, for both in and out of sample performance. More precisely, we use the even batches with the even nuisance models to provide inference for effect model training, and similarly for the odd version.

| Metric | Target Date | Off policy: DML | Off policy: Baseline | On policy: DML | On policy: Baseline |
|---|---|---|---|---|---|
| Demand Error | 25-04-2022 | 51.03 | 57.84 | 46.55 | 48.59 |
| Demand Error | 06-06-2022 | 53.76 | 58.61 | 47.84 | 45.61 |
| Demand Error | 10-10-2022 | 52.48 | 56.18 | 51.97 | 51.31 |
| MAE | 25-04-2022 | 8.935 | 10.15 | 8.065 | 8.478 |
| MAE | 06-06-2022 | 7.274 | 8.081 | 6.361 | 6.306 |
| MAE | 10-10-2022 | 5.183 | 5.616 | 5.078 | 5.042 |
| MSE | 25-04-2022 | 1772 | 2144 | 1555 | 1861 |
| MSE | 06-06-2022 | 1009 | 1319 | 874.0 | 1007 |
| MSE | 10-10-2022 | 510.9 | 628.0 | 454.9 | 507.6 |

Table 6: Table of metrics for control dates. We consider baseline and DML Forecaster for both on policy and off policy evaluation. All models were trained with an L1 loss function. Metrics read from the output of the test epoch.

| Metric | Target Date | Off policy: Cf | Off policy: Ss | On policy: Cf | On policy: Ss | Train time: Cf | Train time: Ss |
|---|---|---|---|---|---|---|---|
| Demand Error | 21-11-2022 | 61.48 | 62.46 | 60.00 | 58.41 | 2.55 | 1.63 |
| Demand Error | 23-11-2020 | 65.94 | 68.67 | 62.50 | 64.24 | 1.17 | 0.71 |
| Demand Error | 25-11-2019 | 63.61 | 67.52 | 61.85 | 61.60 | 0.83 | 0.52 |
| Demand Error | 25-04-2022 | | | 46.55 | 46.28 | 1.96 | 1.27 |
| Demand Error | 06-06-2022 | | | 47.84 | 50.26 | 2.11 | 1.36 |
| Demand Error | 10-10-2022 | | | 51.97 | 52.66 | 2.44 | 1.61 |

Table 7: Demand error for the cross-fitting DML Forecaster compared to the sample-splitting DML Forecaster on the cyber week and control week target dates. Cf = Cross-fitting, Ss = Sample-splitting. Training was done on an AWS Sagemaker instance of type G4dn.4xlarge. Training time is in hours.

As a second experiment, there is a “simplified” version of DML Forecaster (sDML) that does not orthogonalize the treatment function, see for example (Chernozhukov et al. 2018, Equation 1.3). Equation 6 simply becomes
\[ \hat{q}_t = \tilde{q}_t \, (1 - d_t)^{\psi(X)}, \qquad 1 \le t \le s, \tag{31} \]
where $d_t$ is the desired discount. We test this model on the same start date of 21-11-2022.
As a final experiment, we test the sDML model without cross-fitting.
The goal is that this ablation study will help explain the mechanism by which DML Forecaster improves upon the TF for
out-of-sample forecasting.
There was no significant difference between cross-fitting and no cross-fitting for either type of DML Forecaster. The sDML model performed worse on- and off-policy for Demand Error and MAE, regardless of cross-fitting. This is explained by the difference in error between the final output and the outcome model. The effect model in the regular DML Forecaster corrected the Demand Error by 6.65 off-policy and by 2.57 on-policy relative to the Demand Error of the outcome model. For the sDML model, however, this correction was significantly smaller off-policy, and the effect model even increased the error on-policy. See Table 8 for the results.

| Metric | Model type | Off policy | On policy | Effect error corr. (off policy) | Effect error corr. (on policy) |
|---|---|---|---|---|---|
| Demand Error | DML | **61.48** | **60.00** | **-6.65** | **-2.57** |
| Demand Error | DML-no cf | 62.7 | 60.92 | -5.43 | -1.95 |
| Demand Error | sDML | 64.65 | 64.34 | -3.48 | +1.77 |
| Demand Error | sDML-no cf | 65.96 | 65.08 | -2.17 | +2.51 |
| MAE | DML | **7.739** | **7.606** | | |
| MAE | DML-no cf | 7.883 | 7.699 | | |
| MAE | sDML | 8.144 | 8.172 | | |
| MAE | sDML-no cf | 8.299 | 8.279 | | |
| MSE | DML | **2047** | 1903 | | |
| MSE | DML-no cf | 2065 | 1906 | | |
| MSE | sDML | 2067 | **1899** | | |
| MSE | sDML-no cf | 2091 | 1904 | | |

Table 8: Metrics for ablation study for the target date of 21-11-2022. DML-no cf is DML with no cross-fitting, sDML is the simplified DML without the discount residual. Best metrics for each block are bold. The effect error correction column is the difference in error between the outcome model and the output of the effect model (negative numbers indicate a reduction in error from the outcome model). Metrics read from the output of the test epoch.

G Details on hardware and library versions used

We performed all experiments using either Amazon Web Services (AWS) ml.g5.12xlarge instances (simulation study) or g4dn.4xlarge instances (experiments on cyber-week data). The experiments were conducted with PyTorch 2.0.0 and Python 3.10 installed.

H Source code
We provide the source code for creating the synthetic data set (sim sub 1.zip) as well as the source code to our models and
their application to simulated data (dml-on-synthetic-data.zip) as part of this submission. We refer to the Readme.md in each
package for further information.
| Baseline Parameter | Cyberweek Data | Simulated Data | Notes |
|---|---|---|---|
| Number of layers | 6 | 13 | number of residual blocks in the encoder and decoder |
| Hidden dimension | 274 | 51 | number of hidden dims in ffn after each attention step |
| Head hidden dimension | 164 | 62 | in the network that computes the normalized demand slopes |
| Dropout | 0.41 | 0.43 | in all ffns and attention weights |
| Learning rate | 0.0034 | 0.0224 | for the AdamW optimizer |
| Weight decay | 0.037 | $1.4 \times 10^{-4}$ | regularization parameter |
| Beta 1 | 0.8044 | 0.8566 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.6023 | 0.9140 | decay rate for computing moving average of squared gradient in the Adam optimizer |
| Number of linear pieces | 2 | 1 | number of pieces for the demand curve |
| Train epochs | 2 | 23 | |
| Total parameters | 1.3M | 124K | |

Table 9: Hyperparameters for the TF with production data (cyberweek) and simulated data

| DML Parameter | Cyberweek Data | Simulated Data | Notes |
|---|---|---|---|
| Outcome model | | | |
| Number of layers | 3 | 5 | number of residual blocks in encoder and decoder |
| Hidden dimension | 130 | 73 | number of hidden dims in ffn after each attention step |
| Dropout | 0.2 | 0.1 | in all ffns and attention weights |
| Learning rate | 0.0096 | 0.0088 | for the RAdam optimizer |
| Weight decay | $1.4 \times 10^{-4}$ | $5.4619 \times 10^{-9}$ | regularization parameter |
| Beta 1 | 0.7096 | 0.8566 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.8585 | 0.9140 | decay rate for computing moving average of squared gradient in the Adam optimizer |
| Gamma | 0.9993 | 0.9388 | exponential decay of learning rate |
| Train epochs | 1 | 44 | |
| Treatment model | | | |
| Number of layers | 2 | 2 | number of residual blocks in encoder and decoder |
| Hidden dimension | 160 | 21 | number of hidden dims in ffn after each attention step |
| Dropout | 0.2 | 0.1497 | in all ffns and attention weights |
| Learning rate | $4.2 \times 10^{-4}$ | 0.0162 | for the RAdam optimizer |
| Weight decay | $5.8 \times 10^{-4}$ | $2.6 \times 10^{-9}$ | regularization parameter |
| Beta 1 | 0.7884 | 0.5977 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.9536 | 0.8740 | decay rate for computing moving average of squared gradient in the Adam optimizer |
| Gamma | 0.9995 | 0.9549 | exponential decay of learning rate |
| Train epochs | 1 | 60 | |
| Effect model | | | |
| Number of layers | 5 | 6 | number of residual blocks in encoder and decoder |
| Hidden dimension | 273 | 43 | number of hidden dims in ffn after each attention step |
| Dropout | 0.412 | 0.1750 | in all ffns and attention weights |
| Learning rate | $1.07 \times 10^{-8}$ | 0.0491 | for the RAdam optimizer |
| Weight decay | 0.0375 | $5.75 \times 10^{-9}$ | regularization parameter |
| Beta 1 | 0.6 | 0.6405 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.7044 | 0.6749 | decay rate for computing moving average of squared gradient in the Adam optimizer |
| Train epochs | 1 | 20 | |
| Total parameters | 1.2M | 118K | |

Table 10: Hyperparameters for the DML Forecaster with production data (cyberweek) and simulated data
