Causal Forecasting For Pricing
Douglas Schultz¹, Johannes Stephan¹, Julian Sieber¹, Trudie Yeh¹, Manuel Kunz¹, Patrick Doupe¹, and Tim Januschowski¹
¹ Zalando SE
{douglas.schultz, johannes.stephan, julian.sieber, trudie.yeh, manuel.kunz, patrick.doupe, tim.januschowski}@zalando.de
January 24, 2024
series model, without explicitly modeling the effect of interventions.
\[ \mathbb{E}\left[ q_{i,t+1:t+h} \mid d_{t+1:t+h},\, q_{0:t},\, d_{0:t},\, z_{0:t+h} \right]. \tag{2} \]
To model these interventions we often assume conditional ignorability, positivity and consistency (Hernán and Robins 2010; Chernozhukov et al. 2017; Cunningham 2021). In this work we do not assume these, as we are interested in improving our forecasts, not estimating treatment effects. For instance, given the dynamic patterns in the data we might not adjust fully for all confounders and not meet conditional ignorability. Meeting these assumptions will result in unbiased treatment effect estimates and improve estimates.
A standard approach to estimate the effect of an intervention is via DML, which we introduce briefly. While DML is typically used to estimate binary or discrete treatment effects (Chernozhukov et al. 2017), we take ideas from DML for estimating the effect of a continuous treatment variable: weekly average discount, with an outcome of demand. As in Chernozhukov et al. (2017), we introduce DML using a partial linear model:
\[ \begin{aligned} q &= d\theta + g(z) + u, & \mathbb{E}[u \mid z, d] &= 0, \\ d &= m(z) + v, & \mathbb{E}[v \mid z] &= 0. \end{aligned} \tag{3} \]
Here our target q (demand) depends on the control input d (discounts), effects of the environment z and independent noise u. θ is the linear effect of d on our target q, and thus the causal parameter of interest. The effect of z on q is passed through the function g that can adopt any shape. Furthermore, the treatment d is affected by our environment z via m as well as some independent random component v.
DML proceeds in two stages: the nuisance stage and the effect stage. The nuisance stage includes two nuisance models which predict treatment (discount) and outcome (demand), where the latter is computed without using future discounts as input. The ground truth treatment and outcome are then residualized using the predictions of these nuisance models and passed on to the effect model in order to compute a treatment effect. The final output is then the output of the effect model taken together with the output of the outcome model and the desired treatment. Typically, all three of these models are trained separately with separate losses. Note that the training of the effect model uses the output of the nuisance models and therefore requires special treatment.
The benefit of orthogonalization is that we account for regularization bias (Chernozhukov et al. 2017), which affects S-Learners² like (2). In the standard practice, discounts are treated like any other independent variable and thus regularized/shrunk in order to improve predictions. This regularization biases estimates of the causal effect between discounts and demand (Chernozhukov et al. 2017).
² S for Single learner. We can calculate treatment effects by augmenting the treatment feature and subtracting, e.g. E[q|d = 0.5] − E[q|d = 0.4]. This is otherwise known as G-computation in epidemiology and other fields.
3 DML Forecaster
Our approach for a causal forecaster follows the DML approach and hence consists of three submodels that we introduce in the following. Fig. 1 depicts the high-level architecture.
Figure 1: The architecture of the DML Forecaster.
The Nuisance Models. Each of the two nuisance³ models provides an estimate, q̃ and d̃, of q and d respectively, given z. We call the model that provides q̃ the outcome model and the model that provides d̃ the treatment model. Here, we choose standard transformer-based forecasting models (Vaswani et al. 2017) for their robustness and proven performance in an online retail setting (Eisenach, Patel, and Madeka 2020; Rasul et al. 2021; Zhou et al. 2021a).
³ We use this term to remain consistent with the causal inference literature; however, the outcome nuisance model is of primary interest for our use case.
The outcome and treatment prediction models have the same architecture and only differ in target and final activation. We use softplus as the final activation function in the outcome model to enforce positivity. For our treatment model, we do not pass the (linear) combination learned by the last layer through a (non-linear) activation function. As we will see from the functional form of the effect model head, this is helpful given the multiplicative nature of the effect model. Around each attention step, there is a residual connection, and after each attention step there is a position-wise feed forward network with layer normalization and dropout. We use an L1 loss for training our nuisance models on real-world data and an L2 loss when fitting our synthetic data set. Other choices of losses are possible and our approach readily extends to these, in particular for probabilistic scenarios (Gneiting, Balabdaoui, and Raftery 2007).
The Effect Model. The effect model combines the treatment and outcome models to provide the final estimate of demand q in our DML Forecaster, which we denote as q̂. We now show how our model estimates the price elasticity of demand
\[ \epsilon := \frac{\Delta q}{q} \cdot \frac{p}{\Delta p}, \tag{4} \]
where q is demand of an article and p is the price. If we assume mild integrability conditions, then basic integration gives us
\[ q_1 = q_0 \left( \frac{p_1}{p_0} \right)^{\epsilon}, \tag{5} \]
where q_i is the demand at price p_i (see Appendix B for details).
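The integration step behind Eq. (5) can be sketched as follows. This is the standard constant-elasticity argument, obtained by taking the infinitesimal limit of Eq. (4) and assuming that ϵ does not change along the price path from p_0 to p_1; it is offered as a reading aid and is not necessarily the exact derivation given in Appendix B.

```latex
\begin{aligned}
\epsilon = \frac{dq}{q}\cdot\frac{p}{dp}
  &\;\Longrightarrow\; \frac{dq}{q} = \epsilon\,\frac{dp}{p} \\
  &\;\Longrightarrow\; \int_{q_0}^{q_1}\frac{dq}{q} = \epsilon\int_{p_0}^{p_1}\frac{dp}{p} \\
  &\;\Longrightarrow\; \ln\frac{q_1}{q_0} = \epsilon\,\ln\frac{p_1}{p_0} \\
  &\;\Longrightarrow\; q_1 = q_0\left(\frac{p_1}{p_0}\right)^{\epsilon}.
\end{aligned}
```

As an illustrative (invented) example, with ϵ = −2 a 20% price cut (p_1/p_0 = 0.8) gives q_1 = q_0 · 0.8^{−2} ≈ 1.56 · q_0.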
Our idea is to parameterize ϵ by a neural network. Given retail price x, we can write the discounted price as x · (1 − d_t), where d_t is the discount at time t. Furthermore, we assume that the forecast of the outcome model q̃_t is an estimate of the sales at the price level predicted by the discount model, x · (1 − d̃_t). Substituting these into Eq. (5), we can compute our final demand estimate q̂ at time t as
\[ \hat{q}_t = \tilde{q}_t \left( \frac{1 - d_t}{1 - \tilde{d}_t} \right)^{\psi(z)}, \quad 1 \le t \le s, \tag{6} \]
where ψ is a transformer model whose output is the elasticity ϵ in (5) and s is the length of the forecast horizon. Note that while ϵ is assumed to be constant here, it is still parameterized over z, so it can vary by the features used in estimation. Our model to parameterize ϵ is similar to the nuisance models and only lacks the decoder self-attention, as we expect elasticity to be relatively constant within the forecast horizon. The outcome model accounts for the auto-regressive part of each time series. We use a negative softplus as final activation as we expect elasticity to be negative (Varian 2014, Chapter 15) and an L1 loss for training.
For training the nuisance and effect models, we deploy a two-stage training process, where we fit the nuisance models in the first stage and the effect model in the second stage. The first-stage nuisance models generate estimates for the second-stage effect model.
To avoid overfitting, we deploy two-fold cross-fitting during training in a similar manner to Chernozhukov et al. (2017, Section 3). We have an even and odd copy of each nuisance model, each of which is trained on one half of the data set. We use nuisance models trained on odd data to infer outcomes for even data, and vice versa. This data is used to train a single effect model.
The splitting of the data set into even and odd parts is done according to the index of the item i. In the particular instance of demand forecasting, we can derive an index from article information such that articles of the same size are guaranteed to be either even or odd indexed while still having a (close to) random split between different articles.
Inference with the DML Forecaster. Once the model is trained, we need to infer future outcomes for different discount levels. We combine two methods here, one influenced by the above cross-fitting procedure and one influenced by standard forecasting methods. We ensemble these two methods with a geometric mean, where cf indicates cross fitting and f indicates forecasting:
\[ \hat{q}_t = \sqrt{ \tilde{q}^{\,cf}_t \left( \frac{1 - d_t}{1 - \tilde{d}^{\,cf}_t} \right)^{\psi(z)} \cdot \tilde{q}^{\,f}_t \left( \frac{1 - d_t}{1 - \tilde{d}^{\,f}_t} \right)^{\psi(z)} } \tag{7} \]
In the cross-fitting procedure we pass the odd (even) batches to the even (odd) nuisance models, and then receive an inference from the effect model. We do this to account for potentially overfit models.
The standard forecasting practice is to use the model trained on old data to infer future outcomes. To implement this we pass even (odd) batches to the even (odd) nuisance models, and then pass the output to the effect model. This has the advantage of forecasting ahead on items that the nuisance models have seen during training.
Discussion: Departures from the DML Literature
We are interested in forecasting demand levels for different discount rates. The DML literature is interested in estimating changes in demand levels for changes in discount rates. First, although we use cross fitting to train our model, at inference we depart from this for improved forecast performance. Second, we use a single effect model, instead of separate, averaged treatment effect estimations on each half of the dataset. Third, we use an outcome model that reflects our understanding of the problem space, and not one justified for treatment effect estimation.⁴
⁴ We could have assumed an outcome function which depends neither non-linearly nor log-linearly on treatment, and used a learned weighted sum of the raw output of the effect transformer as the final output, with the small modification of also providing the nuisance outputs and true discounts to the effect transformer. Such a model showed similar metrics in preliminary experiments.
4 Experiments
In this section, we present experimental results of the DML Forecaster in a fully controlled setting with synthetic data and on real-world data. We start by discussing practical details around the DML Forecaster.
Baseline Models and Accuracy Metrics
We compare the DML Forecaster to the following models:
• Naïvely-causal Transformer (TF): A time-series transformer architecture with a special output head that models price elasticity more generally than (5) via a piecewise linear, monotone function (Kunz et al. 2023b).
• SARIMAX: A vanilla seasonal ARIMA model with exogenous covariates. In cases where the training length was less than 30, or the model fitting process failed, we use the previous week’s value as a fallback. For our experiments, we use Darts 0.21.0 (Herzen et al. 2022). Covariates such as stock and discount variables from previous time steps were included, and preprocessing involved log transformation and forward filling of missing values in demand, stock (in z), and discounts.
• TWFE elasticities: A standard econometric baseline via a causally informed, elasticity-based forecast using a two-way fixed-effect Poisson regression model (Bergé 2018). Appendix C contains more details.
• sDML: As part of our ablation study, this model implements the DML Forecaster (see Section 3) without the nuisance model for predicting the treatment. Instead the treatment is provided directly to the effect model without residualization.
• No Cross Fitting: Cross fitting is applied to the DML Forecaster as described in Section 3. For our ablation study, we create variants of sDML and DML models without cross fitting (sDML-no cf and DML-no cf).
We have chosen these models to represent the variety of approaches typically deployed for such problems (Januschowski et al. 2020): (i) local forecasting models
(SARIMAX), (ii) econometric approaches (TWFE) and (iii) global, transformer-based forecasting methods.
For the accuracy metrics, we use the standard metrics mean absolute error (MAE) and mean squared error (MSE) (Hyndman and Athanasopoulos 2017), and the so-called demand error, a metric that captures the down-stream pricing dependency (see Kunz et al. (2023a)):
\[ D_{T,h} = \sqrt{ \frac{ \sum_i \sum_{T=t+1}^{t+h} b_i \left( \hat{q}_{i,T} - q_{i,T} \right)^2 }{ \sum_i \sum_{T=t+1}^{t+h} b_i \, q_{i,T}^2 } }. \tag{8} \]
Here, t is the last timepoint in the training set, h is the forecast horizon, q̂_{i,T} is the prediction for article i at timepoint T, q_{i,T} is the corresponding true demand and b_i is the recommended retail price of article i.
Hyperparameter Tuning
The following provides an overview of how we select the hyperparameters. More details are in Appendix D.
Synthetic Dataset. We use Bayesian optimization (Akiba et al. 2019) to tune key hyperparameters of the DML Forecaster and TF. To mimic a realistic tuning, we use the data of the first 50 weeks of our simulated data, keeping weeks 46-50 as a hold-out set to select the best hyperparameters, and thus use the first 45 weeks for training. In the case of DML-no cf, we reuse the same hyperparameters found for the DML Forecaster. For sDML and sDML-no cf, we only need to re-tune the effect model, as the nuisance outcome model is used the same way as in our DML Forecaster.
Real-World Data. Both nuisance models have an input dimension of 66, with multiple attention layers in encoder and decoder, and 22 attention heads. The batch size for all nuisance models is 1200 time series windows, and each had a learning rate scheduler of the form lr_n ↦ exp(α) · lr_n =: lr_{n+1}, where lr_n is the learning rate in the nth training step. For the effect model we use twice the batch size of the nuisance models (2400), which is due to the cross-fitting procedure (see Section 3). Moreover, we use a simple learning rate scheduler of the form
\[ \mathrm{lr} \mapsto \frac{\mathrm{lr}}{\sqrt{n+1}} =: \mathrm{lr}_n \]
where lr_n is the learning rate after the nth training step and lr is the initial learning rate.
Experiments on Synthetic Data
We start by providing a high-level overview of the construction of synthetic data to evaluate our approach in a controlled setting (see Appendix E for further details on the data generating process).
We simulate entire life cycles (100 weeks, typical in the online fashion industry) of around 4500 stock keeping units. Demand in a given week t of article i, q_{i,t}, is a linear function of price p_{i,t} and an article-specific factor e_i (treatment effect) as well as a base demand q^{(b)}_{it}, i.e.
\[ q_{it} = q^{(b)}_{it} + p_{it} e_i \tag{9} \]
Here treatment effects e_i are article dependent, but constant over time. Note however that elasticity will not be constant over time:
\[ \epsilon_{i,t} = \frac{e_i \, p_{it}}{q^{(b)}_{it} + p_{it} e_i}. \]
The base demand q^{(b)}_{it} is the product of two time dependent components: a noisy trend τ_{it} that either leads to a linear increase/decrease of demand over the course of the article life cycle, and a seasonality term s_{it}:
\[ q^{(b)}_{it} = (\tau_{it} \cdot s_{it} + 1) \cdot (c_i \lambda_{it} + \eta_{it}). \tag{10} \]
The seasonality has a period of 30 weeks with an article-dependent phase shift in order to simulate different season types. In addition, we scale our time-dependent component with an article-specific factor c_i as well as independent additive and multiplicative noise (η_{it} and λ_{it} respectively). Note that, because of the product form in (10), our simulated noise is scale dependent on the base demand. Given our recipe to generate demand, we initialize the simulation for each article i at week t = 0 by setting an initial stock and price p_{i1}. Our goal is to clear the given stock at t = 99, the season end. We therefore simulate a pricing policy that, at any given week t > 3, computes the average demand over the past four weeks (t − 1, t − 2, . . . , t − 4). We then use this estimate to predict the week number at which the given article i will run out of stock by mere linear extrapolation. If we estimate to clear our stock after t = 99, we decrease our price by 10% w.r.t. our base price p_{i0} in order to set p_{it}. Conversely, if we expect to clear stock before season end, we increase p_{it} by 10%.⁵
⁵ We will open-source the data and data generation process (implemented in (Alexandrov et al. 2020)) as part of the publication.
Figure 2: Synthetic demand time series (black), the associated realized discount (green) and off-policy forecasts for DML Forecaster (blue) as well as TF (cyan). Panels: no-discount scenario (left) and full-discount scenario (right); axes: week (x), demand and discount in % (y).
Importantly, using such a pricing strategy, treatment is confounded by the long-term seasonal pattern of simulated demand (see example time series in Fig. 2). This leads to higher article discounts when the seasonal component of the simulation is low (Fig. 2, left panel) and lower discounts when seasonal demand is high (Fig. 2, right panel). We chose a total of four different periods for training: weeks 20-65, 30-75, 40-85, as well as weeks 50-95 and evaluate alternative methods on the five weeks that follow each training interval (weeks 66-70, 76-80, 86-90 and weeks 96-100 respectively). The evaluation consists of two parts: on-policy evaluation, where we predict demand under the pricing policy used in the simulation, as well as off-policy evaluation,
| Model type | MAE (off policy) | MAE (on policy) | MSE (off policy) | MSE (on policy) | MAE effect | MSE effect |
|---|---|---|---|---|---|---|
| TF | 16.3±0.5 | 11.5±0.4 | 745.7±38.6 | 490.6±19.4 | 45.8±1.0 | 3350.4±164.6 |
| DML | 12.4±0.7 | 10.0±0.7 | 658.6±40.6 | 472.9±33.9 | 25.0±1.7 | 1743.9±187.7 |
| DML-no cf | 12.4±0.7 | 10.1±0.7 | 663.2±49.0 | 473.6±33.4 | 22.9±2.7 | 1458.2±212.9 |
| sDML | 20.5±0.5 | 11.0±0.7 | 922.3±34.7 | 501.8±36.0 | 89.1±0.7 | 10356.5±251.9 |
| sDML-no cf | 20.5±0.6 | 11.0±0.7 | 919.4±37.2 | 499.8±35.7 | 89.5±1.1 | 10424.0±219.8 |
Table 1: Error metrics predicting out-of-sample demand in study of 4500 simulated articles. See text for further details.
where we predict demand under five alternative discount levels that range from 0-50% discount (w.r.t. our initial price p_{i0}). We repeat training and inference of all models five times to compute empirical standard deviations. In Fig. 2 we show off-policy predictions of the DML Forecaster and TF when applying 0% discount to weeks 65-70 (left panel) and the full discount (50%) respectively (right panel).
In addition to computing the standard metrics MAE and MSE on on- and off-policy ground truth, we also report how accurately our methods predict the treatment effect (MAE effect and MSE effect), as this parameter is directly modeled and inferred by each of the alternative models.
Model adaptations for this simulation study. Because we deviate from the real-world constant elasticity assumption, we adapt the head of the effect model as introduced in Eq. (6) accordingly, i.e. our final output is computed as
\[ \hat{q}_t = \tilde{q}_t + \psi(z) \cdot (d_t - \tilde{d}_t) \tag{11} \]
where q̃_t, d_t, and d̃_t are defined as in Eq. (6).
We change TF accordingly, i.e. the head is computed as in Eq. (11), but we set d̃_t = 0.
Results. We find that our DML Forecaster (DML) consistently outperforms TF when it comes to predicting demand under off-policy price changes (see Table 1). On policy, the difference between both models is not significant⁶. Further-
⁶ W.r.t. computed empirical standard deviations.
Cyberweek: Off-policy Discount Increase
One way to test the price response of the models considers certain time periods where the discount policy follows a shifted distribution. In particular, cyber week is such a yearly event when many articles have discounts that are much higher than normally seen during the year. For example, in Fig. 4 we look at the difference in discounts in cyber week 2022 versus two weeks prior at the article level and we see a general right-ward shift, which indicates the general increase in discounting on this special week. Naturally, similar discount ranges occur during the same week in years prior, so it would not be an interventional test if each model saw these discount-time distributions in training. In order to test our hypothesis, we therefore discard cyber week, cyber week −1, and cyber week +1 from our training data and replace them with a set of 3 consecutive weeks that are resampled from the same article. We refer to data sets with discarded and replaced weeks as off-policy and to data sets without this replacement as on-policy.
We validate each model on 2021 cyber week, both on- and off-policy, and test each model on 2022, 2020, and 2019 cyber weeks on- and off-policy. Each experiment has a forecast horizon of cyber week and cyber week +1, while training on two years of article histories up until cyber week −1. The number of articles at inference time was 410,500 for 2022, 208,212 for 2020, and 144,980 for 2019. We give an in-depth qualitative description of our data in Appendix F.
| Target Date | Metric | Off policy: DML Forecaster | Off policy: TF | Off policy: SARIMAX | On policy: DML Forecaster | On policy: TF | On policy: SARIMAX |
|---|---|---|---|---|---|---|---|
| 21-11-2022 | Demand Error | 61.48 | 80.03 | 81.33 | 60.00 | 54.73 | 81.08 |
| 23-11-2020 | Demand Error | 65.94 | 88.05 | 78.98 | 62.50 | 57.37 | 78.75 |
| 25-11-2019 | Demand Error | 63.61 | 63.18 | 73.03 | 61.85 | 57.29 | 69.59 |
| 21-11-2022 | MAE | 7.739 | 10.14 | 9.99 | 7.606 | 6.931 | 9.96 |
| 23-11-2020 | MAE | 12.92 | 17.39 | 14.82 | 12.39 | 11.34 | 14.78 |
| 25-11-2019 | MAE | 12.68 | 12.52 | 14.31 | 12.19 | 11.40 | 13.94 |
| 21-11-2022 | MSE | 2047 | 2540 | 2225 | 1903 | 1940 | 2196 |
| 23-11-2020 | MSE | 5018 | 7361 | 4891 | 5092 | 5032 | 5062 |
| 25-11-2019 | MSE | 5075 | 5630 | 4348 | 5446 | 5448 | 4472 |
Table 2: Table of metrics for experiment dates considering TF and DML Forecaster for both off and on policy evaluation for cyberweek. All models were trained with an L1 loss function. Metrics read from the test epoch output.
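The demand error of Eq. (8), which is one of the metrics reported in Table 2, can be sketched in a few lines of plain Python. The function name `demand_error` and the nested-list layout (one row of horizon values per article) are assumptions made for illustration, and the toy numbers are invented rather than taken from the experiments.

```python
import math
from typing import Sequence


def demand_error(
    prices: Sequence[float],
    forecasts: Sequence[Sequence[float]],
    actuals: Sequence[Sequence[float]],
) -> float:
    """Price-weighted relative root squared error over articles and horizon (Eq. 8).

    prices[i]     -- recommended retail price b_i of article i
    forecasts[i]  -- forecast demand of article i for each week of the horizon
    actuals[i]    -- true demand of article i for the same weeks
    """
    numerator = 0.0
    denominator = 0.0
    for b, forecast_row, actual_row in zip(prices, forecasts, actuals):
        for q_hat, q in zip(forecast_row, actual_row):
            numerator += b * (q_hat - q) ** 2
            denominator += b * q ** 2
    if denominator == 0.0:
        raise ValueError("true demand is zero everywhere; metric is undefined")
    return math.sqrt(numerator / denominator)


# Toy example with two articles and a two-week horizon (invented numbers).
toy_error = demand_error(
    prices=[10.0, 20.0],
    forecasts=[[1.0, 2.0], [3.0, 4.0]],
    actuals=[[1.0, 1.0], [3.0, 5.0]],
)
```

Because both sums are weighted by b_i, errors on expensive articles contribute more, which is how the metric reflects the down-stream pricing dependency mentioned in the text.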
C Elasticity-based Forecasts
A standard approach to compute demand functions uses estimated elasticities, ε (see for instance Varian (2014), Deaton and Muellbauer (1980) or Phillips (2021)).
Here, we use estimated elasticities for different groups of articles and past observed demand in order to predict demand for a given week t, i.e.
\[ \hat{q}_{i,t} = q_{i,t-1} \left( \frac{1 - d_{i,t}}{1 - d_{i,t-1}} \right)^{\varepsilon}. \]
Due to hidden confounding (e.g., seasonality and advertisement campaigns), naïve regression of log(q_{i,·}) onto log(1 − d_{i,·}) generally results in biased estimates of the elasticities. To account for this, we use a two-way fixed-effects Poisson regression model that is defined as follows:
\[ \log \mathbb{E}[q_{i,t}] = \varepsilon \log(1 - d_{i,t}) + u_i + c_t. \tag{12} \]
Here, the parameter u_i is the article-specific effect and c_t is the week-specific effect. This model is fitted with the standard within estimator using the R package fixest (Bergé 2018).
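Once an elasticity has been estimated, the one-step forecast rule of this appendix is a single expression. The sketch below assumes the elasticity is already available (fitting the Poisson model with fixest is not reproduced here); the function name `elasticity_forecast` and the example numbers are hypothetical.

```python
def elasticity_forecast(
    prev_demand: float,
    prev_discount: float,
    next_discount: float,
    elasticity: float,
) -> float:
    """Forecast q_hat[t] = q[t-1] * ((1 - d[t]) / (1 - d[t-1])) ** elasticity."""
    for discount in (prev_discount, next_discount):
        if not 0.0 <= discount < 1.0:
            raise ValueError("discounts are expected to lie in [0, 1)")
    price_ratio = (1.0 - next_discount) / (1.0 - prev_discount)
    return prev_demand * price_ratio ** elasticity


# Hypothetical article: 100 units sold last week at 20% discount,
# discount raised to 40% next week, group elasticity of -2.
forecast = elasticity_forecast(
    prev_demand=100.0, prev_discount=0.2, next_discount=0.4, elasticity=-2.0
)
```

With a negative elasticity, raising the discount lowers the price ratio below one and therefore scales last week's demand upwards.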
E Simulation Study
In the following, we give more details about the generation of our synthetic data set. Note that, in the main manuscript, we keep our presentation concise and on a high level by omitting the weighting we use for individual components and using a different (but equivalent) parameterization of the noise model.
We simulate a total of 4467 stock keeping units over a period of 100 weeks (i.e. t ∈ {0, 1, . . . , 99}). Demand in a given week t of article i (q_{i,t}) is a linear function of price p_{i,t} and an article-specific factor e_i (treatment effect) as well as a base demand q^{(b)}_{it}, i.e.
\[ q_{it} = q^{(b)}_{it} + p_{it} e_i \tag{13} \]
The base demand q^{(b)}_{it} is built from two time dependent components: a noisy trend τ_{it} that either leads to a linear increase/decrease of demand over the course of the article life cycle, and a seasonality term s_{it}:
\[ q^{(b)}_{it} = (0.15 \cdot \tau_{it} + 0.25 \cdot s_{it} + 1) \cdot c_{it}. \tag{14} \]
c_{it} is the article-specific contribution that consists of two sub-components a_{it} and b_{it}:
\[ c_{it} = 0.05 \cdot a_{it}^2 + 0.25 \cdot a_{it} + 0.5 \cdot b_{it} \tag{15} \]
where
\[ a_{it} = \alpha_{d(i)} + \epsilon_{it} \tag{16} \]
and
\[ \alpha_d \sim \mathcal{N}(10, 3^2) \tag{17} \]
is sampled once for each category d ∈ {1, 2, . . . , 45}. Furthermore ϵ_{it} is independent noise drawn from N(0, 1) and d(i) is a (random) mapping that assigns article i to one of a total of 45 categories.
The contribution of b_{it} is computed analogously to a_{it} using a different setting of hyperparameters and a total of 15 categories:
\[ b_{it} = \beta_{k(i)} + \psi_{it}, \quad \psi_{it} \sim \mathcal{N}(0, 5^2) \tag{18} \]
\[ \beta_k \sim \mathcal{N}(300, 50^2), \quad k \in \{1, 2, \dots, 15\}. \tag{19} \]
The treatment effect e_i in Eq. (13) depends on a random component as well as an article-specific component:
\[ e_i = e_i^{(b)} \cdot 0.15 \cdot \bar{a}_i, \quad e_i^{(b)} \sim \max\left(1.3, \mathcal{LN}(0.75, 0.125^2)\right) \tag{20} \]
where \bar{a}_i = \frac{1}{100} \sum_{t=0}^{99} a_{it}.
Furthermore we chose our initial price p_{i0} such that we avoid q_{it} < 0 at any week t:
\[ p_{i0} \sim \mathcal{N}\left( \frac{\bar{q}_i}{3}, \left( \frac{\bar{q}_i}{1.5} \right)^2 \right), \tag{21} \]
where \bar{q}_i = \frac{1}{100} \sum_{t=0}^{99} q^{(b)}_{it}. We apply additional filtering steps to exclude articles that have negative demand from our synthetic data set. The seasonal component s_{it} in Eq. (14) is a sine function with a period of 30 (weeks) and article-dependent shifts (season types) that are tied to our categorical variable k. In particular, we subdivide the values of k evenly into six subgroups and sample season shifts for each group uniformly over all integers in the interval [−15, 15].
Lastly, our trend component τ_{it} follows a (noisy) linear function, i.e.
\[ \tau_{it} \sim \mathcal{N}(t \cdot \gamma_i, \sigma_{\tau_i}^2) \tag{22} \]
where
\[ \gamma_i \sim \mathcal{U}([-0.02, 0.02]) \tag{23} \]
and
\[ \sigma_{\tau_i} \sim \mathcal{U}([0, 0.15]). \tag{24} \]
With this demand model in place we set our initial stock z_0 such that we clear our simulated inventory in week t = 99 at an average discount rate of 14%, i.e.
\[ z_0 = \frac{1}{100} \sum_{t=0}^{99} q^{(b)}_{it} \cdot (1 - 0.14)\, p_{i0}\, e_i. \tag{25} \]
Moving along, we compute demand for the first four weeks, i.e. q_{it} for t ∈ {0, 1, 2, 3}, keeping the price constant (p_{i0} = p_{i1} = p_{i2} = p_{i3}). For all weeks t we update our stock accordingly:
\[ z_{t+1} = z_t - q_{it} \tag{26} \]
At any given week t ∈ {4, 5, . . . 100}, we compute the expected number of weeks until we run out of stock (m_t), via a basic linear extrapolation:
\[ m_t = \frac{4 z_t}{\sum_{t_j = t-3}^{t} q_{i t_j}}, \tag{27} \]
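The stock extrapolation of Eq. (27) and the price rule described in the main text (lower the price when stock is projected to outlast week 99, raise it otherwise) can be sketched as follows. This is an illustrative reading, not the released simulation code: the function names are hypothetical, and interpreting "10% w.r.t. our base price" as an additive step of 0.1 · p_{i0} per week is an assumption.

```python
from typing import Sequence

SEASON_END = 99  # last simulated week, as stated in the text


def weeks_until_stockout(stock: float, recent_demand: Sequence[float]) -> float:
    """Eq. (27): remaining stock divided by the mean of the recent weekly demands."""
    total = sum(recent_demand)
    if total <= 0:
        return float("inf")  # nothing sold recently: stock never clears at this rate
    return len(recent_demand) * stock / total


def next_price(
    week: int,
    stock: float,
    recent_demand: Sequence[float],
    current_price: float,
    base_price: float,
    step: float = 0.10,
) -> float:
    """Move the price by `step` times the base price (assumed additive step rule)."""
    projected_clearance = week + weeks_until_stockout(stock, recent_demand)
    if projected_clearance > SEASON_END:
        return current_price - step * base_price  # selling too slowly: discount more
    if projected_clearance < SEASON_END:
        return current_price + step * base_price  # selling too fast: discount less
    return current_price


# Invented example: week 50, four recent weeks with 20 units sold each.
fast_seller = next_price(50, 400.0, [20.0] * 4, current_price=100.0, base_price=100.0)
slow_seller = next_price(50, 2000.0, [20.0] * 4, current_price=100.0, base_price=100.0)
```

In the first call the stock is projected to clear in week 70, so the price rises; in the second the projection is week 150, so the price falls.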
Table 3: Overview of synthetic dataset and its usage with the DML Forecaster and TF
Table 4: Overview of real-world dataset and its usage with the DML Forecaster and TF
Characteristic | Synthetic | Cyber week 2019 | Cyber week 2020 | Cyber week 2022
Figure 5: A sample of the difference in forecasting error for the TF vs. DML Forecaster on cyber week 2022, measured on the
off-policy experiment.
networks we present here. Thus, during training, we update the parameters of the embedding layers as part of the same gradient
update that we use to optimize the remaining weights of each model.
We treat discount as a continuous variable – even though inventory managers typically reduce prices by increments of five
percent relative to some baseline (black price). In practice, we need a higher resolution as discounts are recorded as weekly
averages. Depending on the time a discount is updated this can result in rather arbitrary decimals. For instance, if the discount
for a given fashion item is increased from 20% to 25% in the middle of a given week, we would record an aggregated discount level of 22.5%.
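The 22.5% figure in this example is a time-weighted weekly average. A minimal sketch, assuming each discount level is weighted by the number of days it was active (the weighting scheme and the function name `weekly_average_discount` are assumptions made for illustration):

```python
from typing import Sequence, Tuple


def weekly_average_discount(segments: Sequence[Tuple[float, float]]) -> float:
    """Average discount over (discount, days_active) segments of one week."""
    total_days = sum(days for _, days in segments)
    if total_days <= 0:
        raise ValueError("segments must cover a positive amount of time")
    return sum(discount * days for discount, days in segments) / total_days


# 20% for the first half of the week, 25% for the second half.
mid_week_change = weekly_average_discount([(0.20, 3.5), (0.25, 3.5)])
```

A change on any other day of the week would produce a different, equally "arbitrary" decimal, which is why the discount is treated as continuous.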
Further Results
The main cyber-week results are given in Section 4 with the comparison to the control dates given in Table 6. On the 2022 and
2020 cyber weeks, the DML Forecaster performed substantially better across all metrics when compared to the TF model. As
we expected, the TF performed mildly better than the DML Forecaster on policy for the cyber weeks. On the 2019 cyber week,
the DML Forecaster and TF models performed similarly in the off policy test, and the TF was once again better on policy.
To understand the magnitude of the results with regard to the experiment design, we look at Table 6. Indeed, the degree of
change in the error metrics for the TF is not drastic when moving from on policy to off policy for each specific date, when
compared to the 2022 and 2020 cyber week dates. The degree of change in the error metrics for the DML Forecaster when
considering the on-to-off policy shift is similar to the cyber week dates.
Table 6: Table of metrics for control dates. We consider baseline and DML Forecaster for both on policy and off policy evalua-
tion. All models were trained with an L1 loss function. Metrics read from the output of the test epoch.
Table 7: Demand error for the cross-fitting DML Forecaster compared to the sample-splitting DML Forecaster on the cyber
week and control week target dates. Cf = Cross-fitting, Ss = Sample-splitting. Training was done on an AWS Sagemaker
instance of type G4dn.4xlarge. Training time is in hours.
without cross-fitting and compare to the original, on the start date of 21-11-2022, for both in and out of sample performance.
More precisely, we use the even batches with the even nuisance models to provide inference for effect model training, and
similarly for the odd version.
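The difference between the cross-fitted and the non-cross-fitted variants comes down to which nuisance copy scores which item, and the two resulting forecasts are combined at inference with the geometric mean of Eq. (7). The sketch below is illustrative only: the function names are hypothetical, and splitting by the parity of a plain integer index is a simplification of the article-derived index described in Section 3.

```python
import math


def nuisance_copy_for(item_index: int, cross_fit: bool) -> str:
    """Return which nuisance copy ("even" or "odd") scores the given item.

    A copy is named after the half of the data it was trained on. With
    cross-fitting an item is scored by the copy trained on the other half;
    without it, by the copy trained on its own half.
    """
    own_half = "even" if item_index % 2 == 0 else "odd"
    other_half = "odd" if own_half == "even" else "even"
    return other_half if cross_fit else own_half


def geometric_mean_ensemble(forecast_cf: float, forecast_f: float) -> float:
    """Eq. (7): geometric mean of the cross-fitting and forecasting outputs."""
    return math.sqrt(forecast_cf * forecast_f)


# Routing for four example items under both regimes.
routing_cf = {i: nuisance_copy_for(i, cross_fit=True) for i in range(4)}
routing_no_cf = {i: nuisance_copy_for(i, cross_fit=False) for i in range(4)}
```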
As a second experiment, there is a “simplified” version of DML Forecaster (sDML) that does not orthogonalize the treatment function, see for example (Chernozhukov et al. 2018, Equation 1.3). Equation 6 simply becomes
\[ \hat{q}_t = \tilde{q}_t \left( 1 - d_t \right)^{\psi(X)}, \quad 1 \le t \le s, \tag{31} \]
where d_t is the desired discount. We test this model on the same start date of 21-11-2022.
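The two output heads can be compared side by side. In this sketch the elasticity is passed in as a plain number standing in for the transformer output ψ, and the function names and example values are hypothetical.

```python
def dml_head(q_tilde: float, d: float, d_tilde: float, elasticity: float) -> float:
    """Eq. (6): scale the outcome forecast by the residualized price ratio."""
    return q_tilde * ((1.0 - d) / (1.0 - d_tilde)) ** elasticity


def sdml_head(q_tilde: float, d: float, elasticity: float) -> float:
    """Eq. (31): no treatment nuisance model, the raw price factor is used."""
    return q_tilde * (1.0 - d) ** elasticity


# Invented example: the desired discount equals the predicted discount.
q_dml = dml_head(q_tilde=100.0, d=0.3, d_tilde=0.3, elasticity=-2.0)
q_sdml = sdml_head(q_tilde=100.0, d=0.3, elasticity=-2.0)
```

When the desired discount matches the treatment model's prediction, the residualized head of Eq. (6) returns the outcome forecast unchanged, whereas the head of Eq. (31) still applies the full factor (1 − d_t)^ψ, so its effect model has to account for the entire discount level rather than only the deviation from the expected discount.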
As a final experiment, we test the sDML model without cross-fitting.
The goal is that this ablation study will help explain the mechanism by which DML Forecaster improves upon the TF for
out-of-sample forecasting.
There was not a significant difference between cross-fitting and no cross-fitting for either type of DML Forecaster. The sDML model performed worse on- and off-policy for Demand Error and MAE, regardless of cross-fitting. This is explained by the difference in error between the final output and the outcome model. The effect model in the regular DML Forecaster corrected the Demand Error of the outcome model by 6.65 off-policy and by 2.57 on-policy. However, for the sDML model, this correction was significantly smaller off-policy, and the effect model even increased the error on-policy. See Table 8 for the results.
Table 8: Metrics for ablation study for the target date of 21-11-2022. DML-no CF is DML with no cross-fitting, sDML is the
simplified DML without the discount residual. Best Metrics for each block are bold. The effect error correction column is the
difference in error between the outcome model and the output of the effect model (negative numbers indicate a reduction in
error from the outcome model). Metrics read from the output of the test epoch.
3.10 installed.
H Source code
We provide the source code for creating the synthetic data set (sim sub 1.zip) as well as the source code to our models and
their application to simulated data (dml-on-synthetic-data.zip) as part of this submission. We refer to the Readme.md in each
package for further information.
| Baseline Parameter | Cyberweek Data | Simulated Data | Notes |
|---|---|---|---|
| Number of layers | 6 | 13 | number of residual blocks in the encoder and decoder |
| Hidden dimension | 274 | 51 | number of hidden dims in ffn after each attention step |
| Head hidden dimension | 164 | 62 | in the network that computes the normalized demand slopes |
| Dropout | 0.41 | 0.43 | in all ffns and attention weights |
| Learning rate | 0.0034 | 0.0224 | for the AdamW optimizer |
| Weight decay | 0.037 | 1.4 × 10^-4 | regularization parameter |
| Beta 1 | 0.8044 | 0.8566 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.6023 | 0.9140 | decay rate for computing moving average of gradient in the Adam optimizer |
| Number of linear pieces | 2 | 1 | number of pieces for the demand curve |
| Train epochs | 2 | 23 | |
| Total parameters | 1.3M | 124K | |
Table 9: Hyperparameters for the TF with production data (cyberweek) and simulated data
| DML Parameter | Cyberweek Data | Simulated Data | Notes |
|---|---|---|---|
| Outcome model | | | |
| Number of layers | 3 | 5 | number of residual blocks in encoder and decoder |
| Hidden dimension | 130 | 73 | number of hidden dims in ffn after each attention step |
| Dropout | 0.2 | 0.1 | in all ffns and attention weights |
| Learning rate | 0.0096 | 0.0088 | for the RAdam optimizer |
| Weight decay | 1.4 × 10^-4 | 5.4619 × 10^-9 | regularization parameter |
| Beta 1 | 0.7096 | 0.8566 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.8585 | 0.9140 | decay rate for computing moving average of gradient in the Adam optimizer |
| Gamma | 0.9993 | 0.9388 | Exponential decay of learn rate |
| Train epochs | 1 | 44 | |
| Treatment model | | | |
| Number of layers | 2 | 2 | number of residual blocks in encoder and decoder |
| Hidden dimension | 160 | 21 | number of hidden dims in ffn after each attention step |
| Dropout | 0.2 | 0.1497 | in all ffns and attention weights |
| Learning rate | 4.2 × 10^-4 | 0.0162 | for the RAdam optimizer |
| Weight decay | 5.8 × 10^-4 | 2.6 × 10^-9 | regularization parameter |
| Beta 1 | 0.7884 | 0.5977 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.9536 | 0.8740 | decay rate for computing moving average of gradient in the Adam optimizer |
| Gamma | 0.9995 | 0.9549 | Exponential decay of learn rate |
| Train epochs | 1 | 60 | |
| Effect model | | | |
| Number of layers | 5 | 6 | number of residual blocks in encoder and decoder |
| Hidden dimension | 273 | 43 | number of hidden dims in ffn after each attention step |
| Dropout | 0.412 | 0.1750 | in all ffns and attention weights |
| Learning rate | 1.07 × 10^-8 | 0.0491 | for the RAdam optimizer |
| Weight decay | 0.0375 | 5.75 × 10^-9 | regularization parameter |
| Beta 1 | 0.6 | 0.6405 | decay rate for computing moving average of gradient in the Adam optimizer |
| Beta 2 | 0.7044 | 0.6749 | decay rate for computing moving average of gradient in the Adam optimizer |
| Train epochs | 1 | 20 | |
| Total parameters | 1.2M | 118K | |
Table 10: Hyperparameters for the DML Forecaster with production data (cyberweek) and simulated data
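The two learning rate schedules described for the real-world experiments can be written in closed form. The sketch below assumes that the Gamma row of Table 10 is the per-step factor exp(α) of the nuisance-model schedule; that identification, the function names, and the inverse-square-root example value are assumptions rather than statements from the paper.

```python
import math


def exponential_lr(initial_lr: float, gamma: float, step: int) -> float:
    """Nuisance-model schedule: lr_{n+1} = gamma * lr_n, so lr_n = lr_0 * gamma**n."""
    return initial_lr * gamma ** step


def inverse_sqrt_lr(initial_lr: float, step: int) -> float:
    """Effect-model schedule: lr_n = lr / sqrt(n + 1)."""
    return initial_lr / math.sqrt(step + 1)


# Outcome model on cyberweek data (learning rate and Gamma taken from Table 10).
lr_after_1000_steps = exponential_lr(0.0096, 0.9993, 1000)

# Inverse square-root decay for a hypothetical initial rate of 0.01.
lr_after_3_steps = inverse_sqrt_lr(0.01, 3)
```

The exponential schedule shrinks the rate by a fixed fraction per step, while the inverse square-root schedule decays quickly at first and then flattens out.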