Implementation of the Double/Debiased Machine Learning Approach in Python

By Class of Summer Term 2019 | June 18, 2019

Double Machine Learning Implementation


Christopher Ketzler*, Guillermo Morishige*

Abstract: The aim of this paper is to replicate and apply the approach of Chernozhukov et al. (2016) to obtain the causal estimand of interest, the average treatment effect (ATE) $\ \theta_0 $, using Neyman orthogonality and cross-fitting. For observational data, we estimate the causal relationship between eligibility for and participation in a 401(k) plan and its effect on net financial assets; we also apply the method to other datasets to find the effect of the Pennsylvania Reemployment Bonus on unemployment duration and the effect of smoking on medical costs. As proposed by Chernozhukov's Double/Debiased Machine Learning (DML) framework, we estimate the causal effect of a binary treatment on an outcome, the regression parameter in a partially linear regression model, using machine learning (ML) methods to estimate the nuisance parameters $\ \eta_0 $: the dependence of the outcome and of the treatment assignment on the confounding factors (controls).

Keywords: Double machine learning, average treatment effect, Neyman-orthogonality, cross-fitting, partially linear regression model.

  • School of Business and Economics, Humboldt Universität zu Berlin, Spandauer Str. 1, 10178, Berlin, Germany.

1) Introduction

People in the fields of econometrics, epidemiology, and philosophy, just to name a few, have long been interested in modelling causality: drawing conclusions from associations between measurements through statistical analysis. Drawing inferences from these analyses can be tricky, however, since association (correlation) does not imply causation. The word “causation” started to appear in the setting of randomized experiments with Neyman (1923). Fisher (1935) stressed the importance of randomization as the basis for inference. Rubin (1974) extended the framework to non-random assignment mechanisms, which applies not only to experimental but also to observational data.
As computational power increased, innovations in statistical inference followed. New statistical approaches were developed, and these robust models can now handle large datasets with an extensive number of semi-parametric covariates.

In 2016, Victor Chernozhukov et al. introduced “Double/Debiased Machine Learning for Treatment and Structural Parameters” to solve the classic semiparametric problem of inference on a low-dimensional parameter $\ \theta_0 $ in the presence of a high-dimensional nuisance parameter $\ \eta_0 $. A nuisance parameter represents an intermediate step for computing the parameter of interest; here the parameter of interest is the treatment effect on a certain variable, denoted by $\ \theta_0 $. Chernozhukov et al. estimate the nuisance parameters with machine learning estimators.

Only machine learning models that are able to handle high-dimensional cases are applicable, meaning that the entropy of the parameter space grows with the sample size in a sufficiently slow way (traditional framework). The following predictors are employed: random forest, lasso, ridge, deep neural networks, boosted trees, and ensemble models based on at least one of them. This approach reduces the risk of overfitting and finds a suitable trade-off between regularization and bias. By cross-fitting and by using Neyman-orthogonal moment/score functions, Double/Debiased Machine Learning reduces the bias and yields a closer estimate of the treatment effect $\ \theta_0 $. Neyman-orthogonal functions have a lower sensitivity with respect to the nuisance parameters when estimating the treatment effect $\ \theta_0 $.

This blog provides a rough overview of Chernozhukov et al.'s Double Machine Learning approach. We do not aim to clarify all aspects; for further explanations, refer to Chernozhukov et al. (2016). The objective of our work is the implementation of the Double Machine Learning approach in Python. The blog is therefore structured as follows: in section 2) we refer to developments in the machine learning field for average treatment effect estimation; section 3) provides a deeper insight into DML; section 4) contains the empirical test of our code and the interpretation of the results.

2) Literature Review

For unconfounded assignment of the treatment, a number of approaches have been developed throughout the history of statistical inference. Using the inverse of nonparametric estimates of the propensity score for treatment effect estimation was introduced by Hirano et al. (2003). Elizabeth Stuart (2010) considered a wide range of matching methods for comparing the treatment effect between groups with common covariates in an unbiased way. Knaus et al. (2018) used machine learning to simulate data generating processes (DGPs).

Machine learning estimators have been used for the estimation of heterogeneous causal effects across different disciplines. The approaches with the different machine learning methods are: regression trees by Su et al. (2009), random forests by Wager and Athey (2018), lasso by Qian and Murphy (2011), support vector machines by Imai and Ratkovic (2013), boosting by Powers et al. (2018), and neural networks by Johansson et al. (2016).

Focusing specifically on developments of Double Machine Learning, there is an applied study by Knaus (2018): “A Double Machine Learning Approach to Estimate the Effects of Musical Practice on Student's Skills”. He used data from the German National Educational Panel Study (NEPS), Blossfeld and von Maurice (2011).

Chernozhukov et al. (2016) also provide extensions of the model that we are not going to implement. They propose using instrumental variables (IV) in the partially linear model, and they also estimate the average treatment effect on the treated (ATTE) and the local average treatment effect (LATE).

3) Double/Debiased Machine Learning

3.1) Partially Linear Model

The mathematical model that describes the estimation problem is a partially linear equation as suggested by Robinson (1988). It is assumed that the treatment effects are fully heterogeneous and that the treatment variable is binary, $\ D \in \{0,1\} $. We consider the vectors $\ (Y,D,X) $, where $\ Y $ is the outcome variable, $\ D $ the treatment variable, and $\ X $ the covariates.

$\ Y = D\theta_0 + g_0(X) + U, \quad E[U|X,D]=0 $
[Eq. 1.1]

$\ D = m_0(X)+V, E[V|X]=0 $
[Eq. 1.2]


$\ U $ and $\ V $ are disturbances. Our parameter of interest is $\ \theta_0 $, the average treatment effect. The nuisance parameter is $\ \eta_0 = (m_0,g_0) $. The nuisance parameters are estimated using machine learning methods because of the nonparametric nature of the covariates.
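To make the notation concrete, here is a minimal simulation sketch. The functional forms chosen for $\ m_0 $ and $\ g_0 $ and the value $\ \theta_0 = 0.5 $ are our own illustrative assumptions, not taken from the paper; the snippet only gives the estimators below something to run on.

```python
# Simulate data from the partially linear model (Eq. 1.1 and Eq. 1.2) with a
# known treatment effect theta0, so the estimators below can be sanity-checked.
import numpy as np

rng = np.random.default_rng(42)

n, p = 1000, 10              # sample size and number of covariates
theta0 = 0.5                 # true average treatment effect (illustrative)

X = rng.normal(size=(n, p))

def m0(X):                   # nuisance m_0(X): dependence of the treatment on X
    return 1.0 / (1.0 + np.exp(-X[:, 0]))

def g0(X):                   # nuisance g_0(X): dependence of the outcome on X
    return np.sin(X[:, 1]) + 0.5 * X[:, 2] ** 2

D = rng.binomial(1, m0(X)).astype(float)      # binary treatment, Eq. 1.2
Y = D * theta0 + g0(X) + rng.normal(size=n)   # outcome, Eq. 1.1
```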

3.2) Naïve Estimator

A simple way to estimate the treatment effect is to construct a sophisticated machine learning estimator, e.g. a random forest, and to learn the regression function $\ D\theta_0+g_0(X) $. The data is split into a main part of size $\ n $ with indices $\ i\in I $ and an auxiliary part of size $\ N-n $. Then one solves the following equation to get the treatment effect:

$\ \hat{\theta_0} = (\frac{1}{n} \sum_{i \in I}D_i^2)^{-1} \frac{1}{n} \sum_{i \in I}D_i (Y_i-\hat{g}_0(X_i)). $
[Eq. 2.1]
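Continuing the simulated example above, a hedged sketch of this naive estimator: $\ \hat{g}_0 $ is learned on the auxiliary half with a random forest (here, for simplicity, by regressing $\ Y $ on $\ X $; this is one possible variant), and Eq. 2.1 is then solved on the main half.

```python
# Naive estimator (Eq. 2.1), reusing np, n, X, D, Y from the simulation above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

idx_main, idx_aux = train_test_split(np.arange(n), test_size=0.5, random_state=0)

g_hat = RandomForestRegressor(n_estimators=200, random_state=0)
g_hat.fit(X[idx_aux], Y[idx_aux])                   # learn g_0 on the auxiliary part

resid = Y[idx_main] - g_hat.predict(X[idx_main])    # Y_i - g_hat(X_i) on the main part
D_main = D[idx_main]
theta_naive = (D_main @ resid) / (D_main @ D_main)  # Eq. 2.1
print(f"naive theta estimate: {theta_naive:.3f}")
```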




By decomposing the scaled estimation error of the treatment effect estimate $\ \hat{\theta}_0 $, one can visualize the impact of the bias introduced while learning the ML estimator $\ \hat{g}_0 $.

$\ \sqrt{n}(\hat{\theta}_0 - \theta_0) = \underbrace{ \left( \frac{1}{n} \sum_{i \in I} D_i^2 \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I} D_i U_i }_{a} + \underbrace{ \left( \frac{1}{n} \sum_{i \in I} D_i^2 \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I} D_i \left( g_0(X_i) - \hat{g}_0(X_i) \right) }_{b} $
[Eq. 2.2]


Term $\ a $, under mild conditions, satisfies $\ a \rightsquigarrow N(0, \bar{\Sigma}) $, while $\ b $ is the regularization bias term.

Fig.1: The left panel visualizes the simulated distribution of the treatment effect estimate computed by a conventional (non-orthogonal or naive) ML estimator. The estimator is badly biased: the distribution is shifted too far to the right. The right panel shows the behavior of an orthogonal DML estimator, which is unbiased.


Victor Chernozhukov et al. (2016) provide two algorithms, called DML 1 and DML 2, for computing the nuisance parameter $\ \eta_0 $ based on the predictions of a machine learning model. The authors assume that the true value of the nuisance parameter can be estimated by training a machine learning model on only a part of the data. The algorithms use orthogonalization to compute an orthogonalized regressor for the computation of $ \theta_0 $.

3.3) DML1 - Algorithm

  1. Split the dataset into $\ K $ random folds of observations. For each fold, the remaining folds form the complement set; the current fold is called the main set/fold.
  2. For each fold, create a machine learning estimator and train it on the complement set. The choice of model is subject to the conditions on admissible ML methods discussed above.
  3. For each main fold and its corresponding nuisance estimator, construct the theta estimator; each fold provides its own theta value. One can use either of the two theta estimators given below. Each fold-wise estimator solves the following equation:

    $\ E_{n,k} [ \varphi (W; \hat{ \theta }_{0,k} , \hat{\eta}_{0,k}) ] \ = 0, \quad k = 1, \dots, K $
    [Eq. 3.1]
  4. Compute the mean over the theta values.

    $\ \tilde{ \theta}_0 = \frac{1}{K} \sum_{k=1}^K \hat{\theta}_{0,k} $
    [Eq. 3.2]


    The theta estimator contains a term for partialling out the effect of $\ X $ from $\ D $ in order to compute the orthogonalized regressor $\ V = D - m_0(X) $. The nuisance parameter $\ m_0(X) $ is represented by a machine learning model, which is trained on the complement set in the DML algorithms. This auxiliary prediction problem, estimating the conditional mean of $\ D $ given $\ X $, is what the authors call “double prediction” or “double machine learning”.

    Two theta estimators are provided by the authors. Both are computed on the main sample of the data.

    $\ \hat{\theta}_0 = \left( \frac{1}{n} \sum_{i \in I} \hat{V}_i D_i \right)^{-1} \frac{1}{n} \sum_{i \in I} \hat{V}_i ( Y_i - \hat{g}_0(X_i)) $
    [Eq. 4.1]


    or

    $\ \hat{\theta}_0 = \left( \frac{1}{n} \sum_{i \in I} \hat{V}_i \hat{V}_i \right)^{-1} \frac{1}{n} \sum_{i \in I} \hat{V}_i ( Y_i - \hat{l}_0(X_i)) $, with $\ l_0(X) = E[Y|X] $
    [Eq. 4.2]


    By the orthogonalization of $\ D $ with respect to $\ X $, and by subtracting the nuisance estimate $\ \hat{g}_0 $, the direct effect of the confounders is partialled out. This theta estimator reduces the effect of the regularization bias that affects the naive estimator (Eq. 2.2). The first theta estimator can clearly be interpreted as a linear instrumental variable estimator; the second one is based on Robinson (1988) and yields a debiased estimator. A Python sketch of DML 1 with the second (partialling-out) estimator is given below.
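The following is a minimal sketch of DML 1 for the partially linear model, written by us around the partialling-out estimator of Eq. 4.2; the function name dml1_plr and the use of LassoCV as nuisance learner are illustrative choices, and any regressor with fit/predict could be plugged in.

```python
# DML 1 sketch: K-fold cross-fitting, fold-wise theta estimates via the
# partialling-out score (Eq. 4.2), and their mean (Eq. 3.2).
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dml1_plr(Y, D, X, learner_l, learner_m, n_folds=5, random_state=0):
    """DML 1 for the partially linear model; returns the averaged theta."""
    thetas = []
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    for aux_idx, main_idx in kf.split(X):                     # aux = complement set, main = current fold
        l_hat = clone(learner_l).fit(X[aux_idx], Y[aux_idx])  # estimates l_0(X) = E[Y|X]
        m_hat = clone(learner_m).fit(X[aux_idx], D[aux_idx])  # estimates m_0(X) = E[D|X]
        V = D[main_idx] - m_hat.predict(X[main_idx])          # orthogonalized regressor
        W = Y[main_idx] - l_hat.predict(X[main_idx])
        thetas.append((V @ W) / (V @ V))                      # fold-wise solution of Eq. 4.2
    return np.mean(thetas)                                    # Eq. 3.2

# Example call on the simulated data from section 3.1:
# theta_dml1 = dml1_plr(Y, D, X, LassoCV(cv=5), LassoCV(cv=5))
```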



    The second DML algorithm works similarly, but it does not have to aggregate fold-wise theta values. It is recommended over DML 1 because the pooled empirical Jacobian behaves more stably. The recommended number of folds is four or five; these lead to better results than, for example, two.

    3.4) DML2 - Algorithm

    1. Split the dataset into $\ K $ random folds of observations. For each fold, the remaining folds form the complement set; the current fold is called the main set/fold.
    2. For each fold, create a machine learning estimator and train it on the complement set. The choice of model is subject to the same conditions as for DML 1.
    3. Construct a single theta estimator as the solution of the following pooled equation (a sketch follows below):

      $\ \frac{1}{K} \sum_{k=1}^K E_{n,k} [ \varphi (W; \tilde{ \theta }_0 , \hat{\eta}_{0,k}) ] \ = 0 $
      [Eq. 5]
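For comparison, a hedged sketch of DML 2 with the same partialling-out score, reusing the imports and conventions of the DML 1 sketch above: the cross-fitted residuals are pooled over all folds and the empirical moment condition (Eq. 5) is solved only once.

```python
# DML 2 sketch: pool the cross-fitted residuals across folds and solve the
# pooled moment condition (Eq. 5) a single time instead of averaging thetas.
def dml2_plr(Y, D, X, learner_l, learner_m, n_folds=5, random_state=0):
    V = np.zeros(len(D))
    W = np.zeros(len(Y))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    for aux_idx, main_idx in kf.split(X):
        l_hat = clone(learner_l).fit(X[aux_idx], Y[aux_idx])
        m_hat = clone(learner_m).fit(X[aux_idx], D[aux_idx])
        V[main_idx] = D[main_idx] - m_hat.predict(X[main_idx])
        W[main_idx] = Y[main_idx] - l_hat.predict(X[main_idx])
    return (V @ W) / (V @ V)       # one pooled solution instead of a mean of fold-wise thetas
```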


      To find the best estimator among all of them, we consider the ones that yield the smallest squared errors. Combining different machine learning methods for the two nuisance estimations could yield better results, since some nuisance parameters may be estimated better with one kind of machine learning method than with another.

      3.5) Sample Splitting

      Sample splitting removes the bias induced by overfitting. After the split, the nuisance models for Y and D are fitted on the auxiliary sample and evaluated on the main sample. The effect of X on D is partialled out via $\ V = D - m(X) $, so that the estimate $ \hat{ \theta} $ does not pick up this bias. Then, on the main fold, $ \hat{ \theta} $ is calculated following formulas (4.1, 4.2). To avoid the efficiency loss of using only part of the data, the procedure is repeated with the roles of the folds swapped, so that every fold serves once as the main fold, and the single theta values are aggregated; in this way the algorithm reaches full efficiency in estimating the treatment effect. Chernozhukov et al. (2016) call this procedure cross-fitting; a compact illustration follows below.
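As a compact illustration of cross-fitting (scikit-learn shorthand, not the authors' implementation), cross_val_predict yields the out-of-fold nuisance predictions directly: every observation is predicted by models that never saw it during training.

```python
# Cross-fitting in one call per nuisance function; combined with the
# partialling-out formula this is essentially the DML 2 estimator above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

m_hat_oof = cross_val_predict(RandomForestRegressor(random_state=0), X, D, cv=4)
l_hat_oof = cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=4)

V = D - m_hat_oof                                   # orthogonalized regressor
theta_crossfit = (V @ (Y - l_hat_oof)) / (V @ V)    # pooled solution
```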

      Fig.2: The left panel shows an overfitted estimator whose theta distribution is shifted. The right panel visualizes a distribution without this kind of bias, thanks to sample splitting.


      The following figure describes the effect of cross-fitting on the performance. The left panel visualizes the finite-sample distribution of the treatment effect estimate computed with the partially linear model of Robinson (1988) without sample splitting. The right panel illustrates the same case with sample splitting. One can clearly see that the bias is eliminated.

      For more detailed approaches like moment condition models with linear scores or non-linear scores, please refer to the original paper of Victor Chernozhukov et al. (2016).

      4) Empirical Examples


      In the following section we provide an overview of the three datasets we use to evaluate our implementation. Two of them are used in the paper of Chernozhukov et al. (2016), so we are able to compare our results with those of the original paper. The third one is a Kaggle dataset. Because of the long computation time we abstain from parameter tuning and fit the models with their default hyperparameters. We compared the naïve approach and the DML 1 algorithm based on neural networks, decision trees, and Lasso, splitting each data set into four folds; a sketch of this learner setup follows below.
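A hedged sketch of how this learner setup could be wired up: all estimators keep their scikit-learn defaults (no tuning), the dictionary keys mirror the result tables below, and dml1_plr refers to the sketch from section 3.3.

```python
# Illustrative learner configuration with default hyperparameters (no tuning).
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

nuisance_learners = {
    "Random Forest":  RandomForestRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0),
    "Lasso":          Lasso(),
    "Decision Tree":  DecisionTreeRegressor(random_state=0),
    "Extra Trees":    ExtraTreesRegressor(random_state=0),
}

# results = {name: dml1_plr(Y, D, X, est, est, n_folds=4)
#            for name, est in nuisance_learners.items()}
```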

      The 401(k) plan


      The aim is to estimate the average treatment effect of participating in the 401(k) plan on net financial assets Y. A 401(k) plan is a pension account in the United States; participants pay contributions based on their income. Not every firm offers this kind of retirement plan, and participation is voluntary. The idea behind this dataset is that when the 401(k) program started, people made job decisions based on income and other job characteristics rather than on the retirement offer. The main problem of the observational dataset is that it lacks random assignment. To partial that out and treat the assignment as exogenous, one must condition on income and on other variables related to job choice that might be associated with whether the firm offers a 401(k) plan, following the argument of Poterba et al. (1994). The dataset is based on the Survey of Income and Program Participation of 1991. Among the covariates we find: age of the head of the household, household income, household size, years of education of the head of the household, a married indicator, a two-earner status indicator, a home ownership indicator, etc.
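A hedged sketch of how this dataset could be prepared. The file name and the column names (net_tfa for net financial assets, p401 for participation) follow the commonly circulated SIPP 1991 extract, e.g. the one shipped with the DoubleML package; they are assumptions and should be checked against the copy actually used.

```python
# Hypothetical data preparation for the 401(k) example; column names are an
# assumption based on the commonly used SIPP 1991 extract.
import pandas as pd

df = pd.read_csv("401k.csv")                    # assumed local file name
Y = df["net_tfa"].to_numpy(dtype=float)         # outcome: net financial assets
D = df["p401"].to_numpy(dtype=float)            # treatment: 401(k) participation
covariates = ["age", "inc", "fsize", "educ", "marr", "twoearn", "db", "pira", "hown"]
X = df[covariates].to_numpy(dtype=float)        # controls described above
```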

      Comparing our results with those of the original paper, our treatment effects are too high. Moreover, while the results of the original paper lie close to each other, ours differ too much. This could be caused by the missing parameter tuning. Our implementation also leads to a relatively high standard deviation, meaning that there is still a high bias in our models which the DML algorithm should have eliminated.

| Given Estimator     | Nuisance Estimator | Mean     | Median   | Standard deviation |
|---------------------|--------------------|----------|----------|--------------------|
| Naïve Estimator     | Random Forest      | 13226.91 | 13196.08 | 1878.16            |
| Neural Network      | Neural Network     | -5577.51 | 1111.23  | 128477.81          |
| Dictionary Learning | Lasso              | 5506.19  | 5532.03  | 114.51             |
| Decision Tree       | Decision Tree      | 16872.88 | 16830.74 | 1752.67            |
| Extra Trees         | Decision Tree      | 16845.95 | 16927.08 | 1784.50            |


      All treatment effect distribution plots show a Gaussian shape. Our neural network seems to overfit, producing some large outliers which strongly affect the median and mean. The decision tree used as nuisance estimator in the Extra Trees case generates a well-behaved Gaussian distribution. If one compares this plot with the plot of the naïve estimator, one sees that the distributions are shifted by about 4000. We think that better parameter tuning could solve this problem.
      Fig.3.1: treatment effect distribution: Naive Estimator
      Fig.3.2: treatment effect distribution: given Estimator: Dictionary Learning; Nuisance Estimator: Lasso
      Fig.3.3: treatment effect distribution: given Estimator: Decision Tree; Nuisance Estimator: Decision Tree
      Fig.3.4: treatment effect distribution: given Estimator: Neural Network; Nuisance Estimator: Neural Network
      Fig.3.5: treatment effect distribution: given Estimator: ExtraTreesRegressor (Ensemble); Nuisance Estimator: Decision Tree

      Pennsylvania Reemployment Bonus experiment dataset


      We try to find the effect of the reemployment bonus on unemployment duration. This dataset was used in the original paper of Chernozhukov et al. (2016). The Pennsylvania Reemployment Bonus experiment dataset contains observations from a test of the incentive effects of alternative compensation schemes for unemployment insurance, carried out by the US Department of Labor in the 1980s and previously analyzed by Bilias (2000) and Bilias and Koenker (2002). Five groups of unemployed persons were defined: one control group, which was still treated under the standard rules of unemployment insurance, and four treatment groups, all randomly assigned. Treated persons received a cash bonus if they found a job, with the bonus amount differing by certain criteria. The given dataset only contains persons who belong to treatment group 4; this group received a high bonus amount but had to take part in a longer qualification period. We only consider the 4th group and the control group, in order to have a binary treatment variable D. Participants could take part in special workshops. The outcome variable captures the effect of this support on the duration of the unemployment period, and the vector of covariates includes age, gender, race, number of dependents, location, type of occupation, etc.

      Our approach performed better on this data set. Our results are close to those of the original paper. The small differences between median and mean indicate a low number of outliers and better predictions. In this scenario the decision tree and the neural network seem to perform best, providing results close to those of the original paper. Parameter tuning should further increase performance and accuracy. This data set is larger than the 401(k) data set, which could explain the better performance.

| Given Estimator     | Nuisance Estimator | Mean   | Median | Standard deviation |
|---------------------|--------------------|--------|--------|--------------------|
| Naïve Estimator     | Random Forest      | -0.090 | -0.092 | 0.046              |
| Neural Network      | Neural Network     | -0.085 | -0.084 | 0.010              |
| Dictionary Learning | Lasso              | -0.104 | -0.104 | 0.001              |
| Decision Tree       | Decision Tree      | -0.092 | -0.091 | 0.023              |
| Extra Trees         | Decision Tree      | -0.084 | -0.087 | 0.021              |


      Only the naïve estimator does not generate a Gaussian distribution; the remaining estimators behave as expected. Only the Lasso model stands out, producing estimates that deviate noticeably from the rest (and a standard deviation roughly ten times smaller than that of the other models). We do not have an explanation for this result. The DML 1 algorithm based on a decision tree or a neural network seems to perform best.
      Fig.4.1: treatment effect distribution: Naive Estimator
      Fig.4.2: treatment effect distribution: given Estimator: Dictionary Learning; Nuisance Estimator: Lasso
      Fig.4.3: treatment effect distribution: given Estimator: Decision Tree; Nuisance Estimator: Decision Tree
      Fig.4.4: treatment effect distribution: given Estimator: Neural Network; Nuisance Estimator: Neural Network
      Fig.4.5: treatment effect distribution: given Estimator: ExtraTreesRegressor (Ensemble); Nuisance Estimator: Decision Tree

      The Medical Cost Personal Dataset


      This dataset is used in Brett Lantz's (2013) book “Machine Learning with R”. It was published on Kaggle and describes individual medical costs together with features like body mass index, sex, smoking status, etc. We use whether a person smokes as the treatment variable and compute the effect of smoking on the medical costs incurred.
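A hedged sketch of the data preparation; the file name insurance.csv and the column names are the ones used on Kaggle, but they should be verified against the local copy.

```python
# Medical Cost Personal data: smoker indicator as treatment, charges as outcome.
import pandas as pd

df = pd.read_csv("insurance.csv")
Y = df["charges"].to_numpy(dtype=float)                    # outcome: medical costs
D = (df["smoker"] == "yes").astype(float).to_numpy()       # treatment: smoker indicator
X = pd.get_dummies(df.drop(columns=["charges", "smoker"]),
                   drop_first=True).to_numpy(dtype=float)  # remaining covariates
```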

      This data set is not part of the original paper; therefore, we can only guess how well our algorithm performs. All results are close to each other, which we interpret as a good sign for performance and accuracy. The median and mean values are close together, so we expect a low number of outliers.

| Given Estimator     | Nuisance Estimator | Mean     | Median   | Standard deviation |
|---------------------|--------------------|----------|----------|--------------------|
| Naïve Estimator     | Random Forest      | 24392.85 | 24366.45 | 949.12             |
| Neural Network      | Neural Network     | 31347.34 | 31324.24 | 1147.35            |
| Dictionary Learning | Lasso              | 23870.46 | 23841.01 | 94.51              |
| Decision Tree       | Decision Tree      | 31483.46 | 31344.05 | 1082.80            |
| Extra Trees         | Decision Tree      | 31502.27 | 31541.10 | 1171.23            |


      The generated plots behave as expected. If one compares the naïve estimator with the remaining ones, one can see the bias eliminated by the DML algorithm. We think that the estimators fitted this data set very well.

      Fig.5.1: treatment effect distribution: Naive Estimator
      Fig.5.2: treatment effect distribution: given Estimator: Dictionary Learning; Nuisance Estimator: Lasso
      Fig.5.3: treatment effect distribution: given Estimator: Decision Tree; Nuisance Estimator: Decision Tree
      Fig.5.4: treatment effect distribution: given Estimator: Neural Network; Nuisance Estimator: Neural Network
      Fig.5.5: treatment effect distribution: given Estimator: ExtraTreesRegressor (Ensemble); Nuisance Estimator: Decision Tree

      5) Conclusion


      This blog post gave a brief introduction to the Double Machine Learning approach. We explained the basic idea, referred to the original paper of Victor Chernozhukov et al. for further reading, and implemented parts of the approach in Python. By applying the code to the different data sets we evaluated our work. The next steps would be the implementation of the DML 2 algorithm and of the extensions named in the paper, the application of the algorithms to other data sets, and the use of further machine learning models. We would also use parameter tuning to improve accuracy.

      The entire code:

      References


      [01] Abadie, Alberto. “Semiparametric Instrumental Variable Estimation of Treatment Response Models.” Journal of Econometrics, vol. 113, no. 2, 2003, pp. 231–263., doi:10.1016/s0304-4076(02)00201-4.

      [02] Bilias, Yannis. “Sequential Testing of Duration Data: the Case of the Pennsylvania ‘Reemployment Bonus’ Experiment.” Journal of Applied Econometrics, vol. 15, no. 6, 2000, pp. 575–594., doi:10.1002/jae.579.

      [03] Bilias, Yannis, and Roger Koenker. “Quantile Regression for Duration Data: A Reappraisal of the Pennsylvania Reemployment Bonus Experiments.” Economic Applications of Quantile Regression, 2002, pp. 199–220., doi:10.1007/978-3-662-11592-3_10.

      [04] Blossfeld, Hans-Peter, and Jutta Von Maurice. “Education as a Lifelong Process.” Zeitschrift Für Erziehungswissenschaft, vol. 14, no. S2, 2011, pp. 19–34., doi:10.1007/s11618-011-0179-2.

      [05] Chernozhukov, Victor, et al. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” 2016. Econometrics Journal, vol. 21, no. 1, pp. C1–C68., doi:10.1111/ectj.12097.

      [06] Chernozhukov, Victor, et al. “Double/Debiased/Neyman Machine Learning of Treatment Effects .” 2017. American Economic Review, vol. 107, no. 5, pp. 261–265., doi:10.1257/aer.p20171038.

      [07] Fisher, R. A., and Harold Hotelling. “The Design of Experiments.” Journal of the American Statistical Association, vol. 30, no. 192, 1935, p. 771., doi:10.2307/2277749.

      [08] Glymour, Clark. “Causation and Statistical Inference.” The Oxford Handbook of Causation, Edited by Helen Beebee et al., Nov. 2010, doi:10.1093/oxfordhb/9780199279739.003.0024.

      [09] Hirano, Keisuke, et al. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica, vol. 71, no. 4, 2003, pp. 1161–1189., doi:10.1111/1468-0262.00442.

      [10] Hitchcock, Christopher. “Causal Modelling.” The Oxford Handbook of Causation, Edited by Helen Beebee et al., Nov. 2009, doi:10.1093/oxfordhb/9780199279739.003.0015.

      [11] Imai, Kosuke, and Marc Ratkovic. “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation.” The Annals of Applied Statistics, vol. 7, no. 1, 2013, pp. 443–470., doi:10.1214/12-aoas593.

      [12] Imbens, Guido W., and Donald B. Rubin. "Causal Inference for Statistics, Social, and Biomedical Sciences: an Introduction." Cambridge University Press, 2015.

      [13] James, Gareth, et al. "An Introduction to Statistical Learning: with Applications in R." Springer, 2017.

      [14] Johansson, F., et. al. “Learning representations for counterfactual inference.” In International conference on machine learning, 2016, pp. 3020–3029.

      [15] Knaus, Michael C. “A Double Machine Learning Approach to Estimate the Effects of Musical Practice on Student’s Skills.” 2018. Swiss Institute for Empirical Economic Research (SEW), University of St. Gallen.

      [16] Knaus, Michael C., et. al. “Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence” 2018. Swiss Institute for Empirical Economic Research (SEW), University of St. Gallen.

      [17] Lantz, Brett. "Machine Learning with R: Learn How to Use R to Apply Powerful Machine Learning Methods and Gain an Insight into Real-World Applications." Packt Publishing, 2013.

      [18] Neyman, J. “c(α) tests and their use.” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 1 July 1979, vol. 41, pp.1-21.

      [19] Poterba, Venti, and Wise. “Do 401(k) contributions crowd out other personal saving?” Journal of Public Economics 58, 1994, pp. 1–32.

      [20] Poterba, Venti, and Wise. “401(k) plans and tax-deferred savings.” Studies in the Economics of Aging, Chicago: University of Chicago Press, 1994, pp. 105–142.

      [21] Powers, Scott, et al. “Some Methods for Heterogeneous Treatment Effect Estimation in High Dimensions.” Statistics in Medicine, vol. 37, no. 11, 2018, pp. 1767–1787., doi:10.1002/sim.7623.

      [22] Qian, Min, and Susan A. Murphy. “Performance Guarantees for Individualized Treatment Rules.” The Annals of Statistics, vol. 39, no. 2, 2011, pp. 1180–1210., doi:10.1214/10-aos864.

      [23] Robinson, P. M. “Root-N-Consistent Semiparametric Regression.” Econometrica, vol. 56, no. 4, 1988, p. 931., doi:10.2307/1912705.

      [24] Rubin, Donald B. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology, vol. 66, no. 5, 1974, pp. 688–701., doi:10.1037/h0037350.

      [25] Speed, T. P. “Introductory Remarks on Neyman (1923).” Statistical Science, vol. 5, no. 4, 1990, pp. 463–464., doi:10.1214/ss/1177012030.

      [26] Stuart, Elizabeth A. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science, vol. 25, no. 1, 2010, pp. 1–21., doi:10.1214/09-sts313.

      [27] Su, Xiaogang, et al. “Subgroup Analysis via Recursive Partitioning.” SSRN Electronic Journal, 2009, doi:10.2139/ssrn.1341380.

      [28] Wager, Stefan, and Susan Athey. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association, vol. 113, no. 523, 2018, pp. 1228–1242., doi:10.1080/01621459.2017.1319839.

      [29] Fan, Jianqing, et al. “Variance Estimation Using Refitted Cross-Validation in Ultrahigh Dimensional Regression.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 74, no. 1, 2011, pp. 37–65., doi:10.1111/j.1467-9868.2011.01005.x.