Instrumental variables and two-stage covariates
I love papers that make you go “This is obvious in hindsight. Why did nobody try it before?” Some NLP folks made this remark about Transformers (“Attention Is All You Need”). To me, “The Blessings of Multiple Causes” evokes a similar feeling: why didn’t people create synthetic instruments?
Prerequisites
This article assumes basic familiarity with causal DAGs and causal inference. If you are unfamiliar with those terms, read the last third of my previous article on causality to get a brief overview.
Additionally, while there are many ways of creating synthetic instruments, this article uses probabilistic PCA as it’s one of the most generic methods. If you don’t know much about PCA, I suggest reading this wonderful and intuitive explanation.
What Are Instruments?
Instrumental variables have been around for a long time and are one of the backbones of econometrics. We are concerned with the following DAG: X → Y → Z, with U pointing into both Y and Z.
U is an unobserved confounder. We want to estimate the causal effect of Y on Z, denoted Y → Z. Because X shares no common cause with Y or Z in this DAG, we can estimate X → Y and X → Z without confounding. Assuming linearity, we have X → Z = (X → Y)(Y → Z), and we can solve for Y → Z algebraically: Y → Z = (X → Z) / (X → Y).
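The ratio argument can be checked on simulated data. The sketch below is only an illustration: the coefficients (1.5, 3.0, and a true Y → Z effect of 2.0) are made-up values, and the data are generated from a linear version of the DAG described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 2.0                      # hypothetical Y -> Z effect

u = rng.normal(size=n)                 # unobserved confounder
x = rng.normal(size=n)                 # instrument, independent of u
y = 1.5 * x + u + rng.normal(size=n)   # X -> Y and U -> Y
z = true_effect * y + 3.0 * u + rng.normal(size=n)  # Y -> Z and U -> Z


def slope(a, b):
    """OLS slope of b regressed on a (with an intercept)."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)


naive = slope(y, z)                    # confounded by u
x_to_y = slope(x, y)                   # first stage
x_to_z = slope(x, z)                   # reduced form
iv = x_to_z / x_to_y                   # (X -> Z) / (X -> Y)

print(f"naive OLS: {naive:.3f}, IV ratio: {iv:.3f}, truth: {true_effect}")
```

The naive regression of Z on Y absorbs part of U's effect, while the ratio of the two X regressions does not, because X is independent of U in this simulation.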
In this setup, we call X an instrumental variable if:
- X is strongly correlated with Y (strong first stage)
- X affects Z only through Y (exclusion restriction)
In practice, finding an instrument is difficult. If X is only weakly correlated with Y, then the causal effect estimates can be severely biased. The second condition cannot be tested or verified, so we can only debate the reasonableness of assumptions, such as “is mayonnaise an instrument?”
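To see what a weak first stage does, here is a small simulation sketch. Everything in it is an assumption made for illustration: the sample size, the number of replications, the first-stage strengths (1.5 versus 0.05), and the helper name `iv_estimates`.

```python
import numpy as np


def iv_estimates(first_stage, n=200, reps=2000, true_effect=2.0, seed=0):
    """Simulate `reps` datasets and return the ratio IV estimate for each one."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(reps, n))                  # unobserved confounder
    x = rng.normal(size=(reps, n))                  # instrument
    y = first_stage * x + u + rng.normal(size=(reps, n))
    z = true_effect * y + 3.0 * u + rng.normal(size=(reps, n))
    xc = x - x.mean(axis=1, keepdims=True)
    # Both sample covariances carry the same 1/(n-1) factor, so it cancels.
    cov_xz = (xc * z).sum(axis=1)
    cov_xy = (xc * y).sum(axis=1)
    return cov_xz / cov_xy


strong = iv_estimates(first_stage=1.5)
weak = iv_estimates(first_stage=0.05)

strong_err = np.median(np.abs(strong - 2.0))
weak_err = np.median(np.abs(weak - 2.0))
print(f"median |error|, strong instrument: {strong_err:.3f}")
print(f"median |error|, weak instrument:   {weak_err:.3f}")
print(f"median estimate, weak instrument:  {np.median(weak):.3f}")
```

With the weak instrument the denominator of the ratio is mostly sampling noise, so the estimates are far less reliable than with the strong one.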
Innovation requires us to rethink existing approaches. The clever idea here:
If finding instruments is hard, why not create instruments?
The Problem Setup
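The original paper is titled “The Blessings of Multiple Causes” because multiple causation is a necessary condition for creating our own instruments. The DAG looks something like this: a shared unobserved confounder U points into every cause Y and into the outcome Z, and each Y points into Z.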
We want to infer the causal effect of each Y on Z in the presence of a shared unobserved confounder U. Think of the Y’s as the columns of the design matrix, i.e. the predictors in a regression model. We assume no interference and no interaction: an oversimplification for sure, but pragmatic.
This DAG looks restrictive, but it’s reasonable enough for many real-world processes (refer to the paper for examples). More importantly, counter to intuition, the problem is much easier than the single-cause case, where U confounds a single Y and Z:
This “simpler” DAG is impossible to solve. We can’t get an unbiased estimate of the causal effect.
Moreover, the problem assumes there is no unobserved single-cause confounder, i.e. a variable V that affects only one of the Y’s along with Z. The existence of such an unobserved V can mess up our estimates.
However, as the paper notes, assuming that no such V exists is more comfortable than assuming ignorability. Most causal analyses using covariate adjustment assume that we have no unmeasured confounders at all. Here we allow for an unmeasured common confounder, just no unmeasured single-cause confounders.
Creating Instruments
Now we will see why multiple causation is necessary. We want to create a common instrument X. This X is a local variable, i.e. each observation has a unique X vector (more on this in a minute).
For example, we can fit PCA and select the first few principal components as our X. By construction, the exclusion restriction is satisfied because the PCA does not use Z. We don’t have to squabble over this assumption.
However, if the PCA is trained using the entire dataset, we don’t know whether or not X is a good instrument. We can get principal components out of random noise but X won’t be a good predictor of Y. Furthermore, reusing the Y’s to estimate both the instrument and causal effect is philosophically problematic and will lead to overfitting.
One avenue of thought is to split the dataset into two, fit PCA on one half, and perform inference on the other half. This leads to a dead end. On unseen data (the second half), our best guess for each observation is X = the zero vector, which reduces to OLS regression. The whole point of looking for a new methodology is that we know OLS doesn’t work.
The paper approaches the problem by deleting random Y’s for each observation (say, half). Classical PCA cannot handle missing values. Probabilistic PCA (PPCA), its Bayesian counterpart, handles missing values just fine.
PCA models the y vector of each observation using y = Wx, where W is the matrix of factor loadings and x is the vector of latent variables. PPCA assumes the existence of a Gaussian error term ε such that y = Wx + ε. The generative model:
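Written out in this notation, and assuming the usual standard-normal prior on the latent vector and centered data (neither is stated explicitly in the text above), the generative model for one observation is:

```latex
\begin{aligned}
x &\sim \mathcal{N}(0,\; I) \\
\varepsilon &\sim \mathcal{N}(0,\; \sigma^2 I) \\
y \mid x &\sim \mathcal{N}(W x,\; \sigma^2 I)
\end{aligned}
```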
As σ² → 0, we get PCA. And if we replace σ²I with a diagonal matrix that allows the diagonal values to vary, we get factor analysis.
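One way to see the relationship is through the marginal distribution of y after integrating out x. The sketch below assumes a standard-normal prior on x and centered data; the symbol Ψ for the diagonal noise matrix is notation introduced here, not taken from the article.

```latex
\begin{aligned}
\text{PPCA:} \quad & y \sim \mathcal{N}\!\left(0,\; W W^{\top} + \sigma^2 I\right) \\
\text{Factor analysis:} \quad & y \sim \mathcal{N}\!\left(0,\; W W^{\top} + \Psi\right),
\qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_d)
\end{aligned}
```

PPCA forces every coordinate of y to share one noise variance, while factor analysis gives each coordinate its own.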
So:
- We can estimate the latent vector X for each observation using half of the Y’s. By construction, X satisfies the exclusion restriction.
- We can use the deleted half to evaluate X’s predictive performance. In other words, given the observed half, can we predict the deleted half? We can test for a strong first stage.
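The two steps above can be sketched with a small EM routine for PPCA with missing entries. This is a minimal illustration on simulated data, not the paper's implementation: the function names `e_step` and `fit_ppca_missing` are hypothetical, the column means are fixed at their observed averages as a simplification, and the held-out check is a plain mean squared error against a predict-the-mean baseline.

```python
import numpy as np


def e_step(R, mask, W, sigma2):
    """Posterior mean and covariance of each latent vector given its observed entries."""
    n, k = R.shape[0], W.shape[1]
    X = np.zeros((n, k))
    C = np.zeros((n, k, k))
    for i in range(n):
        Wo = W[mask[i]]                                # loadings of observed columns
        M_inv = np.linalg.inv(Wo.T @ Wo + sigma2 * np.eye(k))
        X[i] = M_inv @ Wo.T @ R[i, mask[i]]
        C[i] = sigma2 * M_inv
    return X, C


def fit_ppca_missing(Y, mask, k, n_iter=100, seed=0):
    """EM for PPCA using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    # Simplification: column means come from observed entries and stay fixed.
    mu = np.array([Y[mask[:, j], j].mean() for j in range(d)])
    R = np.where(mask, Y - mu, 0.0)                    # centered, zero where missing
    W = rng.normal(scale=0.1, size=(d, k))
    sigma2 = 1.0
    for _ in range(n_iter):
        X, C = e_step(R, mask, W, sigma2)
        for j in range(d):                             # M-step, one loading row at a time
            rows = mask[:, j]
            A = C[rows].sum(axis=0) + X[rows].T @ X[rows]
            W[j] = np.linalg.solve(A, X[rows].T @ R[rows, j])
        resid = np.where(mask, R - X @ W.T, 0.0)
        extra = np.einsum("jk,ikl,jl->ij", W, C, W)    # W_j C_i W_j^T for every (i, j)
        sigma2 = (np.sum(resid ** 2) + extra[mask].sum()) / mask.sum()
    X, _ = e_step(R, mask, W, sigma2)                  # latent means under the final W
    return W, sigma2, mu, X


# Simulated causes Y driven by a 2-dimensional latent variable (made-up sizes).
rng = np.random.default_rng(1)
n, d, k = 400, 20, 2
Y = rng.normal(size=(n, k)) @ rng.normal(size=(d, k)).T + 0.5 * rng.normal(size=(n, d))

mask = rng.random((n, d)) < 0.5                        # True = kept, False = deleted
W, sigma2, mu, X_hat = fit_ppca_missing(Y, mask, k)

held = ~mask
pred = X_hat @ W.T + mu
mse_model = np.mean((Y[held] - pred[held]) ** 2)
mse_baseline = np.mean((Y[held] - np.broadcast_to(mu, Y.shape)[held]) ** 2)
print(f"held-out MSE, PPCA: {mse_model:.3f}; predicting the mean: {mse_baseline:.3f}")
```

If the latent X could not predict the deleted entries any better than the column means, that would be a sign of a weak first stage.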
If X has good predictive performance, then we have a valid instrument that’s created synthetically from the Y’s (specifically, in the case of PPCA, the residuals are the instruments). Alternatively, because of the probabilistic nature of X, we can use a generalized propensity score to fit a potential outcome model and obtain the causal effect estimates.
Closing Remarks
Does this sound too good to be true? Perhaps. But the synthetic instruments do come with a cost: the bias-variance tradeoff.
The Bayesian model used to create the synthetic instrument has its own estimation uncertainty, which increases the variance while eliminating bias. The authors suggest following Occam’s razor. In the case of PPCA, each additional principal component increases variance, so we want to keep the smallest number of components that still yields “good” predictive performance.
While this article uses PPCA, other latent variable models will generally work as long as they can handle missing values, and sometimes they are the better choice. For instance, PPCA is inappropriate for count data because it assumes Gaussian errors.
Translated from: https://towardsdatascience.com/synthetic-instrumental-variables-968b12f68772