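# Load the SchoolingReturns data shipped with the ivreg package
data("SchoolingReturns", package = "ivreg")
# Keep the first eight columns
my_data <- SchoolingReturns[, 1:8]
# Inspect the structure (produces the overview below)
dplyr::glimpse(my_data)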
Rows: 3,010
Columns: 8
$ wage        <dbl> 548, 481, 721, 250, 729, 500, 565, 608…
$ education   <dbl> 7, 12, 12, 11, 12, 12, 18, 14, 12, 12,…
$ experience  <dbl> 16, 9, 16, 10, 16, 8, 9, 9, 10, 11, 13…
$ ethnicity   <fct> afam, other, other, other, other, othe…
$ smsa        <fct> yes, yes, yes, yes, yes, yes, yes, yes…
$ south       <fct> no, no, no, no, no, no, no, no, no, no…
$ age         <dbl> 29, 27, 34, 27, 34, 26, 33, 29, 28, 29…
$ nearcollege <fct> no, no, no, yes, yes, yes, yes, yes, y…
> summary(lm(formula = wage ~ education, data = my_data))
Call:
lm(formula = wage ~ education, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max
-576.09 -173.36  -34.12  127.82 1686.25

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  183.949     23.104   7.962 2.38e-15 ***
education     29.655      1.708  17.368  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 250.7 on 3008 degrees of freedom
Multiple R-squared:  0.09114,  Adjusted R-squared:  0.09084
F-statistic: 301.6 on 1 and 3008 DF,  p-value: < 2.2e-16
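Two-stage least squares (2SLS) estimation proceeds in two stages. The first stage decomposes the variation in the independent variable by regressing the exposure on the instrumental variable. The second stage then regresses the outcome variable Y on the predicted values of the exposure (predicted value, P).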
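The two stages can be written out as a pair of regression equations. This is a sketch for the simplest case of one exposure and one instrument with no covariates. The symbol Z for the instrument, the error terms u and ε, and the coefficient names are notation assumed here, not taken from the text. P denotes the predicted exposure, as above.

```latex
% Stage 1: regress the exposure X on the instrument Z
X = \alpha_0 + \alpha_1 Z + u,
\qquad
P = \hat{\alpha}_0 + \hat{\alpha}_1 Z

% Stage 2: regress the outcome Y on the predicted exposure P
Y = \beta_0 + \beta_1 P + \varepsilon
```

Under this notation, the estimate of \beta_1 from the second stage is the 2SLS estimate of the effect of X on Y.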
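For example, the data contain a variable indicating whether a person was close to a college (nearcollege). Proximity to a college is associated not only with educational attainment (years of education, the exposure X) but also with later income (wage, the outcome Y). Proximity to a college is therefore a potential instrumental variable. For it to be a valid instrument, its association with wage should arise only through education.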
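With nearcollege as the candidate instrument, the two stages can be sketched by hand using lm() and then compared with ivreg() from the ivreg package. This is a minimal sketch that assumes the ivreg package is installed and uses no additional covariates. The object names stage1, stage2, iv_fit, and education_hat are hypothetical names chosen for illustration.

```r
# Load the data (assumes the ivreg package is installed)
data("SchoolingReturns", package = "ivreg")
my_data <- SchoolingReturns[, 1:8]

# Stage 1: regress the exposure (education) on the instrument (nearcollege)
stage1 <- lm(education ~ nearcollege, data = my_data)

# Predicted exposure: the part of education explained by the instrument
my_data$education_hat <- fitted(stage1)

# Stage 2: regress the outcome (wage) on the predicted exposure
stage2 <- lm(wage ~ education_hat, data = my_data)

# Same model in one call: outcome ~ exposure | instrument
iv_fit <- ivreg::ivreg(wage ~ education | nearcollege, data = my_data)

# The two point estimates of the education effect should agree
coef(stage2)["education_hat"]
coef(iv_fit)["education"]
```

The manual second stage reproduces the 2SLS point estimate, but the standard errors reported by its lm() fit ignore the uncertainty from the first stage. For inference, summary(iv_fit) is probably the safer choice.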