5-Two-Stage Least Squares

Suppose we want to study the effect of education level (years of schooling, the exposure X) on future income (wage, the outcome Y).

Here I use a dataset from the ivreg package:

data("SchoolingReturns", package = "ivreg")
my_data <- SchoolingReturns[, 1:8]

Rows: 3,010
Columns: 8
$ wage        <dbl> 548, 481, 721, 250, 729, 500, 565, 608…
$ education   <dbl> 7, 12, 12, 11, 12, 12, 18, 14, 12, 12,…
$ experience  <dbl> 16, 9, 16, 10, 16, 8, 9, 9, 10, 11, 13…
$ ethnicity   <fct> afam, other, other, other, other, othe…
$ smsa        <fct> yes, yes, yes, yes, yes, yes, yes, yes…
$ south       <fct> no, no, no, no, no, no, no, no, no, no…
$ age         <dbl> 29, 27, 34, 27, 34, 26, 33, 29, 28, 29…
$ nearcollege <fct> no, no, no, yes, yes, yes, yes, yes, y…
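As a small aside, selecting columns by position with `1:8` depends on the column order of the dataset. The sketch below is one possible alternative, not part of the original workflow: it selects the same eight variables by name and checks the result with base R. The variable `keep_cols` is a name I made up for illustration, and the code assumes the ivreg package is installed.

```r
# Load the dataset shipped with the ivreg package (assumed to be installed)
data("SchoolingReturns", package = "ivreg")

# Hypothetical helper vector: the eight columns shown in the listing above
keep_cols <- c("wage", "education", "experience", "ethnicity",
               "smsa", "south", "age", "nearcollege")

# Select by name instead of by position
my_data <- SchoolingReturns[, keep_cols]

# Inspect dimensions and column types with base R
dim(my_data)
str(my_data)
```

Selecting by name yields the same data frame as `SchoolingReturns[, 1:8]` as long as the column order in the package stays as it is, but it fails loudly if a column is ever renamed or moved.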

I only touch on the code briefly; the focus here is on explaining the underlying principles.

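A simple linear regression of wage on education first shows that education level is indeed significantly associated with future income (p-value: < 2.2e-16).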

> summary(lm(formula = wage ~ education, data = my_data))

Call:
lm(formula = wage ~ education, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-576.09 -173.36  -34.12  127.82 1686.25 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  183.949     23.104   7.962 2.38e-15 ***
education     29.655      1.708  17.368  < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 250.7 on 3008 degrees of freedom
Multiple R-squared:  0.09114,	Adjusted R-squared:  0.09084 
F-statistic: 301.6 on 1 and 3008 DF,  p-value: < 2.2e-16
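The numbers in the printed summary can also be pulled out of the fitted object, which is handy when they need to be compared with other estimates later. The following is a minimal sketch using base R only (plus the ivreg package for the data); the object names `fit`, `fit_summary`, `slope`, `slope_p`, `f_p` and `r2` are my own choices for illustration.

```r
# Load the data and fit the same model as above
data("SchoolingReturns", package = "ivreg")
my_data <- SchoolingReturns[, 1:8]
fit <- lm(wage ~ education, data = my_data)
fit_summary <- summary(fit)

# Estimated change in wage per additional year of education
# (the summary above reports 29.655)
slope <- unname(coef(fit)["education"])

# p-value of the t-test on the education coefficient,
# taken from the coefficient table of the summary
slope_p <- coef(fit_summary)["education", "Pr(>|t|)"]

# p-value of the overall F-test, recomputed from the stored F-statistic
f <- fit_summary$fstatistic
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Share of the variance in wage explained by education
# (the summary above reports 0.09114)
r2 <- fit_summary$r.squared

c(slope = slope, slope_p = slope_p, f_p = f_p, r2 = r2)
```

With a single predictor, the t-test on the slope and the overall F-test are equivalent, so `slope_p` and `f_p` should agree up to numerical precision.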