【发布时间】:2020-07-29 15:20:49
【问题描述】:
我正在尝试在 Stata 中重现 R 的结果(请注意,以下数据是虚构的,仅作为示例)。然而,出于某种原因,Stata 处理某些问题的方式似乎与 R 不同。它会选择不同的虚拟变量来排除多重共线性。
我已经发布了一个相关问题,处理这些国家年份虚拟变量被删除的统计影响here。
在下面的示例中,R 踢出 2,而 Stata 踢出 3,导致不同的结果。例如,检查vote 和vote_won 的系数和p 值。
本质上,我想知道的只是如何与 R 或 Stata 进行通信,以排除哪些变量,以便它们都执行相同的操作。
数据
数据如下:
library(data.table)
library(dplyr)
library(foreign)
library(censReg)
library(wooldridge)
data('mroz')
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
DT <- data.table( country = rep(sample(country, length(mroz), replace = T), each = n),
year = c(replicate(length(mroz), sample(year, n))))
x <- DT
DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT) ; DT <- rbind(DT, DT); DT <- rbind(DT, x)
mroz <- mroz[-c(749:753),]
DT <- cbind(mroz, DT)
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1),
votewon = ifelse(vote==1, sample(c(0,1),1),0))
rm(mroz,x, country, year)
R 中的 lm 回归
summary(lm(educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year), data=DT))
Call:
lm(formula = educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year),
data = DT)
Residuals:
Min 1Q Median 3Q Max
-7.450 -0.805 -0.268 0.954 5.332
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.170064 0.418578 26.69 < 0.0000000000000002 ***
exper 0.103880 0.029912 3.47 0.00055 ***
I(exper^2) -0.002965 0.000966 -3.07 0.00222 **
vote 0.576865 0.504540 1.14 0.25327
votewon 0.622522 0.636241 0.98 0.32818
countryA:as.factor(year)2005 -0.196348 0.503245 -0.39 0.69653
countryB:as.factor(year)2005 -0.530681 0.616653 -0.86 0.38975
countryC:as.factor(year)2005 0.650166 0.552019 1.18 0.23926
countryD:as.factor(year)2005 -0.515195 0.638060 -0.81 0.41968
countryE:as.factor(year)2005 0.731681 0.502807 1.46 0.14605
countryG:as.factor(year)2005 0.213345 0.674642 0.32 0.75192
countryH:as.factor(year)2005 -0.811374 0.637254 -1.27 0.20334
countryI:as.factor(year)2005 0.584787 0.503606 1.16 0.24594
countryJ:as.factor(year)2005 0.554397 0.674789 0.82 0.41158
countryA:as.factor(year)2010 0.388603 0.503358 0.77 0.44035
countryB:as.factor(year)2010 -0.727834 0.617210 -1.18 0.23869
countryC:as.factor(year)2010 -0.308601 0.504041 -0.61 0.54056
countryD:as.factor(year)2010 0.785603 0.503165 1.56 0.11888
countryE:as.factor(year)2010 0.280305 0.452293 0.62 0.53562
countryG:as.factor(year)2010 0.672074 0.674721 1.00 0.31954
countryH:as.factor(year)2010 NA NA NA NA
countryI:as.factor(year)2010 NA NA NA NA
countryJ:as.factor(year)2010 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.3 on 728 degrees of freedom
Multiple R-squared: 0.037, Adjusted R-squared: 0.0119
F-statistic: 1.47 on 19 and 728 DF, p-value: 0.0882
Stata 中的相同回归
write.dta(DT, "C:/Users/.../mroz_adapted.dta")
encode country, gen(n_country)
reg educ c.exper c.exper#c.exper vote votewon n_country#i.year
note: 9.n_country#2010.year omitted because of collinearity
note: 10.n_country#2010.year omitted because of collinearity
Source | SS df MS Number of obs = 748
-------------+---------------------------------- F(21, 726) = 1.80
Model | 192.989406 21 9.18997171 Prob > F = 0.0154
Residual | 3705.47583 726 5.1039612 R-squared = 0.0495
-------------+---------------------------------- Adj R-squared = 0.0220
Total | 3898.46524 747 5.21882897 Root MSE = 2.2592
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .1109858 .0297829 3.73 0.000 .052515 .1694567
|
c.exper#c.exper | -.0031891 .000963 -3.31 0.001 -.0050796 -.0012986
|
vote | .0697273 .4477115 0.16 0.876 -.8092365 .9486911
votewon | -.0147825 .6329659 -0.02 0.981 -1.257445 1.227879
|
n_country#year |
A#2010 | .0858634 .4475956 0.19 0.848 -.7928728 .9645997
B#2005 | -.4950677 .5003744 -0.99 0.323 -1.477421 .4872858
B#2010 | .0951657 .5010335 0.19 0.849 -.8884818 1.078813
C#2005 | -.5162827 .447755 -1.15 0.249 -1.395332 .3627664
C#2010 | -.0151834 .4478624 -0.03 0.973 -.8944434 .8640767
D#2005 | .3664596 .5008503 0.73 0.465 -.6168283 1.349747
D#2010 | .5119858 .500727 1.02 0.307 -.4710599 1.495031
E#2005 | .5837942 .6717616 0.87 0.385 -.7350329 1.902621
E#2010 | .185601 .5010855 0.37 0.711 -.7981486 1.169351
F#2005 | .5987978 .6333009 0.95 0.345 -.6445219 1.842117
F#2010 | .4853639 .7763936 0.63 0.532 -1.038881 2.009608
G#2005 | -.3341302 .6328998 -0.53 0.598 -1.576663 .9084021
G#2010 | .2873193 .6334566 0.45 0.650 -.956306 1.530945
H#2005 | -.4365233 .4195984 -1.04 0.299 -1.260294 .3872479
H#2010 | -.1683725 .6134262 -0.27 0.784 -1.372673 1.035928
I#2005 | -.39264 .7755549 -0.51 0.613 -1.915238 1.129958
I#2010 | 0 (omitted)
J#2005 | 1.036108 .4476018 2.31 0.021 .1573591 1.914856
J#2010 | 0 (omitted)
|
_cons | 11.58369 .350721 33.03 0.000 10.89514 12.27224
---------------------------------------------------------------------------------
【问题讨论】:
-
答案是你不能,这个问题已经在这里问过很多次了。
-
@rawr 谢谢你的评论,但我不认为你是正确的。假设不是完美的多重共线性。其原因是无法在存在完美多重共线性的情况下对其进行估计。然而,在移除导致完全多重共线性的变量后,该假设在技术上不再违反。我并不是说这没有任何后果(这也是我在交叉验证/stackexchange 上发布问题的原因)。在任何情况下,即使是高度的多重共线性也可以很好,并且不违反任何假设。
标签: r regression stata