序言:
我的感觉是,这里的部分问题是我们没有一个好的示例数据集可以解决。
我的感觉是,我们可以通过创建示例数据集来构建解决方案讨论来解决这个问题。为此,一个有用的包是Wakefield 包,它允许创建模拟数据集。
例如,我们可能会创建一个包含 2000 人的数据集,其中缺少一些年龄、性别、就业状况、教育数据和婚姻状况信息。
插补
核心问题是我们能否从数据集中的其他数据中推断年龄或性别?
例如,如果我们不知道某人的年龄,我们可以根据他们的婚姻状况、就业类型和/或他们的教育水平来推断吗?在非常简单的层面上,我们可以简单地搜索年龄为 NA 的条目,然后查看婚姻状况。如果婚姻状况是“已婚”,那么我们推测我们的数据集是针对美国人的,并查看平均结婚年龄并替换为已婚人士的估计年龄。
我们可以对此进行扩展,并通过考虑更多变量来使我们的估计更加准确。例如,我们可能会同时查看婚姻状况、教育水平和就业状况,以进一步改进我们的年龄估计。如果一个人已婚,拥有博士学位并退休,我们会将年龄推高。如果一个人是单身,一个学生我们把年龄推低。除此之外,我们可以查看数据集中的年龄分布,以估算有关缺失值的数据。
生成示例数据集。
# packages
requiredPackages <- c("wakefield", "dplyr", "BaylorEdPsych", "mice", "MCMCpack")
ipak <- function(pkg) {
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg)) {
install.packages(new.pkg, dependencies = TRUE)
}
sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
# generate some data for Males with a 8% missing value for
# age
set.seed(10)
f.df <- r_data_frame(n = 1000, age,
gender(x = c("M", "F"), prob = c(0, 1), name = "Gender"),
employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
"Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
"Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
prob = c(0.013, 0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072, 0.019, 0.012), name = "Education"),
marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.05)
# str(f.df)
summary(f.df)
set.seed(20)
# generate some data for Females with a 5% missing value for
# age
m.df <- r_data_frame(n = 1000, age,
gender(x = c("M", "F"), prob = c(1, 0), name = "Gender"),
employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
"Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
"Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
prob = c(0.013,0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072,0.019, 0.012), name = "Education"),
marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.03)
summary(m.df)
all.df = rbind.data.frame(m.df, f.df)
summary(all.df)
数据汇总
> summary(all.df)
Age Gender Employment Education Marital
Min. :18.00 M:1000 Full Time :1142 Regular High School Diploma :459 Married :394
1st Qu.:35.00 F:1000 Part Time : 207 Bachelor's Degree :356 Divorced :378
Median :54.00 Unemployed: 193 Some College, 1 or More Years, No Degree:284 Widowed :411
Mean :53.76 Retired : 182 9th Grade to 12th Grade, No Diploma :156 Separated :379
3rd Qu.:72.00 Student : 196 Associate's Degree :145 Never Married:358
Max. :89.00 NA's : 80 (Other) :520 NA's : 80
NA's :80 NA's : 80
>
数据是完全随机丢失还是不随机丢失?
# Test for MCAR - Missing at Completely at Random...
test_mcar <- LittleMCAR(all.df)
print(test_mcar$amount.missing)
print(test_mcar$p.value)
控制台输出
> # Test for MCAR - Missing at Completely at Random...
> test_mcar <- LittleMCAR(all.df)
this could take a while> print(test_mcar$amount.missing)
Age Gender Employment Education Marital
Number Missing 80.00 0 80.00 80.00 80.00
Percent Missing 0.04 0 0.04 0.04 0.04
> print(test_mcar$p.value)
[1] 0.02661428
数据插补
好的,我们先来看一下缺失值的分布。我们可以运行mice::md.pattern() 函数,以显示缺失值在数据框中其他列上的分布。 md.pattern() 函数输出对于建议哪些变量可能是用于填补缺失值的良好候选者很有用:
> md.pattern(all.df)
Gender Age Employment Education Marital
1696 1 1 1 1 1 0
73 1 1 1 1 0 1
73 1 1 1 0 1 1
2 1 1 1 0 0 2
71 1 1 0 1 1 1
3 1 1 0 1 0 2
2 1 1 0 0 1 2
71 1 0 1 1 1 1
2 1 0 1 1 0 2
3 1 0 1 0 1 2
4 1 0 0 1 1 2
0 80 80 80 80 320
好的,我们现在可以开始估算缺失值了:
imp <- mice(all.df, m = 5, maxit = 50, seed = 1234, printFlag = FALSE)
- m=5 参数指定您最终得到变量的五个可能的插补
- maxit=50 参数指定算法在收敛到一个解之前最多有 50 次迭代,并且可以向上或向下调整到所需的精度
mice() 函数可能需要一段时间,具体取决于我们指定的迭代次数。在这种情况下,完成后,我们可以使用 head() 函数看到 Age 的一些估算值:
head(imp$imp$Age)
1 2 3 4 5
7 28 49 37 70 89
33 55 54 52 88 24
56 89 83 68 71 61
84 43 43 24 30 31
96 28 64 89 41 50
120 47 34 36 22 77
要真正完成插补,我们必须运行complete() 函数并将结果分配给一个新的数据帧。此版本的complete() 函数将通过“long”参数收集指定数据帧中的所有插补:
all_imputed_df <- complete(imp, "long", include = TRUE)
table(all_imputed_df$.imp, is.na(all_imputed_df$Age))
控制台:
> all_imputed_df <- complete(imp, "long", include = TRUE)
> table(all_imputed_df$.imp, is.na(all_imputed_df$Age))
FALSE TRUE
0 1920 80
1 2000 0
2 2000 0
3 2000 0
4 2000 0
5 2000 0
现在我们有一个包含 12000 个年龄值的数据集,涵盖 5 个年龄输入值。
让我们尝试使用插补 #3 进行回归。
首先,提取 impute #3
impute.3 <- subset(all_imputed_df,.imp=='3')
summary(impute.3)
控制台:
> impute.3 <- subset(all_imputed_df, .imp == "3")
> summary(impute.3)
.imp .id Age Gender Employment
Min. :3 Min. : 1.0 Min. :18.00 M:1000 Full Time :1192
1st Qu.:3 1st Qu.: 500.8 1st Qu.:35.00 F:1000 Part Time : 211
Median :3 Median :1000.5 Median :54.00 Unemployed: 202
Mean :3 Mean :1000.5 Mean :53.89 Retired : 191
3rd Qu.:3 3rd Qu.:1500.2 3rd Qu.:72.00 Student : 204
Max. :3 Max. :2000.0 Max. :89.00
Education Marital
Regular High School Diploma :478 Married :416
Bachelor's Degree :376 Divorced :390
Some College, 1 or More Years, No Degree:295 Widowed :425
9th Grade to 12th Grade, No Diploma :168 Separated :393
Associate's Degree :150 Never Married:376
Master's Degree :141
(Other) :392
现在我们可以运行线性回归了:
> lm(Age ~ Education + Gender + Employment + Marital, data = impute.3)
Call:
lm(formula = Age ~ Education + Gender + Employment + Marital,
data = impute.3)
Coefficients:
(Intercept) EducationNursery School to 8th Grade
51.6733 1.4100
Education9th Grade to 12th Grade, No Diploma EducationRegular High School Diploma
1.3675 0.7611
EducationGED or Alternative Credential EducationSome College, Less than 1 Year
1.0365 -2.6069
EducationSome College, 1 or More Years, No Degree EducationAssociate's Degree
0.3563 0.9506
EducationBachelor's Degree EducationMaster's Degree
1.2505 -1.6372
EducationProfessional School Degree EducationDoctorate Degree
1.1774 0.4936
GenderF EmploymentPart Time
-0.3190 1.1316
EmploymentUnemployed EmploymentRetired
3.1622 -0.6855
EmploymentStudent MaritalDivorced
3.0850 0.2934
MaritalWidowed MaritalSeparated
2.3162 1.6833
MaritalNever Married
1.6169
MCMCRegress
library(MCMCpack) # b0 = prior mean, B0 = prior precision = 1/variance
fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
summary(fitBayes)
控制台输出
> fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
> summary(fitBayes)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 48.67377 2.5337 0.025337 0.025337
EducationNursery School to 8th Grade 3.77088 3.0514 0.030514 0.030514
Education9th Grade to 12th Grade, No Diploma 3.81009 2.7794 0.027794 0.027794
EducationRegular High School Diploma 3.24531 2.4933 0.024933 0.025412
EducationGED or Alternative Credential 3.38733 3.2155 0.032155 0.032155
EducationSome College, Less than 1 Year -0.08419 2.9104 0.029104 0.029577
EducationSome College, 1 or More Years, No Degree 2.82889 2.6092 0.026092 0.026092
EducationAssociate's Degree 3.32932 2.8410 0.028410 0.028410
EducationBachelor's Degree 3.72272 2.5228 0.025228 0.025659
EducationMaster's Degree 0.87738 2.8611 0.028611 0.028611
EducationProfessional School Degree 3.27542 4.0199 0.040199 0.040199
EducationDoctorate Degree 2.43794 4.5996 0.045996 0.045996
GenderF -0.11321 0.9327 0.009327 0.009327
EmploymentPart Time 1.25556 1.5756 0.015756 0.016170
EmploymentUnemployed 3.27395 1.6213 0.016213 0.015708
EmploymentRetired -0.52614 1.6394 0.016394 0.016394
EmploymentStudent 3.17027 1.6058 0.016058 0.016889
MaritalDivorced 0.72379 1.4715 0.014715 0.014715
MaritalWidowed 2.73130 1.4394 0.014394 0.014706
MaritalSeparated 2.10423 1.4608 0.014608 0.014608
MaritalNever Married 2.00781 1.4960 0.014960 0.014960
sigma2 448.01488 14.0715 0.140715 0.140715
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 43.75477 46.9556 48.6619 50.3967 53.609
EducationNursery School to 8th Grade -2.19290 1.7079 3.7701 5.8216 9.718
Education9th Grade to 12th Grade, No Diploma -1.59323 1.9586 3.8326 5.6676 9.349
EducationRegular High School Diploma -1.61001 1.5641 3.2474 4.9296 8.155
EducationGED or Alternative Credential -2.88523 1.2095 3.4173 5.5405 9.691
EducationSome College, Less than 1 Year -5.75364 -2.0617 -0.1009 1.8986 5.614
EducationSome College, 1 or More Years, No Degree -2.28754 1.0853 2.8608 4.5718 7.895
EducationAssociate's Degree -2.27611 1.4311 3.3285 5.2330 8.978
EducationBachelor's Degree -1.21780 2.0258 3.7275 5.4203 8.655
EducationMaster's Degree -4.61270 -1.0872 0.8601 2.8484 6.456
EducationProfessional School Degree -4.63027 0.5900 3.2767 5.9475 11.059
EducationDoctorate Degree -6.47767 -0.6371 2.4553 5.4188 11.705
GenderF -1.95673 -0.7298 -0.1067 0.4903 1.727
EmploymentPart Time -1.82784 0.1849 1.2597 2.3160 4.354
EmploymentUnemployed 0.09335 2.1988 3.2674 4.3557 6.433
EmploymentRetired -3.80162 -1.6316 -0.5147 0.5953 2.706
EmploymentStudent 0.03387 2.0713 3.1502 4.2227 6.342
MaritalDivorced -2.15073 -0.2732 0.7249 1.7266 3.602
MaritalWidowed -0.13488 1.7817 2.7367 3.6961 5.567
MaritalSeparated -0.76396 1.1177 2.1118 3.0700 5.001
MaritalNever Married -0.92230 0.9950 1.9976 3.0248 4.898
sigma2 420.98019 438.4621 447.7222 457.2730 476.481
希望上述观察指向正确的方向。
引文: