感谢您提供您的数据集(我希望该链接永远有效,以便每个人都可以访问)。我将其读入数据框train。
使用How to debug "contrasts can be applied only to factors with 2 or more levels" error?提供的debug_contr_error、debug_contr_error2和NA_preproc辅助函数,我们可以轻松分析问题。
info <- debug_contr_error2(log_SalePrice ~ ., train)
## the data frame that is actually used by `lm`
dat <- info$mf
## number of cases in your dataset
nrow(train)
#[1] 1460
## number of complete cases used by `lm`
nrow(dat)
#[1] 1112
## number of levels for all factor variables in `dat`
info$nlevels
# MSZoning Street Alley LotShape LandContour
# 4 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 5
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
如您所见,Utilities 是此处的违规变量,因为它只有 1 级。
由于train 中有很多字符/因子变量,我想知道您是否有NA 用于它们。如果我们添加NA 作为有效级别,我们可能会得到更完整的案例。
new_train <- NA_preproc(train)
new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)
new_dat <- new_info$mf
nrow(new_dat)
#[1] 1121
new_info$nlevels
# MSZoning Street Alley LotShape LandContour
# 5 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 6
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
我们确实得到了更多完整的案例,但Utilities 仍然有一个级别。这意味着大多数不完整的情况实际上是由您的数值变量中的NA 引起的,我们无能为力(除非您有统计上有效的方法来估算这些缺失值)。
由于您只有一个单级因子变量,因此与How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? 中给出的方法相同。
new_dat$Utilities <- 1
simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
模型现在运行成功。但是,它是rank-deficient。您可能想要做点什么来解决它,但保持原样也可以。
b <- coef(simplelm)
length(b)
#[1] 301
sum(is.na(b))
#[1] 9
simplelm$rank
#[1] 292