【Question title】: Creating a dummy variable in R using loan default data
【Posted】: 2017-03-23 04:15:28
【Question description】:

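I am working with the Lending Club dataset and trying to create a dummy variable for the target variable loan_status. My goal is for Charged Off to be 0, Fully Paid to be 1, and everything else to be NA. The variable loan_status takes several values: Current, Fully Paid, Late, In Grace Period, Default, Charged Off, and Does not meet the credit policy. I only want to focus on Charged Off and Fully Paid. I have tried many times without success. For example: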

Create a new target variable

loan_status1 <- if(loan_status== 'Fully Paid'){'Yes'} else if
 (loan_status== 'Charged Off') {'No'} else 'NA'

I also tried this:

if(loan_status=='Fully Paid'){
   0} else if (loan_status=='Charged Off') {
   1} else (loan_status=='NA')

Any guidance would be appreciated.

【Comments on the question】:

  • The simplest way is to use the vectorized ifelse. Try loan_status1 <- ifelse(loan_status == 'Fully Paid', 1, ifelse(loan_status == 'Charged Off', 0, NA))
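
As a minimal sketch of the suggestion above: `if` expects a single TRUE or FALSE, so applying it to a whole column does not work element by element (since R 4.2.0 a condition of length greater than one is an error). `ifelse()` tests every element instead. The sample values below are made up for illustration and only stand in for the real Lending Club column.

```r
# Hypothetical sample values standing in for the real loan_status column
loan_status <- c("Fully Paid", "Charged Off", "Current", "Late")

# ifelse() is vectorized: the test is evaluated for every element.
# Coding follows the question: Fully Paid = 1, Charged Off = 0, otherwise NA.
loan_status1 <- ifelse(loan_status == "Fully Paid", 1L,
                ifelse(loan_status == "Charged Off", 0L, NA_integer_))

print(loan_status1)
```

Using `NA_integer_` rather than the string 'NA' keeps the result an integer vector with real missing values.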

Tags: r


【Solution 1】:

Basically, you can try running a for loop over the data with the commands below. Do not set NA as a string ('NA'); it is better to use the actual NA value.

loan_status <- sample(rep(c('Fully Paid', 'Charged Off', "abc"), 100), 100, replace = FALSE)

for (i in seq_along(loan_status)){
  if (loan_status[i] == 'Fully Paid'){
    loan_status[i] <- as.integer(0)
  } else if (loan_status[i] == 'Charged Off'){
    loan_status[i] <- as.integer(1)
  } else {
    loan_status[i] == NA
  }
}

Alternatively, you may want to do this more simply with the factor() function, for example:

factor(loan_status, levels = c('Fully Paid', 'Charged Off'), labels = c(0, 1))

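【Comments】:

  • I would upvote your answer for the factor approach. But the for loop is a no-go. It is a clumsy and slow reimplementation of Ronak Shah's vectorized ifelse approach: loan_status1 <- ifelse(loan_status == 'Fully Paid', 1, ifelse(loan_status == 'Charged Off', 0, NA))
  • Oops, the for loop does not work at all. It returns "0" and "1" as character and leaves "abc" unchanged. Reason: loan_status[i] == NA should be loan_status[i] <- NA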
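
One possible corrected version of the loop, as a sketch only: it goes a step beyond the one-character fix in the comment by writing into a separate, pre-allocated integer vector, so the results are not coerced to character. The name `target`, the seed, and the sample size are assumptions made for this illustration; the 0/1 coding follows Solution 1, not the question.

```r
set.seed(1)  # assumption: seed only so the sample is reproducible
loan_status <- sample(c("Fully Paid", "Charged Off", "abc"), 10, replace = TRUE)

# Pre-allocate an integer result filled with NA, so unmatched values stay NA
target <- rep(NA_integer_, length(loan_status))

for (i in seq_along(loan_status)) {
  if (loan_status[i] == "Fully Paid") {
    target[i] <- 0L
  } else if (loan_status[i] == "Charged Off") {
    target[i] <- 1L
  }
}

print(data.frame(loan_status, target))
```

The vectorized ifelse and factor approaches remain shorter and faster. The loop is shown only to make the two bugs visible.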
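【Solution 2】:

The OP asks for a 1:1 replacement of selected values, i.e., only one data field is involved. Besides the nested ifelse approach, this can be done using factors or, for larger data, a join.

The "hard-coded" nested ifelse approach quickly becomes unwieldy if more than two or three values need to be replaced.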
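
To illustrate why a table-driven mapping scales better than nesting, here is a small sketch using a named lookup vector. This is not one of the approaches proposed in this answer, just a base R idiom shown for comparison. The name `lookup` is hypothetical.

```r
loan_status <- c("Fully Paid", "Charged Off", "Something", "Else")

# Each additional value to replace is one more entry, not one more nested ifelse()
lookup <- c("Fully Paid" = 0L, "Charged Off" = 1L)

# Indexing by name returns NA for values that have no entry in lookup
target <- unname(lookup[loan_status])

print(target)
```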

Factor case 1: Yes, No

# create some data
loan_status <- c("Fully Paid", "Charged Off", "Something", "Else")
# do the conversion
factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("Yes", "No"))
#[1] Yes  No   <NA> <NA>
#Levels: Yes No

Or,

as.character(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("Yes", "No")))
#[1] "Yes" "No"  NA    NA  

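if the expected result is character.

Factor case 2: 0L, 1L as integer

If the expected result is of type integer, the factor approach can still be used, but an additional conversion is needed.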

as.integer(as.character(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1"))))
#[1]  0  1 NA NA

Note that the conversion to character is essential here. Otherwise, the internal level numbers of the factor are returned:

as.integer(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1")))
#[1]  1  2 NA NA
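
The difference can be seen side by side in the sketch below. It also shows an equivalent idiom, `as.integer(levels(f))[f]`, which converts the level labels once and then indexes them by the factor. The variable names `f`, `codes`, and `values` are chosen for this illustration.

```r
loan_status <- c("Fully Paid", "Charged Off", "Something", "Else")
f <- factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1"))

# Internal level numbers (position of each level), not the labels
codes <- as.integer(f)

# Labels converted to integer, then picked per element via the factor's codes
values <- as.integer(levels(f))[f]

print(codes)
print(values)
```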

Join

If the data is larger and many items need to be replaced, a data.table join may be an alternative worth considering:

library(data.table)
# create translation table
translation_map <- data.table(
  loan_status = c("Fully Paid", "Charged Off"),
  target = c(0L, 1L))
# create some user data
DT <- data.table(id = LETTERS[1:4],
                 loan_status = c("Fully Paid", "Charged Off", "Something", "Else"))
DT
#   id loan_status
#1:  A  Fully Paid
#2:  B Charged Off
#3:  C   Something
#4:  D        Else

# right join
translation_map[DT, on = "loan_status"]
#   loan_status target id
#1:  Fully Paid      0  A
#2: Charged Off      1  B
#3:   Something     NA  C
#4:        Else     NA  D

By default (nomatch = NA), data.table performs a right join, i.e., all rows of DT are kept.
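
If the goal is to add the target column to DT itself rather than create a new joined table, an update join is one option. The sketch below assumes the same `translation_map` and `DT` as above and uses data.table's `:=` with the `i.` prefix to pull the column from the lookup table. Rows without a match are left as NA.

```r
library(data.table)

translation_map <- data.table(
  loan_status = c("Fully Paid", "Charged Off"),
  target = c(0L, 1L))
DT <- data.table(id = LETTERS[1:4],
                 loan_status = c("Fully Paid", "Charged Off", "Something", "Else"))

# Update join: add target to DT by reference, keeping DT's row and column order
DT[translation_map, on = "loan_status", target := i.target]

print(DT)
```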

【Comments】:
