【Question title】: Convert a factor into binary dummies but not all factors present
【Posted】: 2019-01-21 14:09:54
【Question】:

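I have a number of data frames, each containing a factor that I want to expand into its binary equivalents (one-hot encoding). However, not every possible level is present in every data frame, although I know what all the possible levels are (there are 70 of them). I want to add every possible binary dummy to each data frame.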

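With the code below I can create dummies in each data frame, but not all possible dummies. For example, set1.df has nobody in category "E" or "F", and set2.df has nobody in category "D". What I need are the additional columns set1.dfE and set1.dfF in set1.df, all zero, and the column set2.dfD in set2.df, all zero. I cannot rbind set1.df and set2.df before creating the dummies, because I need to do some processing of each data frame using the binary variables before rbinding. To reiterate, I know in advance which levels can occur in my data, e.g. "A" through "F".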

library(dummies)

person_id <- c(1,2,3,4,5,6,7,8,9,10)
person_cat <- c("A","B","C","A","B","C","D","A","A","A")
set1.df <- data.frame(person_id,person_cat)

person_id <- c(11,12,13,14,15,16,17,18,19,20)
person_cat <- c("A","B","C","A","B","C","E","E","F","A")
set2.df <- data.frame(person_id,person_cat)

dummies1 <- dummy(set1.df[,2])
dummies2 <- dummy(set2.df[,2])

dummies1
dummies2

The expected output is:

> dummies1
      set1.dfA set1.dfB set1.dfC set1.dfD set1.dfE set1.dfF
 [1,]        1        0        0        0        0        0
 [2,]        0        1        0        0        0        0
 [3,]        0        0        1        0        0        0
 [4,]        1        0        0        0        0        0
 [5,]        0        1        0        0        0        0
 [6,]        0        0        1        0        0        0
 [7,]        0        0        0        1        0        0
 [8,]        1        0        0        0        0        0
 [9,]        1        0        0        0        0        0
[10,]        1        0        0        0        0        0
> dummies2
      set2.dfA set2.dfB set2.dfC set2.dfD set2.dfE set2.dfF
 [1,]        1        0        0        0        0        0
 [2,]        0        1        0        0        0        0
 [3,]        0        0        1        0        0        0
 [4,]        1        0        0        0        0        0
 [5,]        0        1        0        0        0        0
 [6,]        0        0        1        0        0        0
 [7,]        0        0        0        0        1        0
 [8,]        0        0        0        0        1        0
 [9,]        0        0        0        0        0        1
[10,]        1        0        0        0        0        0

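【Comments】:

  • Define the variable with factor() and use the levels = argument to add all the necessary levels.
  • "I can not rbind set1.df and set2.df before creating the dummies because I need to do some processing of each data frame using the binary variables before rbinding": I believe this claim is open to challenge. For example, you may want to get familiar with dplyr::group_by used together with dplyr::mutate.
  • I added lines of the form person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F")). Although the data frames now have 6 levels, dummy ignores the unused levels. Adding the drop=FALSE option to dummy fixes this and creates the necessary variables. Thanks.
  • You can write an answer to your own question and accept it.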
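The first comment's suggestion can be illustrated with a minimal sketch. It assumes the known levels are "A" to "F" as in the question; the name `all_levels` is introduced here for illustration only.

```r
# All levels known in advance (assumed to be "A" to "F", as in the question)
all_levels <- c("A", "B", "C", "D", "E", "F")

person_cat <- c("A", "B", "C", "A", "B", "C", "D", "A", "A", "A")

# Declaring the levels explicitly keeps "E" and "F" even though they never occur
person_cat <- factor(person_cat, levels = all_levels)

levels(person_cat)  # lists all six levels
table(person_cat)   # unused levels are reported with a count of 0
```

Declaring the levels only records them on the factor. Whether an encoding function then emits columns for the unused levels depends on that function, which is why the asker also needed drop=FALSE.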

Tags: r one-hot-encoding


【Solution 1】:

Here is one solution:

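# All possible levels, known in advance
all_levels <- c('A', 'B', 'C', 'D', 'E', 'F')

# One row per person, one column per possible level
data <- data.frame(matrix(NA, nrow = nrow(set1.df), ncol = length(all_levels)))
names(data) <- all_levels
for (i in seq_len(nrow(data))) {
  for (j in seq_along(data)) {
    # 1 if this person's category matches the column's level, otherwise 0
    data[i, j] <- ifelse(set1.df[i, 2] == names(data)[j], 1, 0)
  }
}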

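Create an empty data frame with as many rows as there are IDs and as many columns as there are possible levels. Then use a loop to compare person_cat against each column. A cell is 1 only when person_cat equals the column name (the category level); otherwise it is 0.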
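The same comparison can be written without explicit loops. The following is a minimal sketch rather than the answerer's code; `all_levels` and the "set1.df" column prefix are illustrative choices made to mirror the expected output in the question.

```r
# All possible levels, known in advance (assumed "A" to "F")
all_levels <- c("A", "B", "C", "D", "E", "F")

set1.df <- data.frame(
  person_id  = 1:10,
  person_cat = c("A", "B", "C", "A", "B", "C", "D", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# One column per possible level: 1 where person_cat equals that level, else 0
dummies1 <- sapply(all_levels, function(lv) as.integer(set1.df$person_cat == lv))

# Prefix the column names so they resemble the names produced by dummy()
colnames(dummies1) <- paste0("set1.df", all_levels)

dummies1
```

Because the columns are generated from `all_levels` rather than from the values present in the data, levels that never occur still get a column of zeros.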

【Comments】:

    【Solution 2】:
    library(dummies)
    
    person_id <- c(1,2,3,4,5,6,7,8,9,10)
    person_cat <- c("A","B","C","A","B","C","D","A","A","A")
    person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
    set1.df <- data.frame(person_id,person_cat)
    
    person_id <- c(11,12,13,14,15,16,17,18,19,20)
    person_cat <- c("A","B","C","A","B","C","E","E","F","A")
    person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
    set2.df <- data.frame(person_id,person_cat)
    
    dummies1 <- dummy(set1.df[,2],drop=FALSE)
    dummies2 <- dummy(set2.df[,2],drop=FALSE)
    
    dummies1
    dummies2
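For readers who prefer to avoid an extra package, a similar result can be sketched in base R with `model.matrix`. This is an alternative technique, not what the answer above uses; `all_levels` is an illustrative name, and the resulting column names are prefixed with the variable name (`person_cat`) rather than with the data frame name.

```r
# All possible levels, known in advance (assumed "A" to "F")
all_levels <- c("A", "B", "C", "D", "E", "F")

set2.df <- data.frame(
  person_id  = 11:20,
  person_cat = factor(c("A", "B", "C", "A", "B", "C", "E", "E", "F", "A"),
                      levels = all_levels)
)

# "- 1" removes the intercept, so every level gets its own indicator column;
# levels declared on the factor but absent from the data yield all-zero columns
dummies2 <- model.matrix(~ person_cat - 1, data = set2.df)

colnames(dummies2)
dummies2
```

The key step is the same as in the answer above: the factor must be created with the full set of levels before it is encoded.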
    

    【Comments】:
