Data.Table Manipulate Many Columns Conditionally: Answers

【Question title】: Data.Table Manipulate Many Columns Conditionally
【Posted】: 2020-07-08 03:22:10
【Question description】:

data1=data.frame(Year=c(2010,2010,2010,2011,2011,2011,2010,2010,2010,2011,2011,2011),
                Group=c(1,1,1,1,1,1,2,2,2,2,2,2),
                Class=c('A','B','C','A','B','C','A','B','C','A','B','C'),
                A=c(0.73,0.55,0.54,0.49,0.52,0.49,0.26,0.55,0.39,0.34,0.84,0.29),
                B=c(0.12,0.08,0.14,0.21,0.33,0.98,0.33,0.99,0.02,0.59,0.27,0.72),
                C=c(0.43,0.51,0.29,0.6,0.28,0.97,0.78,0.84,0.34,0.82,0.75,0.97))


##>data1
##    Year Group Class    A    B    C
## 1  2010     1     A 0.73 0.12 0.43
## 2  2010     1     B 0.55 0.08 0.51
## 3  2010     1     C 0.54 0.14 0.29
## 4  2011     1     A 0.49 0.21 0.60
## 5  2011     1     B 0.52 0.33 0.28
## 6  2011     1     C 0.49 0.98 0.97
## 7  2010     2     A 0.26 0.33 0.78
## 8  2010     2     B 0.55 0.99 0.84
## 9  2010     2     C 0.39 0.02 0.34
## 10 2011     2     A 0.34 0.59 0.82
## 11 2011     2     B 0.84 0.27 0.75
## 12 2011     2     C 0.29 0.72 0.97

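I have 'data1' and want to build 'data2'. 'data2' will have exactly the same dimensions as 'data1', but with the following rules applied:

If Class = 'A', then Column 'B' = (1-B)*0.05 and Column 'C' = (1-C)*0.05; after updating Column 'B' and Column 'C', we compute Column 'A' = 1-(B+C).

If Class = 'B', then Column 'A' = (1-A)*0.05 and Column 'C' = (1-C)*0.05; after updating Column 'A' and Column 'C', we compute Column 'B' = 1-(A+C).

If Class = 'C', then Column 'A' = (1-A)*0.05 and Column 'B' = (1-B)*0.05; after updating Column 'A' and Column 'B', we compute Column 'C' = 1-(A+B).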
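As a quick check of the rules above, here is a minimal worked example for the first row of data1 (Class = 'A', A = 0.73, B = 0.12, C = 0.43). The variable names are only illustrative.

```r
# Row 1 of data1: Class = 'A', A = 0.73, B = 0.12, C = 0.43
B_old <- 0.12
C_old <- 0.43

# Class 'A' rule: rescale the other two columns first ...
B_new <- (1 - B_old) * 0.05   # 0.88 * 0.05 = 0.044
C_new <- (1 - C_old) * 0.05   # 0.57 * 0.05 = 0.0285

# ... then derive A from the updated values
A_new <- 1 - (B_new + C_new)  # 1 - 0.0725 = 0.9275

print(c(A = A_new, B = B_new, C = C_new))
```

By construction the three updated values in a row always sum to 1.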

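I am hoping for an efficient data.table solution, because my real datasets are very large and have more than 3 classes.

Here is a slow solution that produces the desired updates.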

library(data.table)
setDT(data1)

data1[, newB := fifelse(Class == 'A', (1-B) * 0.05, NA_real_)]
data1[, newC := fifelse(Class == 'A', (1-C) * 0.05, NA_real_)]
data1[, newA := fifelse(Class == 'A', (1-(newB+newC)), NA_real_)]

data1[, newA := fifelse(Class == 'B', (1-A) * 0.05, newA)]
data1[, newC := fifelse(Class == 'B', (1-C) * 0.05, newC)]
data1[, newB := fifelse(Class == 'B', (1-(newA+newC)), newB)]

data1[, newA := fifelse(Class == 'C', (1-A) * 0.05, newA)]
data1[, newB := fifelse(Class == 'C', (1-B) * 0.05, newB)]
data1[, newC := fifelse(Class == 'C', (1-(newA+newB)), newC)]

【Question comments】:

@akrun a data.table solution would be ideal

Tags: r data.table

【Solution 1】:

Edit: OK, this has been bugging me all day. How about this:

data1[,.(Year, Group,
         A = if("A" == Class){1 - (((1-B) * 0.05) + ((1-C) * 0.05))}else{(1-A) * 0.05},
         B = if("B" == Class){1 - (((1-A) * 0.05) + ((1-C) * 0.05))}else{(1-B) * 0.05},
         C = if("C" == Class){1 - (((1-A) * 0.05) + ((1-B) * 0.05))}else{(1-C) * 0.05}),
         by=Class]
    Class Year Group      A      B      C
 1:     A 2010     1 0.9275 0.0440 0.0285
 2:     A 2011     1 0.9405 0.0395 0.0200
 3:     A 2010     2 0.9555 0.0335 0.0110
 4:     A 2011     2 0.9705 0.0205 0.0090
 5:     B 2010     1 0.0225 0.9530 0.0245
 6:     B 2011     1 0.0240 0.9400 0.0360
 7:     B 2010     2 0.0225 0.9695 0.0080
 8:     B 2011     2 0.0080 0.9795 0.0125
 9:     C 2010     1 0.0230 0.0430 0.9340
10:     C 2011     1 0.0255 0.0010 0.9735
11:     C 2010     2 0.0305 0.0490 0.9205
12:     C 2011     2 0.0355 0.0140 0.9505

It handles 10,000,000 rows in under a second (median of about 0.81 s in the benchmark below).

data1 <- data.table(Year = rep(2011:2020,each=1000000),Group = rep(1:10,times=1000000),Class = LETTERS[1:3], A = runif(1000000,0,1),B = runif(1000000,0,1),C = runif(1000000,0,1))
data1
          Year Group Class            A            B             C
       1: 2011     1     A 0.2890449290 0.6917136966 0.79943333357
       2: 2011     2     B 0.6496694945 0.2168088856 0.61779720359
       3: 2011     3     C 0.8413182027 0.9084385505 0.90381150902
       4: 2011     4     A 0.7272625659 0.4355531749 0.91872303933
       5: 2011     5     B 0.7147752908 0.9534050962 0.75510455621
      ---                                                         
 9999996: 2020     6     C 0.7728334034 0.9656879159 0.03099721554
 9999997: 2020     7     A 0.8534086784 0.2145124320 0.74231260596
 9999998: 2020     8     B 0.4714033590 0.0653402030 0.63881201576
 9999999: 2020     9     C 0.5170788274 0.4878072820 0.53781165020
10000000: 2020    10     A 0.8130705466 0.6612007422 0.16215236858

library(microbenchmark)
microbenchmark(data1[,.(Year, Group,
         A = if("A" == Class){1 - (((1-B) * 0.05) + ((1-C) * 0.05))}else{(1-A) * 0.05},
         B = if("B" == Class){1 - (((1-A) * 0.05) + ((1-C) * 0.05))}else{(1-B) * 0.05},
         C = if("C" == Class){1 - (((1-A) * 0.05) + ((1-B) * 0.05))}else{(1-C) * 0.05}),by=Class])
Unit: milliseconds
        min          lq        mean     median          uq        max neval
 538.850986 638.5327615 895.7115241 808.087257 999.4477005 2146.21263   100
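The scalar `if ("A" == Class)` test in this answer relies on the fact that, when grouping with `by = Class`, the grouping column is a single value inside `j` for each group. A small sketch on made-up toy data (the table `dt` and the column names `class_len`, `rows` and `label` are illustrative only) shows this:

```r
library(data.table)

# Toy data, not from the question
dt <- data.table(Class = c("A", "B", "C", "A", "B", "C"), x = 1:6)

# Inside j, the grouping column has length 1 per group,
# so it can be used directly in a scalar if () test.
res <- dt[, .(class_len = length(Class),
              rows      = .N,
              label     = if ("A" == Class) "own class is A" else "other"),
          by = Class]
print(res)
```

Without `by = Class`, `Class` would be the full column and `if ()` would not be given a single TRUE/FALSE value.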

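【Comments】:

Thanks for the update, it helped me understand better. See my edited approach.
Thank you so much for your time and effort!!
@campell is there a reason to prefer "A" == Class? I ask because I usually see Class == "A".
It should make no difference. I had been experimenting with coercing to factors earlier and thought it might matter.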
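To illustrate the point in the last two comments, this small sketch compares both operand orders on a character vector and on a factor. The vector `Class` here is a made-up example, not the question's data.

```r
Class <- c("A", "B", "C", "A")

# Character vector: operand order does not matter
same_chr <- identical("A" == Class, Class == "A")

# Factor: comparison goes through the factor's labels in both orders
Class_f  <- factor(Class)
same_fct <- identical("A" == Class_f, Class_f == "A")

print(c(character = same_chr, factor = same_fct))
```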

【Solution 2】:

I suggest the following:

# Setting the dataframe as a data.table
data1 <- data.table::setDT(data1)
head(data1)

   Year Group Class    A    B    C
1: 2010     1     A 0.73 0.12 0.43
2: 2010     1     B 0.55 0.08 0.51
3: 2010     1     C 0.54 0.14 0.29
4: 2011     1     A 0.49 0.21 0.60
5: 2011     1     B 0.52 0.33 0.28
6: 2011     1     C 0.49 0.98 0.97

# First I copy this data.table
data2 = data.table::copy(data1)
# I store the variable names that I will change
list_of_var = setdiff(colnames(data1), c("Year", "Group", "Class"))
# In data1 I change by reference all these variable with the
# transformation (1-X)*0.05
data1[, (list_of_var) := lapply(.SD, function(x) (1-x)*0.05),.SDcols = list_of_var]

# Then for each of my variables
for (variable in list_of_var){
   # I store the names of the other variables

  cols <- setdiff(colnames(data1), c(variable, "Year", "Group", "Class"))
  # and apply the transformation conditionally on value of Class
  for (var in cols){
    data2[Class == variable, (var) := data1[Class == variable, var, with = F]]
  }
}

# After doing this I will now apply the 1-B-C transformation for A conditionally
# on Class, and same for each variable
for (variable in list_of_var){
  other_vars = setdiff(list_of_var, variable)
  new_var = apply(data2[Class == variable, ..other_vars], MARGIN = 1, sum)
  data2[Class == variable, (variable) := 1 - new_var]

}
head(data2)

Here is the result:

   Year Group Class      A      B      C
1: 2010     1     A 0.9275 0.0440 0.0285
2: 2010     1     B 0.0225 0.9530 0.0245
3: 2010     1     C 0.0230 0.0430 0.9340
4: 2011     1     A 0.9405 0.0395 0.0200
5: 2011     1     B 0.0240 0.9400 0.0360
6: 2011     1     C 0.0255 0.0010 0.9735
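The same idea can be written more compactly. The following is only a sketch, assuming the value columns are named exactly after the class labels (as in the question); `dt`, `vars` and `res` are illustrative names, and the data are the first three rows of data1.

```r
library(data.table)

# First three rows of data1 from the question
dt <- data.table(
  Class = c("A", "B", "C"),
  A = c(0.73, 0.55, 0.54),
  B = c(0.12, 0.08, 0.14),
  C = c(0.43, 0.51, 0.29)
)
vars <- c("A", "B", "C")  # assumption: one value column per class label

res <- copy(dt)

# Step 1: apply (1 - x) * 0.05 to every value column
res[, (vars) := lapply(.SD, function(x) (1 - x) * 0.05), .SDcols = vars]

# Step 2: for each class, overwrite its own column with
# 1 - (sum of the already-transformed other columns)
for (v in vars) {
  others <- setdiff(vars, v)
  res[Class == v, (v) := 1 - rowSums(.SD), .SDcols = others]
}
print(res)
```

The loop runs once per class rather than once per row, so it should scale with the number of classes; `rowSums()` replaces the row-wise `apply(..., sum)`.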

【Comments】:

【Solution 3】:

This is not actually efficient, but it works; maybe it helps.

data1=data.frame(Year=c(2010,2010,2010,2011,2011,2011,2010,2010,2010,2011,2011,2011),
                Group=c(1,1,1,1,1,1,2,2,2,2,2,2),
                Class=c('A','B','C','A','B','C','A','B','C','A','B','C'),
                A=c(0.73,0.55,0.54,0.49,0.52,0.49,0.26,0.55,0.39,0.34,0.84,0.29),
                B=c(0.12,0.08,0.14,0.21,0.33,0.98,0.33,0.99,0.02,0.59,0.27,0.72),
                C=c(0.43,0.51,0.29,0.6,0.28,0.97,0.78,0.84,0.34,0.82,0.75,0.97))


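let <- toupper(letters)  # same as the built-in LETTERS

data2 <- data1  # untouched copy of the input

# Apply (1 - x) * 0.05 to every value column (columns 4 onward)
data1[4:ncol(data1)] <- (1 - data1[4:ncol(data1)]) * 0.05

for (i in 1:nrow(data1))
{
  # Position of this row's Class letter among the value columns (A = 1, B = 2, ...)
  k <- which(data2[i, 3] == let)
  # Own column = 1 - sum of the already-transformed other columns
  data1[i, k + 3] <- 1 - sum(data1[i, 4:ncol(data1)][-k])
}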
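The row-by-row loop can be avoided with matrix indexing in base R. This is a sketch under the same assumption as the question (value columns named after the class labels); `df`, `vars`, `m`, `own` and `out` are illustrative names, and the data are the first three rows of data1.

```r
# First three rows of data1 from the question
df <- data.frame(
  Class = c("A", "B", "C"),
  A = c(0.73, 0.55, 0.54),
  B = c(0.12, 0.08, 0.14),
  C = c(0.43, 0.51, 0.29)
)
vars <- c("A", "B", "C")  # assumption: one value column per class label

# Transform every value at once
m <- (1 - as.matrix(df[vars])) * 0.05

# (row, column) index of each row's own class column
own <- cbind(seq_len(nrow(df)), match(df$Class, vars))

# Zero the own-class cells so rowSums() covers only the other columns,
# then fill them with 1 - (sum of the others)
m[own] <- 0
m[own] <- 1 - rowSums(m)

out <- df
out[vars] <- as.data.frame(m)
print(out)
```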

【Comments】: