Difference between column values across rows: answers

【Question title】: Difference between the column values across rows
【Posted】: 2018-01-18 08:33:25
【Question description】:

Given a data frame like the one below:

text <- "
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48
"
df <- read.table(textConnection(text), sep=",", header = TRUE)

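For each location_id, I need to compute the difference between every brand's count, driven_km, efficiency, mileage, age values and the corresponding Tesla values. The difference should be computed as Value for i - Value for Tesla, where i = {"Toyota", "Honda", "Nissan", ...}. Some location_ids may have no Tesla row, or may have only a Tesla row; these need to be ignored, because the difference is meaningless for those location_ids.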

I am looking for an elegant way to do this, preferably a dplyr way.

Expected output

location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,-8,-817,-0.08,206,1.79
23040204995,Honda,7,-1271,-0.27,-9.5,0.07
23040204996,Toyota,-8,2686,0.33,-48,-0.1
23040204996,Nissan,14,1612,0.05,388.5,0.26

【Question comments】:

Tags: r dplyr

【Solution 1】:

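Using data.table, grouped by 'location_id', we specify the columns to difference in .SDcols and loop over the Subset of Data.table (.SD) to get the differences. Groups with no Tesla row, or with only a Tesla row, yield zero-length results and therefore drop out.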

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[brand != "Tesla"] - 
      x[brand == "Tesla"]), location_id, .SDcols = count:age]

If the corresponding 'brand' column is needed as well:

setDT(df)[, c(list(brand = brand), lapply(.SD, function(x) if("Tesla" %in% brand) 
   as.numeric(x - x[brand == "Tesla"]) else NA_real_)), location_id, .SDcols = count:age
      ][brand != "Tesla" & !is.na(count)]
#  location_id  brand count driven_km efficiency mileage   age
#1: 23040204995 Toyota    -8      -817      -0.08   206.0  1.79
#2: 23040204995  Honda     7     -1271      -0.27    -9.5  0.07
#3: 23040204996 Toyota    -8      2686       0.33   -48.0 -0.10
#4: 23040204996 Nissan    14      1612       0.05   388.5  0.26

Or, if we use the tidyverse:

library(dplyr)
library(tidyr)
gather(df, key, val, count:age) %>%
   group_by(location_id, key) %>%
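   # Keeps the groups that contain "Toyota"; in this sample every such group
   # also has a "Tesla" row. In general the condition needed here is
   # "Tesla" %in% brand, since the next step indexes the Tesla row.
   filter("Toyota" %in% brand) %>% 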
   mutate(val = val- val[brand == "Tesla"]) %>% 
   filter(brand != "Tesla") %>% 
   ungroup %>%
   mutate_at(vars(brand, key), funs(factor(., levels = unique(.)))) %>% 
   spread(key, val)
# A tibble: 4 x 7
#  location_id brand   count driven_km efficiency mileage     age
#*       <dbl> <fctr>  <dbl>     <dbl>      <dbl>   <dbl>   <dbl>
#1 23040204995 Toyota - 8.00     - 817    -0.0800  206     1.79  
#2 23040204995 Honda    7.00     -1271    -0.270  -  9.50  0.0700
#3 23040204996 Toyota - 8.00      2686     0.330  - 48.0  -0.100 
#4 23040204996 Nissan  14.0       1612     0.0500  388     0.260
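
The same grouped subtraction can also be sketched without reshaping. The block below is a minimal sketch, not part of the original answer: it assumes dplyr 1.0 or later (for across()), assumes each location_id has at most one Tesla row, and introduces the name ref_brand purely for illustration. It checks for the reference brand directly, so locations without Tesla, or with only Tesla, are dropped before the subtraction.

```r
library(dplyr)

# Rebuild the sample data from the question so the block runs on its own.
text <- "
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48
"
df <- read.table(text = text, sep = ",", header = TRUE, stringsAsFactors = FALSE)

ref_brand <- "Tesla"  # illustrative name, not from the original answer

res <- df %>%
  group_by(location_id) %>%
  # Keep locations that have the reference brand AND at least one other brand.
  filter(any(brand == ref_brand), any(brand != ref_brand)) %>%
  # Subtract the reference row's value from every row in the group
  # (assumes exactly one reference row per remaining group).
  mutate(across(count:age, ~ .x - .x[brand == ref_brand])) %>%
  # The reference rows are now all zero, so drop them.
  filter(brand != ref_brand) %>%
  ungroup()

print(res)
```

Because nothing is gathered and spread, the row and column order of the input is kept and no factor conversion is needed.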

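【Comments】:

Interesting, the tidyverse version is very similar to my suggestion. What does filter("Toyota" %in% brand) do?
@bdecaf I used gather/spread much like you did, but you used inner_join, which I did not. The filter keeps only the groups that have "Toyota" in the 'brand' column.
I see. I like how your answer preserves the ordering. Is it mutate_at(vars(brand, key), funs(factor(., levels = unique(.)))) that does this?
@bdecaf Yes, spread reorders alphabetically. To avoid that, I went the factor route. Thanks.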
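
The ordering point in this exchange can be illustrated in base R alone. This is a small sketch with a made-up vector, not code from the thread: a factor built with default levels sorts them alphabetically, while levels = unique(.) keeps the order of first appearance, which is what lets the reshaped result keep the input order.

```r
# Illustrative vector in the order the brands appear in the data.
brands <- c("Toyota", "Honda", "Toyota", "Nissan")

# Default factor(): levels are sorted alphabetically.
alphabetical <- factor(brands)
levels(alphabetical)

# levels = unique(x): levels follow the order of first appearance.
as_seen <- factor(brands, levels = unique(brands))
levels(as_seen)

# Sorting by each factor shows the resulting row order.
brands[order(alphabetical)]
brands[order(as_seen)]
```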

【Solution 2】:

So I would go through tidyr to make this dplyr-friendly.

library(dplyr)  # filter, inner_join, mutate, select and %>% come from dplyr
library(tidyr)

dfl <- gather(df, "key", "value", -location_id, -brand)
dflt <- dfl %>% filter(brand == "Tesla")
dfln <- dfl %>% filter(brand != "Tesla")

inner_join(dflt,  dfln, by = c("location_id", "key")) %>% 
    mutate(value = value.y - value.x) %>% 
    select(location_id, brand = brand.y, key, value) %>% 
    spread(key,value)

#   location_id  brand   age count driven_km efficiency mileage
# 1 23040204995  Honda  0.07     7     -1271      -0.27    -9.5
# 2 23040204995 Toyota  1.79    -8      -817      -0.08   206.0
# 3 23040204996 Nissan  0.26    14      1612       0.05   388.5
# 4 23040204996 Toyota -0.10    -8      2686       0.33   -48.0

The column order is different, but you can rearrange the columns.
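
One way to get the original column order back is a final select() listing the columns explicitly. Another, sketched below, is to use pivot_longer()/pivot_wider(), which in tidyr 1.0 and later create the wide columns in order of first appearance. This is a rough sketch of the same join idea under that version assumption, not the answerer's code; the intermediate names long, tesla and others are illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the sample data from the question so the block runs on its own.
text <- "
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48
"
df <- read.table(text = text, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Long format: one row per location, brand and measure.
long <- pivot_longer(df, cols = count:age, names_to = "key", values_to = "value")
tesla  <- filter(long, brand == "Tesla")
others <- filter(long, brand != "Tesla")

# inner_join drops locations lacking either a Tesla row or a non-Tesla row.
# `others` is on the left, so value.x is the brand and value.y is Tesla.
res <- inner_join(others, tesla, by = c("location_id", "key")) %>%
  mutate(value = value.x - value.y) %>%
  select(location_id, brand = brand.x, key, value) %>%
  pivot_wider(names_from = key, values_from = value)

print(res)
```

With the non-Tesla rows on the left side of the join, the rows also come out in the input order, so no factor conversion is required.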

【Comments】: