【Question title】: Difference between the column values across rows
【Posted】: 2018-01-18 08:33:25
【Question】:

Using a data frame like the one shown below:

text <- "
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)

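For each location_id, I need to compute the difference between every brand's count, driven_km, efficiency, mileage and age values and the Tesla values at that location. The difference should be computed as Value for i - Value for Tesla, where i = {"Toyota", "Honda", "Nissan", ..}. Some location_ids may have no Tesla row, or only a Tesla row; these need to be ignored, because the difference is meaningless for those location_ids.

I am looking for an elegant way to do this, preferably a dplyr way.

Expected output: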

location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,-8,-817,-0.08,206,1.79
23040204995,Honda,7,-1271,-0.27,-9.5,0.07
23040204996,Toyota,-8,2686,0.33,-48,-0.1
23040204996,Nissan,14,1612,0.05,388.5,0.26

【Question comments】:

    Tags: r dplyr


    【Solution 1】:

    With data.table, we group by 'location_id', specify the columns to difference in .SDcols, and loop over the Subset of Data.table (.SD) to get the differences.

    library(data.table)
    setDT(df)[, lapply(.SD, function(x) x[brand != "Tesla"] - 
          x[brand == "Tesla"]), location_id, .SDcols = count:age]
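
Why the unwanted locations drop out without an explicit filter may not be obvious. The base R sketch below suggests the reason: arithmetic with a zero-length vector yields a zero-length result, so a group with only Tesla, or with no Tesla, contributes no rows. The helper `diff_from_tesla` and the input vectors are made up for illustration and mirror single groups of the sample data.

```r
# Hypothetical helper: difference of every non-Tesla value from the Tesla
# value within one group, the same expression used inside lapply(.SD, ...).
diff_from_tesla <- function(x, brand) x[brand != "Tesla"] - x[brand == "Tesla"]

# A group with Tesla and other brands (count values of location 23040204995)
both <- diff_from_tesla(c(8, 23, 16), c("Toyota", "Honda", "Tesla"))

# A group with only Tesla (location 23040204997)
only_tesla <- diff_from_tesla(11, "Tesla")

# A group without Tesla (location 23040204998)
no_tesla <- diff_from_tesla(24, "Honda")

print(both)               # -8  7
print(length(only_tesla)) # 0
print(length(no_tesla))   # 0
```

This relies on each location having at most one Tesla row; with several Tesla rows the subtraction would recycle values instead.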
    

    If the corresponding 'brand' column is also needed:

    setDT(df)[, c(list(brand = brand), lapply(.SD, function(x) if("Tesla" %in% brand) 
       as.numeric(x - x[brand == "Tesla"]) else NA_real_)), location_id, .SDcols = count:age
          ][brand != "Tesla" & !is.na(count)]
    #  location_id  brand count driven_km efficiency mileage   age
    #1: 23040204995 Toyota    -8      -817      -0.08   206.0  1.79
    #2: 23040204995  Honda     7     -1271      -0.27    -9.5  0.07
    #3: 23040204996 Toyota    -8      2686       0.33   -48.0 -0.10
    #4: 23040204996 Nissan    14      1612       0.05   388.5  0.26
    

    Or, if we use the tidyverse:

    library(dplyr)
    library(tidyr)
    gather(df, key, val, count:age) %>%
       group_by(location_id, key) %>%
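       # Keep only groups that contain a Tesla row. Testing for "Toyota" here
       # gives the same result on this sample, but breaks on groups that have
       # one of the two brands without the other.
       filter("Tesla" %in% brand) %>% 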
       mutate(val = val- val[brand == "Tesla"]) %>% 
       filter(brand != "Tesla") %>% 
       ungroup %>%
       mutate_at(vars(brand, key), funs(factor(., levels = unique(.)))) %>% 
       spread(key, val)
    # A tibble: 4 x 7
    #  location_id brand   count driven_km efficiency mileage     age
    #*       <dbl> <fctr>  <dbl>     <dbl>      <dbl>   <dbl>   <dbl>
    #1 23040204995 Toyota - 8.00     - 817    -0.0800  206     1.79  
    #2 23040204995 Honda    7.00     -1271    -0.270  -  9.50  0.0700
    #3 23040204996 Toyota - 8.00      2686     0.330  - 48.0  -0.100 
    #4 23040204996 Nissan  14.0       1612     0.0500  388     0.260 
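
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same pipeline with those functions, assuming reasonably recent versions of dplyr and tidyr; the name `res` is introduced here for illustration. Since pivot_wider() keeps rows and columns in order of first appearance, the factor step should not be needed.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- read.csv(text = "
location_id,brand,count,driven_km,efficiency,mileage,age
23040204995,Toyota,8,2761,0.57,333,2.17
23040204995,Honda,23,2307,0.38,117.5,0.45
23040204995,Tesla,16,3578,0.65,127,0.38
23040204996,Toyota,16,3578,0.65,127,0.38
23040204996,Nissan,38,2504,0.37,563.5,0.74
23040204996,Tesla,24,892,0.32,175,0.48
23040204997,Tesla,11,1879.5,0.67,298.5,0.57
23040204998,Honda,24,892,0.32,175,0.48
")

res <- df %>%
  # long format: one row per location, brand and measured column
  pivot_longer(count:age, names_to = "key", values_to = "val") %>%
  group_by(location_id, key) %>%
  # drop locations that have no Tesla row
  filter("Tesla" %in% brand) %>%
  # subtract the Tesla value of the same location and column
  mutate(val = val - val[brand == "Tesla"]) %>%
  ungroup() %>%
  # Tesla minus Tesla is always zero, so drop those rows
  filter(brand != "Tesla") %>%
  # back to wide format
  pivot_wider(names_from = key, values_from = val)

print(res)
```

Locations with only a Tesla row disappear in the final filter(), because nothing but the Tesla row is left to remove.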
    

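    【Comments】:

    • Interesting, the tidyverse version is very similar to my suggestion. What does filter("Toyota" %in% brand) do?
    • @bdecaf I used gather/spread much like you did, but you used inner_join, which I did not. The filter keeps only those groups that have "Toyota" in the 'brand' column.
    • I see. I like how your answer preserves the ordering. Is it mutate_at(vars(brand, key), funs(factor(., levels = unique(.)))) that does this?
    • @bdecaf Yes, spread would reorder alphabetically. To avoid that, I went the factor route. Thanks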
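
To illustrate the ordering point from the comments: by default factor() sorts its levels, while passing levels = unique(x) keeps them in order of first appearance. A small base R sketch (the vector `brand` is made up for illustration):

```r
brand <- c("Toyota", "Honda", "Tesla", "Toyota", "Nissan")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(brand))

# levels = unique(x): levels follow the order of first appearance
appearance_levels <- levels(factor(brand, levels = unique(brand)))

print(default_levels)     # "Honda"  "Nissan" "Tesla"  "Toyota"
print(appearance_levels)  # "Toyota" "Honda"  "Tesla"  "Nissan"
```

As the answer describes, spread() then follows the factor levels rather than alphabetical order, which is how the original ordering survives.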
    【Solution 2】:

    So I would go through tidyr to make this dplyr-friendly.

    library(dplyr)   # filter, inner_join, mutate, select
    library(tidyr)   # gather, spread
    
    dfl <- gather(df, "key", "value", -location_id, -brand)
    dflt <- dfl %>% filter(brand == "Tesla")
    dfln <- dfl %>% filter(brand != "Tesla")
    
    inner_join(dflt,  dfln, by = c("location_id", "key")) %>% 
        mutate(value = value.y - value.x) %>% 
        select(location_id, brand = brand.y, key, value) %>% 
        spread(key,value)
    
    #   location_id  brand   age count driven_km efficiency mileage
    # 1 23040204995  Honda  0.07     7     -1271      -0.27    -9.5
    # 2 23040204995 Toyota  1.79    -8      -817      -0.08   206.0
    # 3 23040204996 Nissan  0.26    14      1612       0.05   388.5
    # 4 23040204996 Toyota -0.10    -8      2686       0.33   -48.0
    

    The columns come out in a different order, but you can rearrange them.
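
One way to restore the original column order is to index the result by the column names of the input. A base R sketch, where `res` is a made-up stand-in for the first two rows of the output above and `wanted` stands in for what names(df) would return:

```r
# Stand-in for the result above, with alphabetically ordered value columns
res <- data.frame(
  location_id = c(23040204995, 23040204995),
  brand       = c("Honda", "Toyota"),
  age         = c(0.07, 1.79),
  count       = c(7, -8),
  driven_km   = c(-1271, -817),
  efficiency  = c(-0.27, -0.08),
  mileage     = c(-9.5, 206)
)

# Column order of the original data frame
wanted <- c("location_id", "brand", "count", "driven_km",
            "efficiency", "mileage", "age")

# Reorder the columns by name
res <- res[, wanted]
print(names(res))
```

With dplyr, select(res, all_of(wanted)) should give the same result. If the result is a data.table rather than a plain data frame, the bracket form needs with = FALSE.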

    【Comments】:
