【Question title】: Is there a better way to calculate differences between columns and reorder columns after using dcast?
【Posted】: 2019-06-14 07:48:12
【Question description】:

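Below is a sample of my data. I am preparing data for a data table, and after I apply the dcast function the columns must be in a very specific order. I am also trying to calculate the differences between some of the columns. The goal is to get the columns in the order State, Region, 1_2017, 1_2018, 1_diff, 2_2017, 2_2018, 2_diff, and so on.

I tried computing the differences and ordering the columns by naming each column explicitly, but that seems like a very poor approach, especially since my real data has more than 50 columns. Below is sample data with the logic I used.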

    library(reshape2)
    library(dplyr)



    #Data

    data<-data.frame("State"=c("AK","AK","AK","AK","AK","AK","AK","AK","AR","AR","AR","AR","AR","AR","AR","AR"),
                     "StoreRank" = c(1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2),
                     "Year" = c(2017,2018,2017,2018,2017,2018,2017,2018,2017,2018,2017,2018,2017,2018,2017,2018),
                     "Region" = c("East","East","West","West","East","East","West","West","East","East","West","West","East","East","West","West"),
                     "Store" = c("Ingles","Ingles","Ingles","Ingles","Safeway","Safeway","Safeway","Safeway","Albertsons","Albertsons","Albertsons","Albertsons","Safeway","Safeway","Safeway","Safeway"),
                     "Total" = c(500000,520000,480000,485000,600000,600000,500000,515000,500100,520100,480100,485100,601010,601000,501000,515100))



    #Formatting data for Data table
    data<-dcast(data, State+Region~StoreRank+Year, value.var = 'Total')

    #Function to calculate difference between columns
    diff_calculation <- function(data) {
      mutate(data,
             `1_diff` = data$`1_2018`-data$`1_2017`,
             `2_diff` = data$`2_2018`-data$`2_2017`)}

    #Applying difference calculation function
    reform.data<-diff_calculation(data)

    #Changes the column name prefixes from numbers to letters to try to order the columns
    names(reform.data)<-gsub(x = colnames(reform.data), pattern="1_", replacement = "a_")
    names(reform.data)<-gsub(x = colnames(reform.data), pattern="2_", replacement = "b_")


    #Trying to order columns as State, Region, 1_2017, 1_2018, 1_diff, 2_2017, 2_2018, 2_diff, etc.
    ordered.data<-reform.data[,order(names(reform.data))]

    final.data<-ordered.data %>%
      select('State', 'Region', 'a_2017', 'a_2018', 'a_diff', 'b_2017', 'b_2018', 'b_diff')

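I am hoping to find a better way to calculate the differences between columns and to order the columns after applying dcast to data with a large number of columns.

【Question discussion】:

    Tags: r dplyr reshape2 dt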


    【Solution 1】:

    One approach is to handle this in long format, for example with the tidyverse:

    library(tidyverse)
    
    # `data` is the original long data frame, before the dcast call in the question
    long_format <- data %>%
      mutate(
        StoreRank = ifelse(StoreRank == 1, "a", "b"),
        diff_col = paste(StoreRank, "diff", sep = "_"),
        Year = paste(StoreRank, Year, sep = "_")
      ) %>% group_by(State, Region, StoreRank) %>%
      mutate(diff = Total - lag(Total)) %>%
      fill(diff, .direction = "up") %>% ungroup()
    
    final_df <- bind_rows(
      long_format %>% select(State, Region, Year, Total),
      long_format %>% select(State, Region, Year = diff_col, Total = diff)) %>% 
      arrange(Year) %>%
      rowid_to_column %>%
      spread(Year, Total) %>%
      group_by(State, Region) %>%
      summarise_all(~ first(na.omit(.))) %>%  # funs() is deprecated in current dplyr
      select(-rowid)
    

    Output:

    # A tibble: 4 x 8
    # Groups:   State [2]
      State Region a_2017 a_2018 a_diff b_2017 b_2018 b_diff
      <fct> <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
    1 AK    East   500000 520000  20000 600000 600000      0
    2 AK    West   480000 485000   5000 500000 515000  15000
    3 AR    East   500100 520100  20000 601010 601000    -10
    4 AR    West   480100 485100   5000 501000 515100  14100
    

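    【Discussion】:

    • Thanks for your help, but it doesn't quite work. For example, if I change the ranks to 3 and 12, the code orders the columns as 12_2017, 12_2018, 12_diff, 3_2018, 3_2017, and so on.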
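
The ordering in that comment looks like the result of sorting column names as strings, where "12_..." sorts before "3_...". A minimal sketch of one way around it is below; it is not from the original answer. It builds the column order from the numeric rank values instead of sorting names. The names `raw`, `wide`, `ranks`, `years`, `ordered_cols`, and `result` are made up for illustration, the sample data is a cut-down version of the question's data with ranks 3 and 12, and the sketch assumes exactly two years per rank.

```r
library(reshape2)

# Hypothetical sample in the same shape as the question's data,
# using ranks 3 and 12 (the case raised in the comment).
raw <- data.frame(
  State     = rep("AK", 8),
  Region    = rep(c("East", "West"), each = 4),
  StoreRank = rep(c(3, 3, 12, 12), times = 2),
  Year      = rep(c(2017, 2018), times = 4),
  Total     = c(500000, 520000, 600000, 600000,
                480000, 485000, 500000, 515000)
)

# Same reshape as in the question: one column per StoreRank_Year pair.
wide <- dcast(raw, State + Region ~ StoreRank + Year, value.var = "Total")

# Sort ranks and years as numbers, not as strings.
ranks <- sort(unique(raw$StoreRank))
years <- sort(unique(raw$Year))  # assumption: exactly two years

# Add one <rank>_diff column per rank (later year minus earlier year).
for (r in ranks) {
  later   <- paste0(r, "_", years[2])
  earlier <- paste0(r, "_", years[1])
  wide[[paste0(r, "_diff")]] <- wide[[later]] - wide[[earlier]]
}

# Build the desired column order explicitly from the numeric ranks.
ordered_cols <- c(
  "State", "Region",
  unlist(lapply(ranks, function(r) paste0(r, "_", c(years, "diff"))))
)
result <- wide[, ordered_cols]

print(names(result))
```

Because the order is generated from `ranks` rather than from sorted names, it should not matter how many ranks there are or how many digits they have, so the letter-renaming step from the question is not needed. With more than two years per rank the difference step would have to be redefined.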