【Question title】: Reshape large dataset with multiple columns from wide to long
【Posted】: 2018-06-15 16:26:03
【Question】:

I have a very large dataset that I need to reshape from wide to long.

My dataset looks like this:

  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900

I would like it to look like this:

  COMPANY    PRODUCT    DATE  REVENUES  COSTS
COMPANY A  PRODUCT 1  Dec-16     10600   9850
COMPANY A  PRODUCT 1  Feb-10     11050  10400
COMPANY A  PRODUCT 1  Jan-10      6400   8500
COMPANY A  PRODUCT 1  Mar-10      6550   9100
COMPANY A  PRODUCT 2  Dec-16      3800   3250
COMPANY A  PRODUCT 2  Feb-10      3000   2400
COMPANY A  PRODUCT 2  Jan-10      2700   2850
COMPANY A  PRODUCT 2  Mar-10      2800   3100
COMPANY B  PRODUCT 3  Dec-16      3750   4600
COMPANY B  PRODUCT 3  Feb-10      4150   6100
COMPANY B  PRODUCT 3  Jan-10      5900   4200
COMPANY B  PRODUCT 3  Mar-10      5750   2950
COMPANY B  PRODUCT 4  Dec-16       650    500
COMPANY B  PRODUCT 4  Feb-10       600    700
COMPANY B  PRODUCT 4  Jan-10       550    200
COMPANY B  PRODUCT 4  Mar-10         0    100
COMPANY B  PRODUCT 5  Dec-16      2100    450
COMPANY B  PRODUCT 5  Feb-10      3750   1700
COMPANY B  PRODUCT 5  Jan-10      1500   1850
COMPANY B  PRODUCT 5  Mar-10       550   3150
COMPANY C  PRODUCT 6  Dec-16     21250  23900
COMPANY C  PRODUCT 6  Feb-10     17250  26950
COMPANY C  PRODUCT 6  Jan-10     19300  18200
COMPANY C  PRODUCT 6  Mar-10     23600  18200

In Stata, I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string

How can I do this in R?

【Question comments】:

    Tags: r reshape


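    【Solution 1】:

    There are a few other approaches that are somewhat more streamlined than the "tidyverse" options already suggested.

    All of the examples below use the sample data from JMT2080AD's answer, with set.seed(1) for reproducibility.

    Option 1: base R's reshape

    It's not always the easiest function to use, but reshape is very powerful once you figure it out. In this case you have no sep, which makes things a bit trickier, because you have to be more specific about the resulting variable names and the values that should appear as the "times" (by default they would just be sequential numbers).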

    times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
    # Pass `varying` as a list with one vector per output variable. A flat vector
    # combined with `v.names` is read as ordered time by time (x.1, y.1, x.2, ...),
    # which would mix up the revenue and cost columns here.
    reshape(yourData, direction = "long",
            varying = list(grep("^revenues", names(yourData), value = TRUE),
                           grep("^cost", names(yourData), value = TRUE)),
            v.names = c("revenues", "cost"), timevar = "date", times = times)
    #             company   product    date revenues cost id
    # 1.Jan2010 Company A Product 1 Jan2010     1164 1168  1
    # 2.Jan2010 Company A Product 2 Jan2010     1430 1465  2
    # 3.Jan2010 Company B Product 3 Jan2010     1932  533  3
    # 4.Jan2010 Company B Product 4 Jan2010     2771 1456  4
    # 5.Jan2010 Company B Product 5 Jan2010     1004 2674  5
    # ....
    

    This is essentially what you're looking for, apart from the date format.
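
One thing worth checking with reshape() is how `varying` lines up with `v.names`. The help page notes that a flat `varying` vector is assumed to be ordered like x.1, y.1, x.2, y.2, so columns grouped by variable (all revenues first, then all costs) can end up under the wrong name. The sketch below compares the list form and the flat form on a tiny data frame; `toy`, `as_list`, and `as_flat` are made-up names for illustration only.

```r
# Toy data (hypothetical names): two measures at two time points
toy <- data.frame(id = 1:2,
                  revenuesJan = c(10, 20), revenuesFeb = c(11, 21),
                  costJan = c(1, 2), costFeb = c(3, 4))

# List form: one character vector per output variable, each in time order
as_list <- reshape(toy, direction = "long", idvar = "id",
                   varying = list(c("revenuesJan", "revenuesFeb"),
                                  c("costJan", "costFeb")),
                   v.names = c("revenues", "cost"),
                   timevar = "date", times = c("Jan", "Feb"))

# Flat form: the same columns as a single vector, grouped by variable
as_flat <- reshape(toy, direction = "long", idvar = "id",
                   varying = c("revenuesJan", "revenuesFeb", "costJan", "costFeb"),
                   v.names = c("revenues", "cost"),
                   timevar = "date", times = c("Jan", "Feb"))

# Compare the January revenues against the wide column they should come from
list_ok <- identical(as_list$revenues[as_list$date == "Jan"], toy$revenuesJan)
flat_ok <- identical(as_flat$revenues[as_flat$date == "Jan"], toy$revenuesJan)
c(list = list_ok, flat = flat_ok)
```

A quick comparison like this against one or two known wide columns is a cheap sanity check before trusting the reshaped result on a large dataset.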

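    Option 2: data.table

    If performance is what you're after, you can look at melt from "data.table", with which you should be able to do the following. As with the reshape approach, you need to store the "times" so you can reintroduce the dates after melting the data.

    (Note: I know this is very similar to @Uwe's approach.)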

    library(data.table)
    times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
    melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
         value.name = c("revenues", "cost"))[
           , variable := factor(variable, labels = times)][]
    #       company   product variable revenues cost
    #  1: Company A Product 1  Jan2010     1164 1168
    #  2: Company A Product 2  Jan2010     1430 1465
    #  3: Company B Product 3  Jan2010     1932  533
    #  4: Company B Product 4  Jan2010     2771 1456
    #  5: Company B Product 5  Jan2010     1004 2674
    # ---                                           
    # 20: Company A Product 2  Apr2010     2444 1883
    # 21: Company B Product 3  Apr2010     2837 1824
    # 22: Company B Product 4  Apr2010     1030 2473
    # 23: Company B Product 5  Apr2010     2129  558
    # 24: Company C Product 6  Apr2010      814 1693
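
The options above leave the date as a label such as Jan2010, while the question asks for something like Jan-10. A minimal base R sketch of that last step follows, assuming the labels always end in a four-digit year; `labels` and `pretty` are illustrative names, not part of any package.

```r
# Labels in the form produced by the reshaping options above
labels <- c("Jan2010", "Feb2010", "Mar2010", "Dec2016")

# Keep the first three letters of the month and the last two digits of the year
pretty <- sub("^([A-Za-z]{3})[A-Za-z]*[0-9]{2}([0-9]{2})$", "\\1-\\2", labels)
pretty
```

This is plain string manipulation, so the result is still a character label and will sort alphabetically rather than by date.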
    

    Option 3: merged.stack

    My "splitstackshape" package has a function called merged.stack that tries to make this particular kind of reshaping easier. With it, you can try:

    library(splitstackshape)
    merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
    #       company   product .time_1 revenues cost
    #  1: Company A Product 1 Apr2010     1450 2457
    #  2: Company A Product 1 Feb2010     2862 1705
    #  3: Company A Product 1 Jan2010     1164 1168
    #  4: Company A Product 1 Mar2010     2218 2486
    #  5: Company A Product 2 Apr2010     2444 1883
    #  6: Company A Product 2 Feb2010     2152 1999
    #  7: Company A Product 2 Jan2010     1430 1465
    #  8: Company A Product 2 Mar2010     1460  770
    #  9: Company B Product 3 Apr2010     2837 1824
    # 10: Company B Product 3 Feb2010     2073 1734
    # ... 
    

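    One of these days I'll get around to updating the function, which was written before melt in "data.table" could handle a semi-wide output format. I had already come up with a partial solution, but then I stopped tinkering with it.

    In fact, using the function linked above, the solution would be simply: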

    ReshapeLong_(yourData, c("revenues", "cost"))
    

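    Option 4: extract from the "tidyverse"

    The other solutions using the tidyverse seem to go about things in a rather odd way. A better solution is to use extract to get the data you need into new columns. You first have to gather the data into a very long format, and then spread it back into a wider format.

    This is the approach I would use: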

    library(tidyverse)
    yourData %>% 
      gather(var, val, -company, -product) %>%
      extract(var, into = c("type", "month", "year"), 
              regex = ("(revenues|cost)(...)(.*)")) %>%
      spread(type, val)
    #      company   product month year cost revenues
    # 1  Company A Product 1   Apr 2010 2457     1450
    # 2  Company A Product 1   Feb 2010 1705     2862
    # 3  Company A Product 1   Jan 2010 1168     1164
    # 4  Company A Product 1   Mar 2010 2486     2218
    # 5  Company A Product 2   Apr 2010 1883     2444
    # 6  Company A Product 2   Feb 2010 1999     2152
    # ...
    

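    【Comments】:

      【Solution 2】:

      The tricky part here is that you have dates packed into the column names. Those have to be parsed out before you can make the table the way you want it. I loop over each column, parse the column name of each sub table to get the date and type of observation, bind the sub tables together, and then cast on cost/revenue. I'm sure there is a more elegant solution.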

      library(reshape)
      
      ## making a table similar to yours here
      yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                             product = paste("Product", 1:6),
                             revenuesJan2010 = round(runif(6, 500, 3000)),
                             revenuesFeb2010 = round(runif(6, 500, 3000)),
                             revenuesMar2010 = round(runif(6, 500, 3000)),
                             revenuesApr2010 = round(runif(6, 500, 3000)),
                             costJan2010 = round(runif(6, 500, 3000)),
                             costFeb2010 = round(runif(6, 500, 3000)),
                             costMar2010 = round(runif(6, 500, 3000)),
                             costApr2010 = round(runif(6, 500, 3000)))
      
      ## a function that parses the date from the column name
      columnParse <- function(tab){
          colNm   <- names(tab)[3]
          names(tab)[3] <- "value"
          colDate  <- strsplit(colNm, "revenues|cost")[[1]][2]
          colDate  <- gsub("([A-Za-z]+)", "\\1-", colDate)
          tab$date <- colDate
          tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
          return(tab)
      }
      
      ## running that function against sub tables of your data, then binding
      yourDataLong <- do.call(rbind,
                              lapply(3:ncol(yourData),
                                     function(x) columnParse(yourData[c(1:2, x)])))
      
      ## casting your data on cost/revenue
      yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")
      

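      【Comments】:

      • As far as I know, the "reshape" package is not under active development. You might want to switch to "reshape2", "data.table", or "tidyr"...
      【Solution 3】:

      Here is another option using tidyverse and stringr: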

      yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                         product = paste("Product", 1:6),
                         REVENUESJan2010 = round(runif(6, 500, 3000)),
                         REVENUESFeb2010 = round(runif(6, 500, 3000)),
                         REVENUESMar2010 = round(runif(6, 500, 3000)),
                         REVENUESApr2010 = round(runif(6, 500, 3000)),
                         COSTSJan2010 = round(runif(6, 500, 3000)),
                         COSTSFeb2010 = round(runif(6, 500, 3000)),
                         COSTSMar2010 = round(runif(6, 500, 3000)),
                         COSTSApr2010 = round(runif(6, 500, 3000)))
      

      The solution using tidyverse and stringr:

      library(tidyverse)
      library(stringr)
      
      newData <- yourData %>%
         gather(key = rev.cost.date, value, -company, -product) %>%
         mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
         mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
         select(-rev.cost.date) %>%
         spread(value = value, key = finance.type) %>%
         mutate(date = paste0(str_sub(date, 1, 3), "-", str_sub(date, 4, 8)))
      

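      【Comments】:

        【Solution 4】:

        As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So the columns starting with REVENUES and COSTS can be gathered into two separate columns.

        In addition, the dates (months) are packed into the column names without a separator. They are extracted from the column names using a regular expression with a lookbehind and are used to replace the factor levels of the DATE column.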

        library(data.table)
        library(magrittr)
        cols <- c("REVENUES", "COSTS")
        long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
        months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() 
        long[, DATE := forcats::lvls_revalue(DATE, months)]
        long
        
              COMPANY   PRODUCT      DATE REVENUES COSTS
         1: COMPANY A PRODUCT 1   JAN2010     6400  8500
         2: COMPANY A PRODUCT 2   JAN2010     2700  2850
         3: COMPANY B PRODUCT 3   JAN2010     5900  4200
         4: COMPANY B PRODUCT 4   JAN2010      550   200
         5: COMPANY B PRODUCT 5   JAN2010     1500  1850
         6: COMPANY C PRODUCT 6   JAN2010    19300 18200
         7: COMPANY A PRODUCT 1   FEB2010    11050 10400
         8: COMPANY A PRODUCT 2   FEB2010     3000  2400
         9: COMPANY B PRODUCT 3   FEB2010     4150  6100
        10: COMPANY B PRODUCT 4   FEB2010      600   700
        11: COMPANY B PRODUCT 5   FEB2010     3750  1700
        12: COMPANY C PRODUCT 6   FEB2010    17250 26950
        13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
        14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
        15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
        16: COMPANY B PRODUCT 4 MARCH2010        0   100
        17: COMPANY B PRODUCT 5 MARCH2010      550  3150
        18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
        19: COMPANY A PRODUCT 1   DEC2016    10600  9850
        20: COMPANY A PRODUCT 2   DEC2016     3800  3250
        21: COMPANY B PRODUCT 3   DEC2016     3750  4600
        22: COMPANY B PRODUCT 4   DEC2016      650   500
        23: COMPANY B PRODUCT 5   DEC2016     2100   450
        24: COMPANY C PRODUCT 6   DEC2016    21250 23900
              COMPANY   PRODUCT      DATE REVENUES COSTS
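
The lookbehind extraction used above can also be written without stringr. The sketch below uses base R's regexpr() and regmatches() with perl = TRUE on a hypothetical vector of column names (`nms`); it is only meant to show what the pattern picks out.

```r
nms <- c("COMPANY", "PRODUCT", "REVENUESJAN2010", "REVENUESMARCH2010",
         "COSTSJAN2010", "COSTSMARCH2010")

# perl = TRUE enables the lookbehind; regexpr() returns -1 for names that do
# not match, and regmatches() drops those entries
m <- regexpr("(?<=REVENUES)\\w*$", nms, perl = TRUE)
months <- regmatches(nms, m)
months
```

Only the part after REVENUES is returned, and names without that prefix (including the COSTS columns) are skipped, which plays the role of the na.omit() step in the stringr version.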
        

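        Edit: Using an ISO month naming scheme for proper ordering

        A naming scheme with alphabetical month names followed by the year does not allow the data to be sorted properly by DATE: DEC2016 comes before FEB2010, and FEB2010 before JAN2010. The ISO 8601 naming convention puts the year first, followed by the month number.

        We can use this naming scheme as follows: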

        months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
          paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
        long[, DATE := forcats::lvls_revalue(DATE, months)]
        long
        
              COMPANY   PRODUCT    DATE REVENUES COSTS
         1: COMPANY A PRODUCT 1 2010-01     6400  8500
         2: COMPANY A PRODUCT 2 2010-01     2700  2850
         3: COMPANY B PRODUCT 3 2010-01     5900  4200
         4: COMPANY B PRODUCT 4 2010-01      550   200
         5: COMPANY B PRODUCT 5 2010-01     1500  1850
         6: COMPANY C PRODUCT 6 2010-01    19300 18200
         7: COMPANY A PRODUCT 1 2010-02    11050 10400
         8: COMPANY A PRODUCT 2 2010-02     3000  2400
         9: COMPANY B PRODUCT 3 2010-02     4150  6100
        10: COMPANY B PRODUCT 4 2010-02      600   700
        11: COMPANY B PRODUCT 5 2010-02     3750  1700
        12: COMPANY C PRODUCT 6 2010-02    17250 26950
        13: COMPANY A PRODUCT 1 2010-03     6550  9100
        14: COMPANY A PRODUCT 2 2010-03     2800  3100
        15: COMPANY B PRODUCT 3 2010-03     5750  2950
        16: COMPANY B PRODUCT 4 2010-03        0   100
        17: COMPANY B PRODUCT 5 2010-03      550  3150
        18: COMPANY C PRODUCT 6 2010-03    23600 18200
        19: COMPANY A PRODUCT 1 2016-12    10600  9850
        20: COMPANY A PRODUCT 2 2016-12     3800  3250
        21: COMPANY B PRODUCT 3 2016-12     3750  4600
        22: COMPANY B PRODUCT 4 2016-12      650   500
        23: COMPANY B PRODUCT 5 2016-12     2100   450
        24: COMPANY C PRODUCT 6 2016-12    21250 23900
              COMPANY   PRODUCT    DATE REVENUES COSTS
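
To see the ordering problem in isolation, here is a small base R sketch that builds the same year-month labels without lubridate. It assumes English month names whose first three letters match R's built-in month.abb; `labels`, `month_num`, `year`, and `iso` are illustrative names.

```r
labels <- c("JAN2010", "FEB2010", "MARCH2010", "DEC2016")

# month.abb is R's built-in vector of English month abbreviations
month_num <- match(substr(labels, 1, 3), toupper(month.abb))
year <- sub("^[A-Z]+", "", labels)
iso <- sprintf("%s-%02d", year, month_num)

sort(labels)  # alphabetical order, not chronological
sort(iso)     # chronological order
```

Because the year and a zero-padded month number come first, plain character sorting of the `iso` labels agrees with chronological order.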
        

        Data

        library(data.table)
        wide <- data.table(
        readr::read_table(
        "  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010     REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010     COSTSDEC2016
        COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
        COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
        COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
        COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
        COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
        COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900"
        ))
        

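        【Comments】:

          【Solution 5】:

          I think the most explicit way to reshape wide to long in R (i.e., without having to rename variables) is to use the base R reshape() function and specify the varying columns you want to "stack" as a list. See this blog post.

          I'll use the data from JMT2080AD's answer and set the seed with set.seed(789).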

          ### Create a list of the variables you want to reshape/stack
          reshape.vars <- list(c("revenuesJan2010",   "revenuesFeb2010",  "revenuesMar2010",  "revenuesApr2010"), # revenues
                               c("costJan2010",   "costFeb2010",  "costMar2010",  "costApr2010")) # cost 
          ### reshape wide to long
          reshape(yourData,                      #dataframe
                  direction="long",             #wide to long
                  varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
                  timevar="date",              #the repeated measures times
                  v.names=c("revenues", "cost")) #the repeated measures names
          
          #     company   product date   revenues cost id
          # 1.1 Company A Product 1    1     2250 1574  1
          # 2.1 Company A Product 2    1      734 1793  2
          # 3.1 Company B Product 3    1      530 1282  3
          # 4.1 Company B Product 4    1     1979 1741  4
          # 5.1 Company B Product 5    1     1730 2558  5
          # 6.1 Company C Product 6    1      550 1757  6
          # 1.2 Company A Product 1    2     1932 1048  1
          #...
          # 5.3 Company B Product 5    3      890 1103  5
          # 6.3 Company C Product 6    3     2113 2469  6
          # 1.4 Company A Product 1    4     2426 2382  1
          # 2.4 Company A Product 2    4      778 2995  2
          # 3.4 Company B Product 3    4     1359  989  3
          # 4.4 Company B Product 4    4     1618  912  4
          # 5.4 Company B Product 5    4      895 2109  5
          # 6.4 Company C Product 6    4     1258 2803  6
          

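          With the list approach:

          • You don't have to rename your variables
          • Because the variables you want to create are defined explicitly in the list, there is no room for reshape() to guess wrongly which variables should be stacked

          I find that even with 100+ variables to reshape, where renaming them could be tedious, creating the list of varying variables with copy/paste doesn't take that long.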
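
If copy/paste becomes unwieldy, the list can also be built from the column names. The following is a rough sketch rather than part of the original answer: it assumes every stub has a column for every time point and that the columns appear in the same time order for each stub. The small `yourData` below is a stand-in with made-up values, and `stubs` and `times` are names introduced for the example.

```r
# Small stand-in for the wide data (hypothetical values)
yourData <- data.frame(company = c("Company A", "Company B"),
                       product = c("Product 1", "Product 2"),
                       revenuesJan2010 = c(100, 200), revenuesFeb2010 = c(110, 210),
                       costJan2010 = c(50, 60), costFeb2010 = c(55, 65))

stubs <- c("revenues", "cost")

# One character vector of column names per stub, in the order they appear
reshape.vars <- lapply(stubs, function(s) {
  grep(paste0("^", s), names(yourData), value = TRUE)
})

# Time labels taken from the first stub's column names
times <- sub(paste0("^", stubs[1]), "", reshape.vars[[1]])

long <- reshape(yourData, direction = "long", varying = reshape.vars,
                v.names = stubs, timevar = "date", times = times)
long
```

Printing `reshape.vars` before reshaping makes it easy to confirm that each element lists its columns in the same time order.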

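          【Comments】:

            【Solution 6】:

            As a Stata-to-R convert who loved reshape in Stata, I find tidyr::gather and tidyr::spread very intuitive. gather is basically reshape long, and spread is reshape wide.

            The following code would change your data into the shape you want: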

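            library(tidyr)  # gather(), separate(), spread(), and %>%

            new_data <- 
            gather(data = your_data_frame,   # placeholder name for your wide data
                   key = var_holder,
                   value = val_holder,
                   -company,
                   -product) 
            
            # insert a separator between the stub and the date, e.g. REVENUES_Jan2010
            new_data$var_holder <- sub("^REVENUES", "REVENUES_", new_data$var_holder)
            new_data$var_holder <- sub("^COSTS", "COSTS_", new_data$var_holder)
            
            new_data <- 
                separate(data = new_data,
                         col = var_holder,
                         into = c("var", "date")) %>%
                spread(key = var,
                       value = val_holder)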
            

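            Done!

            gather works by taking all the variables you specify (or, as here, all the ones you don't exclude; note the two variables preceded by a "-" sign) and putting their names under a single variable whose name is given by "key = ..." (creating new rows). It then takes the values belonging to those variables and puts them under a single variable whose name is given by "value = ...".

            spread does the opposite. Hope this helps!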

            【Comments】:

              【Solution 7】:

              An option using the development version of tidyr (version '0.8.3.9000')

              library(dplyr)
              library(tidyr)
              library(stringr)
              library(zoo)
              library(readr)
              
              df1 %>% 
                 rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
                 pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>% 
                 mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
              # A tibble: 24 x 5
              #   COMPANY   PRODUCT   DATE     REVENUES COSTS
              #   <chr>     <chr>     <chr>       <dbl> <dbl>
              # 1 COMPANY A PRODUCT 1 Jan-2010     6400  8500
              # 2 COMPANY A PRODUCT 1 Feb-2010    11050 10400
              # 3 COMPANY A PRODUCT 1 Mar-2010     6550  9100
              # 4 COMPANY A PRODUCT 1 Dec-2016    10600  9850
              # 5 COMPANY A PRODUCT 2 Jan-2010     2700  2850
              # 6 COMPANY A PRODUCT 2 Feb-2010     3000  2400
              # 7 COMPANY A PRODUCT 2 Mar-2010     2800  3100
              # 8 COMPANY A PRODUCT 2 Dec-2016     3800  3250
              # 9 COMPANY B PRODUCT 3 Jan-2010     5900  4200
              #10 COMPANY B PRODUCT 3 Feb-2010     4150  6100
              # … with 14 more rows
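
As a variation on the above, the renaming step can probably be skipped by giving pivot_longer() a `names_pattern` with two capture groups instead of `names_sep`. This sketch assumes a tidyr release that ships pivot_longer() (1.0.0 or later); `wide` is a made-up two-row stand-in for df1, and the dates are left as raw labels.

```r
library(tidyr)

# Two-row stand-in for df1 (hypothetical subset of the question's columns)
wide <- data.frame(COMPANY = c("COMPANY A", "COMPANY B"),
                   PRODUCT = c("PRODUCT 1", "PRODUCT 3"),
                   REVENUESJAN2010 = c(6400, 5900), REVENUESFEB2010 = c(11050, 4150),
                   COSTSJAN2010 = c(8500, 4200), COSTSFEB2010 = c(10400, 6100))

# First group becomes the output column name (".value"), second group the DATE
long <- pivot_longer(wide, cols = -c(COMPANY, PRODUCT),
                     names_to = c(".value", "DATE"),
                     names_pattern = "^(REVENUES|COSTS)(.*)$")
long
```

The ".value" entry tells pivot_longer() to use the first captured group as the name of the value column, so REVENUES and COSTS end up side by side.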
              

              Data

              df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B", 
              "COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1", 
              "PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
              ), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050, 
              3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800, 
              5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750, 
              650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850, 
              18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950), 
                  COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850, 
                  3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df", 
              "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
                  cols = list(COMPANY = structure(list(), class = c("collector_character", 
                  "collector")), PRODUCT = structure(list(), class = c("collector_character", 
                  "collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double", 
                  "collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double", 
                  "collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double", 
                  "collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double", 
                  "collector")), COSTSJAN2010 = structure(list(), class = c("collector_double", 
                  "collector")), COSTSFEB2010 = structure(list(), class = c("collector_double", 
                  "collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double", 
                  "collector")), COSTSDEC2016 = structure(list(), class = c("collector_double", 
                  "collector"))), default = structure(list(), class = c("collector_guess", 
                  "collector")), skip = 1), class = "col_spec"))
              

              【Comments】:
