【Title】: How to subset data frame by date and perform multiple operations in R?
【Posted】: 2020-06-26 21:18:59
【Question】:

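I receive CSV reports every day. Each report has the same set of variables but covers a different date. I would like to run some simple analysis by date and save the results. I think a for loop could do the job, but I only know the basics. Ideally, I would run the script just once a month and get the results. Any guidance or advice is appreciated.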

Say I have two CSV reports in a folder:

#File 1 - 20200624.csv
Date        Market  Salesman    Product Quantity    Price   Cost
6/24/2020   A       MF          Apple   20          1       0.5
6/24/2020   A       RP          Apple   15          1       0.5
6/24/2020   A       RP          Banana  20          2       0.5
6/24/2020   A       FR          Orange  20          3       0.5
6/24/2020   B       MF          Apple   20          1       0.5
6/24/2020   B       RP          Banana  20          2       0.5

#File 2 - 20200625.csv
Date        Market  Salesman    Product Quantity    Price   Cost
6/25/2020   A       MF          Apple   10          1       0.6
6/25/2020   A       MF          Banana  15          1       0.6
6/25/2020   A       RP          Banana  10          2       0.6
6/25/2020   A       FR          Orange  15          3       0.6
6/25/2020   B       MF          Apple   20          1       0.6
6/25/2020   B       RP          Banana  20          2       0.6

I use the following code to import all the files into R:

library(readr)
library(dplyr)

#Import files
files <- list.files(path = "~/JuneReports", 
                    pattern = "\\.csv$", full.names = TRUE)
tbl <- sapply(files, read_csv, simplify=FALSE) %>% 
  bind_rows(.id = "id")
#Remove the "id" column
tbl2 <- tbl[,-1]
#Subset the data frame to get only Market A, as Market B is irrelevant.
tbl3 <- subset(tbl2, Market == "A")
head(tbl3)
# A tibble: 6 x 7
  Date      Market Salesman Product Quantity Price  Cost
  <chr>     <chr>  <chr>    <chr>      <dbl> <dbl> <dbl>
1 6/24/2020 A      MF       Apple         20     1   0.5
2 6/24/2020 A      RP       Apple         15     1   0.5
3 6/24/2020 A      RP       Banana        20     2   0.5
4 6/24/2020 A      FR       Orange        20     3   0.5
5 6/25/2020 A      MF       Apple         10     1   0.6
6 6/25/2020 A      MF       Banana        15     1   0.6
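
The `head()` output above shows `Date` stored as `<chr>`. Subsetting or ordering by date is usually easier once the column is a real `Date`. A minimal base R sketch, assuming the month/day/year layout seen in the sample rows (the small `tbl3` here is a stand-in built by hand, not the imported data):

```r
# Stand-in for the imported tibble; only the columns needed for the illustration.
tbl3 <- data.frame(
  Date     = c("6/24/2020", "6/24/2020", "6/25/2020"),
  Quantity = c(20, 15, 10),
  stringsAsFactors = FALSE
)

# Convert the character column to Date, assuming month/day/year text.
tbl3$Date <- as.Date(tbl3$Date, format = "%m/%d/%Y")

# Subset a single day by comparing against a Date value.
june24 <- tbl3[tbl3$Date == as.Date("2020-06-24"), ]
print(june24)
```

With `readr`, the same conversion could also be requested at import time through `col_types`, if that fits the workflow better.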

Here is the result I would like to get:

Date        Market  Revenue Total Cost  Apples Sold Bananas Sold    Oranges Sold
6/24/2020   A       135     37.5        35          20              20
6/25/2020   A       90      30          15          25              15

#Revenue = sumproduct(Quantity, Price)
#Total Cost = sumproduct(Quantity, Cost)
#Apples/Bananas/Oranges Sold = sum of Quantity where Product == "Apple"/"Banana"/"Orange"
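
The spreadsheet-style definitions above map onto plain vector arithmetic in R: a "sumproduct" is `sum(x * y)`. The following is a base R sketch, not the accepted answer's method; `sales` is retyped from the Market A rows of the two sample files, and `summarise_day` is a hypothetical helper name. On the rows as posted, the 6/25 apple total comes out as 10.

```r
# Market A rows from both sample files, typed in by hand for the illustration.
sales <- data.frame(
  Date     = rep(c("6/24/2020", "6/25/2020"), each = 4),
  Product  = c("Apple", "Apple", "Banana", "Orange",
               "Apple", "Banana", "Banana", "Orange"),
  Quantity = c(20, 15, 20, 20, 10, 15, 10, 15),
  Price    = c(1, 1, 2, 3, 1, 1, 2, 3),
  Cost     = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6, 0.6, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse one day's rows into a single summary row.
summarise_day <- function(d) {
  data.frame(
    Date        = d$Date[1],
    Revenue     = sum(d$Quantity * d$Price),  # sumproduct(Quantity, Price)
    TotalCost   = sum(d$Quantity * d$Cost),   # sumproduct(Quantity, Cost)
    ApplesSold  = sum(d$Quantity[d$Product == "Apple"]),
    BananasSold = sum(d$Quantity[d$Product == "Banana"]),
    OrangesSold = sum(d$Quantity[d$Product == "Orange"])
  )
}

# Split by date, summarise each piece, and stack the one-row results.
daily <- do.call(rbind, lapply(split(sales, sales$Date), summarise_day))
rownames(daily) <- NULL
print(daily)
```

No explicit for loop is needed here: `split()` plus `lapply()` visits each date once.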

【Comments】:

  • You can use %*%
  • @akrun Could you provide more details?
  • My solution's output is based on the head data you showed

Tags: r for-loop


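【Solution 1】:

We group by 'Date' and 'Market', then compute the sum of the products of 'Quantity' with 'Price' and with 'Cost' inside a second group_by. That second call also adds 'Product' and uses .add = TRUE so the existing groups are kept. Finally we take the sum of 'Quantity' and reshape to 'wide' format with pivot_wider.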

library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
    group_by(Date, Market) %>% 
    group_by(Revenue = c(Quantity %*% Price), 
             TotalCost = c(Quantity %*% Cost),
             Product, .add = TRUE) %>% 
    summarise(Sold = sum(Quantity)) %>% 
    pivot_wider(names_from = Product, values_from = Sold)
# A tibble: 2 x 7
# Groups:   Date, Market, Revenue, TotalCost [2]
#  Date      Market Revenue TotalCost Apple Banana Orange
#  <chr>     <chr>    <dbl>     <dbl> <int>  <int>  <int>
#1 6/24/2020 A          135      37.5    35     20     20
#2 6/25/2020 A           25      15      10     15     NA

Data

df1 <- structure(list(Date = c("6/24/2020", "6/24/2020", "6/24/2020", 
"6/24/2020", "6/25/2020", "6/25/2020"), Market = c("A", "A", 
"A", "A", "A", "A"), Salesman = c("MF", "RP", "RP", "FR", "MF", 
"MF"), Product = c("Apple", "Apple", "Banana", "Orange", "Apple", 
"Banana"), Quantity = c(20L, 15L, 20L, 20L, 10L, 15L), Price = c(1L, 
1L, 2L, 3L, 1L, 1L), Cost = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

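The question also asks for something that can be run once a month over a whole folder. One way to wrap the import and the summary into a single function is sketched below. It is an assumption-laden illustration rather than the answer's code: `summarise_reports` and its arguments `dir` and `keep_market` are hypothetical names, and the column names are taken from the sample files.

```r
library(readr)
library(dplyr)
library(tidyr)

# Hypothetical helper: read every CSV in `dir`, keep one market,
# and return one summary row per Date.
summarise_reports <- function(dir, keep_market = "A") {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

  sales <- lapply(files, read_csv,
                  col_types = cols(Date = col_character())) %>%
    bind_rows() %>%
    filter(Market == keep_market)

  # Revenue and total cost per date.
  totals <- sales %>%
    group_by(Date, Market) %>%
    summarise(Revenue   = sum(Quantity * Price),
              TotalCost = sum(Quantity * Cost),
              .groups   = "drop")

  # Units sold per product, one column per product; 0 where nothing was sold.
  sold <- sales %>%
    count(Date, Market, Product, wt = Quantity, name = "Sold") %>%
    pivot_wider(names_from = Product, values_from = Sold,
                values_fill = list(Sold = 0))

  left_join(totals, sold, by = c("Date", "Market"))
}
```

A monthly run could then look like `result <- summarise_reports("~/JuneReports")` followed by `write_csv(result, "june_summary.csv")`, where the output file name is only an example.
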
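【Comments】:

  • Awesome! However, I think you want add = TRUE instead of .add = TRUE
  • @KJM In dplyr 1.0.0: group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
  • My bad! Thanks for the clarification
  • @KJM There are some changes with every release. I agree it gets confusing when you are on a different version. Sorry, I forgot to mention the version
  • @KJM It may not work, because Quantity %*% Price has to be evaluated within each Date and Market group. The code you used would do the calculation on the whole column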
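
The last comment's point about grouping can be checked directly. The sketch below reuses the Quantity and Price values from `df1` and compares `%*%` evaluated on the whole column with the same expression evaluated per Date and Market group; `ungrouped` and `grouped` are illustrative names only.

```r
library(dplyr)

df1 <- data.frame(
  Date     = c("6/24/2020", "6/24/2020", "6/24/2020",
               "6/24/2020", "6/25/2020", "6/25/2020"),
  Market   = "A",
  Quantity = c(20, 15, 20, 20, 10, 15),
  Price    = c(1, 1, 2, 3, 1, 1)
)

# Without grouping, %*% sees the full columns: one total for all rows.
ungrouped <- df1 %>%
  mutate(Revenue = c(Quantity %*% Price))

# With grouping, the same expression is evaluated once per Date/Market group.
grouped <- df1 %>%
  group_by(Date, Market) %>%
  mutate(Revenue = c(Quantity %*% Price)) %>%
  ungroup()

unique(ungrouped$Revenue)  # 160
unique(grouped$Revenue)    # 135 25
```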