【Question title】: Monthly to Daily Value - R
【Posted】: 2022-01-12 18:18:10
【Question】:

I have a data.frame df containing monthly data:

Company  Store    Brand    Month  Sales  Budget  Quantity  Year
A        Store A  Brand A  Jun    $300   $300    3000      2022
A        Store A  Brand A  Jul    $300   $300    3000      2022
A        Store A  Brand A  Aug    $300   $300    3000      2022
A        Store A  Brand A  Sep    $300   $300    3000      2022

I would like the average value for each day, for example (Jun has 30 days, so $300 in sales / 30 days = $10 per day):

Company  Store    Brand    Month  Sales  Budget  Quantity  Date
A        Store A  Brand A  Jun    $10    $10     100       01-06-2022
A        Store A  Brand A  Jun    $10    $10     100       02-06-2022
A        Store A  Brand A  Jun    $10    $10     100       03-06-2022
A        Store A  Brand A  Jun    $10    $10     100       04-06-2022
A        Store A  Brand A  Jun    $10    $10     100       05-06-2022
A        Store A  Brand A  Jun    $10    $10     100       06-06-2022
A        Store A  Brand A  Jun    $10    $10     100       07-06-2022

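I don't know which function I could use for this.

Thanks!

【Comments】:

  • Which language? Ago is not a recognized month. Perhaps you meant Aug.
  • Yes! Sorry, my main language is Spanish.
  • @Onyambu, based on web.library.yale.edu/cataloging/months I'd guess it is Spanish or Portuguese.

Tags: r date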


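【Solution 1】:

A few things up front:

  • I'm inferring that your locale is set to Spanish (based on "Ago", which I assume is August). To run this code, I first set my own locale so that the month names parse correctly. You may not need this, but others (using other languages) may need this or something similar to test the code.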

    prevlocale <- Sys.getlocale("LC_TIME")
    Sys.setlocale("LC_TIME", "Spanish")
    # [1] "Spanish_Spain.1252"
    format(as.Date("2022-08-01"), format = "%b")
    # [1] "ago"
    ### when done and you want to return to your local locale
    Sys.setlocale("LC_TIME", prevlocale)
    
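  • Your number columns are not numeric. I'll convert them to numeric, since the math will not work otherwise; you can reformat them as $-strings afterwards if you need to.

  • We cannot do this for all rows with simple one-step logic, because each month has a different number of days. The first challenge is therefore determining how many days each month has. There are several ways to do this (including the lubridate package); I'll offer a base-R solution (using as.POSIXlt) that handles it and returns the correct vector of dates.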

    yrmon2days <- function(yr, mon) {
      stopifnot(length(yr) == 1L, length(mon) == 1L)
      day2 <- day1 <- as.POSIXlt(as.Date(paste(yr, mon, "01", sep = "-"), format = "%Y-%b-%d"))
      day2$mon <- day2$mon + 1L
      seq(day1, day2-1, by = "day")
    }
    yrmon2days(2022, "Feb")
    #  [1] "2022-02-01 UTC" "2022-02-02 UTC" "2022-02-03 UTC" "2022-02-04 UTC" "2022-02-05 UTC" "2022-02-06 UTC" "2022-02-07 UTC"
    #  [8] "2022-02-08 UTC" "2022-02-09 UTC" "2022-02-10 UTC" "2022-02-11 UTC" "2022-02-12 UTC" "2022-02-13 UTC" "2022-02-14 UTC"
    # [15] "2022-02-15 UTC" "2022-02-16 UTC" "2022-02-17 UTC" "2022-02-18 UTC" "2022-02-19 UTC" "2022-02-20 UTC" "2022-02-21 UTC"
    # [22] "2022-02-22 UTC" "2022-02-23 UTC" "2022-02-24 UTC" "2022-02-25 UTC" "2022-02-26 UTC" "2022-02-27 UTC" "2022-02-28 UTC"
    

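    This is currently not vectorized; it could be, but other complexities in the surrounding data make that step a bit overkill for now.

  • I tried dplyr::group_by and grouping in general, but while that makes good sense, I don't want to assume one row per year/month. With that precaution, it becomes clear that we need to operate row by row, so that we never (though it would not happen with this data) return something other than 1 row per group.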
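As a rough illustration of the "could be vectorized" remark, here is a minimal base-R sketch that only counts the days in each month. `days_in_month` is a hypothetical helper, not part of the answer, and it assumes the month is already available as a number (1-12) rather than as a locale-dependent abbreviation.

```r
# Hypothetical helper (not from the answer): number of days in each month,
# vectorized over `yr` and `mon`. `mon` is assumed to be numeric (1-12),
# so the result does not depend on the LC_TIME locale.
days_in_month <- function(yr, mon) {
  # Recycle both inputs to a common length.
  n   <- max(length(yr), length(mon))
  yr  <- rep_len(as.integer(yr), n)
  mon <- rep_len(as.integer(mon), n)
  first <- as.Date(sprintf("%04d-%02d-01", yr, mon))
  # First day of the following month; December rolls over into January.
  next_yr  <- ifelse(mon == 12L, yr + 1L, yr)
  next_mon <- ifelse(mon == 12L, 1L, mon + 1L)
  nxt <- as.Date(sprintf("%04d-%02d-01", next_yr, next_mon))
  as.integer(nxt - first)
}

days_in_month(2022, 6:9)
# [1] 30 31 31 30
days_in_month(c(2023, 2024), 2)
# [1] 28 29
```

With such a count, a monthly value could be divided directly (for example `300 / days_in_month(2022, 6)`), although expanding each row into one row per day still needs a per-row step like the ones shown in this answer.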

dplyr

library(dplyr)
dat %>%
  mutate(across(c(Sales, Budget), ~ as.numeric(gsub("\\$", "", .)))) %>%
  rowwise() %>%
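  # Note: dplyr >= 1.1.0 deprecates summarize() calls that return several rows
  # per group; reframe() is the suggested replacement there.
  summarize(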
    Date = yrmon2days(Year, Month),
    Company, Store, Brand, Year, Month,
    across(c(where(is.numeric), -Year), ~ . / length(Date))
  )
# # A tibble: 122 x 9
#    Date                Company Store   Brand    Year Month Sales Budget Quantity
#    <dttm>              <chr>   <chr>   <chr>   <int> <chr> <dbl>  <dbl>    <dbl>
#  1 2022-06-01 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  2 2022-06-02 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  3 2022-06-03 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  4 2022-06-04 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  5 2022-06-05 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  6 2022-06-06 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  7 2022-06-07 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  8 2022-06-08 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
#  9 2022-06-09 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
# 10 2022-06-10 00:00:00 A       Store A Brand A  2022 Jun      10     10      100
# # ... with 112 more rows

base R

dat[c("Sales","Budget")] <- lapply(dat[c("Sales","Budget")], function(z) as.numeric(gsub("\\$", "", z)))
isnum <- sapply(dat, is.numeric)
isnum[which(colnames(dat) == "Year")] <- FALSE
out <- do.call(rbind, lapply(seq_len(nrow(dat)), function(rn) {
  Date <- yrmon2days(dat$Year[rn], dat$Month[rn])
  Nums <- lapply(dat[rn,isnum], `/`, length(Date))
  suppressWarnings( # "row names were found from a short variable and have been discarded"
    cbind(dat[rn,!isnum], Nums, data.frame(Date = Date))
  )
}))
head(out)
#   Company   Store   Brand Month Year Sales Budget Quantity       Date
# 1       A Store A Brand A   Jun 2022    10     10      100 2022-06-01
# 2       A Store A Brand A   Jun 2022    10     10      100 2022-06-02
# 3       A Store A Brand A   Jun 2022    10     10      100 2022-06-03
# 4       A Store A Brand A   Jun 2022    10     10      100 2022-06-04
# 5       A Store A Brand A   Jun 2022    10     10      100 2022-06-05
# 6       A Store A Brand A   Jun 2022    10     10      100 2022-06-06
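If the daily values should look like the layout requested in the question ($-strings and day-month-year dates), one possible way to format them back is sketched below. `out_demo` is a small stand-in built for illustration, with values copied from the output above, and the two-decimal format is an assumption rather than something the question specifies.

```r
# Stand-in for the first rows of the result (values copied from the output above).
out_demo <- data.frame(
  Sales    = c(10, 10),
  Budget   = c(10, 10),
  Quantity = c(100, 100),
  Date     = as.Date(c("2022-06-01", "2022-06-02"))
)

# Put the dollar sign back; two decimals is an assumed display choice that
# also copes with months where the division is not exact (e.g. 300 / 31).
out_demo$Sales  <- sprintf("$%.2f", out_demo$Sales)
out_demo$Budget <- sprintf("$%.2f", out_demo$Budget)

# Day-month-year text; the column becomes character and is no longer a Date.
out_demo$Date <- format(out_demo$Date, "%d-%m-%Y")

out_demo$Sales
# [1] "$10.00" "$10.00"
out_demo$Date
# [1] "01-06-2022" "02-06-2022"
```

It is probably best to do this formatting as the very last step, since the formatted columns can no longer be used for arithmetic or date operations.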

Data

dat <- structure(list(Company = c("A", "A", "A", "A"), Store = c("Store A", "Store A", "Store A", "Store A"), Brand = c("Brand A", "Brand A", "Brand A", "Brand A"), Month = c("Jun", "Jul", "Ago", "Sep"), Sales = c("$300", "$300", "$300", "$300"), Budget = c("$300", "$300", "$300", "$300"), Quantity = c(3000L, 3000L, 3000L, 3000L), Year = c(2022L, 2022L, 2022L, 2022L)), class = "data.frame", row.names = c(NA, -4L))

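【Comments】:

  • When I use either option (dplyr or base), I get this error: "Error in seq.int(0, to0 - from, by) : 'to' must be a finite number". I tried with my own data frame and with your sample data, and I get the same error.
  • Check to make sure all of your months are recognized. What does unique(setdiff(tolower(dat$Month), tolower(format(as.Date(paste(2022, 1:12, 1, sep="-")), format = "%b")))) return?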
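
A likely cause of that error is that the "%b" abbreviation is not recognized under the current LC_TIME locale, so as.Date() returns NA and seq() receives a non-finite endpoint. One way to sidestep the locale entirely is an explicit lookup table, sketched here. `es_months` and `month_number` are hypothetical names, and the abbreviations listed are an assumption about how the months are spelled in the data.

```r
# Hypothetical lookup: Spanish three-letter month abbreviations, lower case,
# in calendar order (an assumption about the spelling used in the data).
es_months <- c("ene", "feb", "mar", "abr", "may", "jun",
               "jul", "ago", "sep", "oct", "nov", "dic")

# Map abbreviations such as "Ago" to month numbers; unknown labels become NA.
month_number <- function(mon) match(tolower(mon), es_months)

month_number(c("Jun", "Jul", "Ago", "Sep"))
# [1] 6 7 8 9

# Build the first day of each month without relying on "%b" or the locale.
as.Date(sprintf("%04d-%02d-01", 2022L, month_number(c("Jun", "Ago"))))
# [1] "2022-06-01" "2022-08-01"
```

Any label that comes back as NA from `month_number()` is one the table does not cover, which is the same kind of check the setdiff() call above performs against the current locale.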