Using Countif on Dates in R (answers)

【Question title】: Using Countif on Dates in R
【Posted】: 2017-08-07 22:39:46
【Question description】:

I have the following table:

**A**  | **B**  | **C** |**D** |
:----: | :----: | :----:|:----:|
1/1/17 | 3/1/17 |4/1/17 | H    |
1/1/17 | 3/1/17 |4/1/17 | H    |
2/1/17 | 4/1/17 |5/1/17 | V    |
3/1/17 | 5/1/17 |6/1/17 | V    |
4/1/17 | 5/1/17 |7/1/17 | H    |
4/1/17 | 6/1/17 |7/1/17 | H    |

Using R code, I want to produce the result shown in the table below:

 1. A column with Unique list of dates from columns A,B & C above
 2. A count of dates <= (less than or equal to) the unique 
    dates column value in each of the columns A,B & C from above table. 
 3. Filtered by column D value of 'H' only

Result

**Unique Dates** | **Count of A** | **Count of B** | **Count of C** |
:----: | :----: | :----: | :----: |
1/1/17 | 2 | 0 | 0 |
2/1/17 | 2 | 0 | 0 |
3/1/17 | 2 | 2 | 0 |
4/1/17 | 4 | 2 | 2 |
5/1/17 | 4 | 3 | 2 |
6/1/17 | 4 | 4 | 2 |
7/1/17 | 4 | 0 | 4 |

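【Question comments】:

Post the output of dput(.) for your example data. At the moment the headers and dates are not in a standard format, which suggests you have not yet done your homework on proper data entry.
Hi @dhu, you have received some answers below. If one of them solved your problem, please consider accepting it as the answer by clicking the check mark on the left. That lets the community know the answer works and that your question can be closed.

Tags: r date dataframe spotfire countif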

【Solution 1】:

Your data as a reproducible example:

library(lubridate)
df <- data.frame(A=dmy(c("1/1/17","1/1/17","2/1/17","3/1/17","4/1/17","4/1/17")),
             B=dmy(c("3/1/17","3/1/17","4/1/17","5/1/17","5/1/17","6/1/17")),
             C=dmy(c("4/1/17","4/1/17","5/1/17","6/1/17","7/1/17","7/1/17")),
             D=c("H","H","V","V","H","H"),stringsAsFactors=FALSE)

A tidyverse and zoo solution:

library(tidyverse)
library(zoo)
df %>% 
  filter(D=="H") %>%             # uses only rows where column D == H
  gather(Date, value, -D) %>%    # gather Dates into long format, ignore column D
  select(-D) %>%                 # unselect column D
  group_by(Date, value) %>%      # group by Dates
  summarise(Count = length(value)) %>%    # Count occurrence of Date
  arrange(Date) %>%                       # Sort Date
  mutate(Count = cumsum(Count)) %>%       # cumulative sum of Dates (<=)
  spread(Date, Count) %>%                 # spread Count into wide format
  mutate_at(vars(A:C), na.locf, na.rm=FALSE) %>%   # fill NAs forward
  replace(is.na(.), 0)                         # fill remaining NA with 0

Output:

       value     A     B     C
1 2017-01-01     2     0     0
2 2017-01-03     2     2     0
3 2017-01-04     4     2     2
4 2017-01-05     4     3     2
5 2017-01-06     4     4     2
6 2017-01-07     4     4     4

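Note that 2017-01-02 is missing: it occurs only in a row with D == "V", so it is no longer among the dates once the data has been filtered to D == "H".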
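If 2017-01-02 should be listed as well, one option is to collect the unique dates before filtering on D and then count directly with `<=`. The following is only a minimal base R sketch under that assumption. The names `date_cols`, `unique_dates`, `h_rows` and `result` are made up for illustration, and the dates are assumed to be day/month/year.

```r
# Sample data from the question; dates assumed to be day/month/year
parse_dmy <- function(x) as.Date(x, format = "%d/%m/%y")
df <- data.frame(
  A = parse_dmy(c("1/1/17", "1/1/17", "2/1/17", "3/1/17", "4/1/17", "4/1/17")),
  B = parse_dmy(c("3/1/17", "3/1/17", "4/1/17", "5/1/17", "5/1/17", "6/1/17")),
  C = parse_dmy(c("4/1/17", "4/1/17", "5/1/17", "6/1/17", "7/1/17", "7/1/17")),
  D = c("H", "H", "V", "V", "H", "H"),
  stringsAsFactors = FALSE
)

date_cols <- c("A", "B", "C")

# Unique dates across A, B and C, taken from ALL rows (before filtering on D)
unique_dates <- sort(unique(do.call(c, unname(as.list(df[date_cols])))))

# Keep only the rows where D == "H"
h_rows <- df[df$D == "H", date_cols]

# For each column, count the H rows whose date is <= each unique date
counts <- sapply(h_rows, function(col) {
  sapply(unique_dates, function(d) sum(col <= d))
})

result <- data.frame(Unique_Dates = unique_dates, counts)
names(result)[-1] <- paste0("Count_of_", date_cols)
result
```

Because each count is computed directly for every unique date, no forward filling of NAs is needed in this variant.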

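【Comments】:

【Solution 2】:

At first glance, the question looks like a simple reshaping task. A closer look shows that the requirements are not so easy to meet if we want to follow the OP's specification exactly:

A column with the unique list of dates from columns A, B & C above

A count of dates <= (less than or equal to) the unique dates column value in each of the columns A, B & C

Filtered by column D value of 'H' only

The data.table solution below reshapes the data from wide to long format, performs all aggregations by group, including filling in the combinations that are missing in long format, and finally reshapes back to wide format. Additional explanation is given in the comments in the code.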

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table
setDT(DT)[
  # reshape from wide to long format, 
  # thereby renaming one column as requested
  , melt(.SD, id.vars = "D", value.name = "Unique_Dates")][
    # convert dates from character to class Date
    , Unique_Dates := lubridate::dmy(Unique_Dates)][
      # count occurrences by variable & date, 
      # set key & order by variable & date for subsequent cumsum & join
      , .N, keyby = .(D, variable, Unique_Dates)][
        # compute cumsum for each variable along unique dates
        , N := cumsum(N), by = .(D, variable)][
          # join with all possible combinations of D, variables and dates
          # use rolling join to fill missing values
          CJ(D, variable, Unique_Dates, unique = TRUE), roll = Inf][
            # replace remaining NAs
            is.na(N), N := 0L][
              # finally, reshape selected rows from long to wide
              D == "H", dcast(.SD, Unique_Dates ~ paste0("Count_of_", variable))]

   Unique_Dates Count_of_A Count_of_B Count_of_C
1:   2017-01-01          2          0          0
2:   2017-01-02          2          0          0
3:   2017-01-03          2          2          0
4:   2017-01-04          4          2          2
5:   2017-01-05          4          3          2
6:   2017-01-06          4          4          2
7:   2017-01-07          4          4          4

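The columns are named according to the OP's expected result.
The result includes 2017-01-02, as expected, even though this date appears only in a row with D == "V", which is to be excluded from the final result.
A rolling join is used to fill the missing values instead of zoo::na.locf().

Data

In the question, the OP provided the sample data in a printed format that is hard to "scrape":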
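To see the rolling join in isolation, here is a small sketch with made-up data (the tables `counts` and `all_dates` and the result `filled` are illustrative only and not part of the solution above). With `roll = Inf`, each date in `all_dates` that has no exact match in `counts` takes the value of the most recent earlier date, which is the same effect as carrying the last observation forward.

```r
library(data.table)

# Cumulative counts that are known only for some dates (illustrative values)
counts <- data.table(
  date = as.Date(c("2017-01-01", "2017-01-04")),
  N    = c(2L, 4L),
  key  = "date"
)

# The complete set of dates we want a value for
all_dates <- data.table(
  date = seq(as.Date("2017-01-01"), as.Date("2017-01-05"), by = "day")
)

# Rolling join: dates without an exact match take the last earlier value
filled <- counts[all_dates, on = "date", roll = Inf]
filled
```

Dates that lie before the first known observation would still come back as NA, which is why the solution replaces the remaining NAs with 0 afterwards.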

library(data.table)
DT <- fread(
  "**A**  | **B**  | **C** |**D** |
  1/1/17 | 3/1/17 |4/1/17 | H    |
  1/1/17 | 3/1/17 |4/1/17 | H    |
  2/1/17 | 4/1/17 |5/1/17 | V    |
  3/1/17 | 5/1/17 |6/1/17 | V    |
  4/1/17 | 5/1/17 |7/1/17 | H    |
  4/1/17 | 6/1/17 |7/1/17 | H    |",
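  sep ="|", drop = 5L, stringsAsFactors = TRUE)
# strip the markdown asterisks from the column names;
# done in a separate step because DT does not exist yet while fread() is being evaluated
setnames(DT, stringr::str_replace_all(names(DT), "\\*", ""))
DT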

        A      B      C D
1: 1/1/17 3/1/17 4/1/17 H
2: 1/1/17 3/1/17 4/1/17 H
3: 2/1/17 4/1/17 5/1/17 V
4: 3/1/17 5/1/17 6/1/17 V
5: 4/1/17 5/1/17 7/1/17 H
6: 4/1/17 6/1/17 7/1/17 H
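For readers without data.table, the same printed table can probably be read with base R as well. This is only a sketch: the names `txt`, `raw` and `dat` are made up for illustration, and the dates are again assumed to be day/month/year.

```r
# The table as printed in the question
txt <- "**A**  | **B**  | **C** |**D** |
1/1/17 | 3/1/17 |4/1/17 | H    |
1/1/17 | 3/1/17 |4/1/17 | H    |
2/1/17 | 4/1/17 |5/1/17 | V    |
3/1/17 | 5/1/17 |6/1/17 | V    |
4/1/17 | 5/1/17 |7/1/17 | H    |
4/1/17 | 6/1/17 |7/1/17 | H    |"

# Read everything as character; the header row is handled manually below
raw <- read.table(text = txt, sep = "|", header = FALSE,
                  strip.white = TRUE, colClasses = "character")

# Keep the four real columns (the trailing "|" yields an empty extra field)
dat <- raw[-1, 1:4]

# Use the first row as column names, with the markdown asterisks removed
names(dat) <- unname(gsub("*", "", unlist(raw[1, 1:4]), fixed = TRUE))
rownames(dat) <- NULL

# Convert the date columns from character to class Date
dat[c("A", "B", "C")] <- lapply(dat[c("A", "B", "C")], as.Date, format = "%d/%m/%y")
dat
```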

【Comments】: