How to sum based on three different conditions in R: answers

【Question title】: How to sum based on three different conditions in R
【Posted】: 2019-12-23 05:09:42
【Question】:

Below is my data.

gcode code year   P  Q
1      101  2000  1  3
1      101  2001  2  4
1      102  2000  1  1
1      102  2001  4  5
1      102  2002  2  6
1      102  2003  6  5
1      103  1999  6  1
1      103  2000  4  2
1      103  2001  2  1
2      104  2000  1  3
2      104  2001  2  4
2      105  2001  4  5
2      105  2002  2  6
2      105  2003  6  5
2      105  2004  6  1
2      106  2000  4  2
2      106  2001  2  1

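gcode 1 has three different codes: 101, 102 and 103. All of them have data for the same years (2000 and 2001). I want to sum P and Q over those shared years and drop the rest of the data. I want to do the same for gcode 2.

How can I get a result like this?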

gcode  year   P       Q
1      2000   1+1+4   3+1+2
1      2001   2+4+2   4+5+1
2      2001   2+4+2   4+5+1

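【Question comments】:

Please remove the first row "1 2000 5 5"; gcode=1 has no data for year 2000, because code=102 has no data for 2000.
OK, thank you very much! Do you know how to do this quickly in R?
Please edit your question so that it shows exactly what you expect. Otherwise it is confusing.
Sorry, this is my first question. I made a silly mistake and apologize for it. I think it is clear now: for gcode=1, codes 101, 102 and 103 all have data for 2001, and the same holds for gcode=2.
Sorry, I am new here. I have now made a final change to my input and output. Thank you very much for your help!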

Tags: r group-by dplyr sum

【Solution 1】:

We can split the data by gcode, subset each piece to the years that are present for every code within that gcode, and then aggregate P and Q by gcode and year.

do.call(rbind, lapply(split(df, df$gcode), function(x) {
      aggregate(cbind(P, Q)~gcode+year, 
               subset(x, year %in% Reduce(intersect, split(x$year, x$code))), sum)
}))

#    gcode year P  Q
#1.1     1 2000 6  6
#1.2     1 2001 8 10
#2       2 2001 8 10
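The key step in the base R answer is Reduce(intersect, split(x$year, x$code)), which finds the years shared by every code. A minimal sketch of that step in isolation, using the years of gcode 1 copied from the question (the list name years_by_code is only for illustration):

```r
# Years recorded for each code within gcode 1 (values copied from the question's data)
years_by_code <- list(
  `101` = c(2000L, 2001L),
  `102` = c(2000L, 2001L, 2002L, 2003L),
  `103` = c(1999L, 2000L, 2001L)
)

# Reduce() applies intersect() pairwise: first 101 with 102, then that result with 103
common_years <- Reduce(intersect, years_by_code)
print(common_years)
# [1] 2000 2001
```

In the answer, split(x$year, x$code) builds this kind of list automatically for each gcode, and subset() then keeps only the rows whose year is in the intersection.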

Using dplyr, we can apply similar logic:

library(dplyr)
df %>%
  group_split(gcode) %>%
  purrr::map_df(. %>% 
                 group_by(year) %>% 
                 filter(n_distinct(code) == n_distinct(.$code)) %>% 
                 group_by(gcode, year) %>%
                 summarise_at(vars(P:Q), sum))

Data

df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L, 
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L, 
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L, 
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L, 
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L, 
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame", 
row.names = c(NA, -17L))

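【Comments】:

Yes! I have the answer now. Thanks again!
One more question: if I want to export the output to Excel, how should I do that for these two approaches?
@XUNZHANG You can store the above output in a variable, new_file <- do.call(rbind......rest of the code, then call write.csv(new_file, 'filename.csv', row.names = FALSE) and open the resulting file in Excel.
I see. Thanks again!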
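The export suggested in the comment above can be sketched as follows. The table is rebuilt by hand from the output shown in the answer so that the snippet runs on its own, and tempfile() is an assumption used only to avoid writing into the working directory; any file path would do.

```r
# Result table copied from the output shown in the answer
new_file <- data.frame(gcode = c(1L, 1L, 2L),
                       year  = c(2000L, 2001L, 2001L),
                       P     = c(6L, 8L, 8L),
                       Q     = c(6L, 10L, 10L))

# Write a CSV without the row-name column; Excel can open this file directly
out_path <- tempfile(fileext = ".csv")
write.csv(new_file, out_path, row.names = FALSE)

# Read the file back to confirm that the contents survived the round trip
check <- read.csv(out_path)
print(check)
```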

【Solution 2】:

An option using the data.table package:

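years <- DT[, {
    m <- min(year)
    # tabulate() ignores values below 1, so shift the years so the earliest one falls in bin 1
    ty <- tabulate(year - m + 1L)
    # keep the bins whose count equals the number of codes, shifted back to calendar years
    .(year = which(ty == uniqueN(code)) + m - 1L)
}, gcode]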

DT[years, on=.(gcode, year),
    by=.EACHI, .(P=sum(P), Q=sum(Q))]

Output:

   gcode year P  Q
1:     1 2000 6  6
2:     1 2001 8 10
3:     2 2001 8 10
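The counting step in this approach relies on tabulate(), which counts only positive integers. A minimal base R sketch of the idea for gcode 1, assuming each code lists a given year at most once; the years are shifted so that the earliest one maps to bin 1, and n_codes is hard-coded here for illustration:

```r
# Years for gcode 1, copied from the question's data
year <- c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 1999L, 2000L, 2001L)
n_codes <- 3L  # gcode 1 has codes 101, 102 and 103

m <- min(year)
# Shift so the earliest year lands in bin 1; bins then stand for 1999, 2000, ..., 2003
ty <- tabulate(year - m + 1L)
print(ty)
# [1] 1 3 3 1 1

# Bins whose count equals the number of codes, shifted back to calendar years
common_years <- which(ty == n_codes) + m - 1L
print(common_years)
# [1] 2000 2001
```

A year counted as many times as there are codes must appear under every code, which is why the comparison against the number of distinct codes selects the shared years.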

Data:

library(data.table)
DT <- fread("gcode code year   P  Q
1      101  2000  1  3
1      101  2001  2  4
1      102  2000  1  1
1      102  2001  4  5
1      102  2002  2  6
1      102  2003  6  5
1      103  1999  6  1
1      103  2000  4  2
1      103  2001  2  1
2      104  2000  1  3
2      104  2001  2  4
2      105  2001  4  5
2      105  2002  2  6
2      105  2003  6  5
2      105  2004  6  1
2      106  2000  4  2
2      106  2001  2  1")

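【Comments】:

Thank you very much for your reply. I will try your answer.

【Solution 3】:

I came up with the following solution. First, I counted how many times each year appears within each gcode. I also counted how many unique codes exist for each gcode. Then I joined the two results with left_join() and kept the rows where n_year and n_code have the same value. Next, I joined the original data frame, called mydf, and defined groups by gcode and year. Finally, I summed P and Q for each group.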

library(dplyr)

left_join(count(mydf, gcode, year, name = "n_year"),
          group_by(mydf, gcode) %>% summarize(n_code = n_distinct(code))) %>% 
filter(n_year == n_code) %>% 
left_join(mydf, by = c("gcode", "year")) %>% 
group_by(gcode, year) %>% 
summarize_at(vars(P:Q),
             .funs = list(~sum(.)))

#  gcode  year     P     Q
#  <int> <int> <int> <int>
#1     1  2000     6     6
#2     1  2001     8    10
#3     2  2001     8    10

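Another idea

Later, I revisited this question and came up with the following idea, which is much simpler. First, I defined groups by gcode and year. For each group, I counted how many data points exist using add_count(). Then I defined groups again, this time by gcode only. For each gcode group, I wanted the rows that satisfy n == n_distinct(code), where n is the column created by add_count(). If a value in n matches the number returned by n_distinct(), the year in that row exists for every code. Finally, I defined groups by gcode and year once more and summed the values in P and Q.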

group_by(mydf, gcode, year) %>% 
add_count() %>% 
group_by(gcode) %>% 
filter(n == n_distinct(code)) %>%
group_by(gcode, year) %>% 
summarize_at(vars(P:Q),
             .funs = list(~sum(.)))

# This is the same code in data.table.
setDT(mydf)[, check := .N, by = .(gcode, year)][,
            .SD[check == uniqueN(code)], by = gcode][,
            lapply(.SD, sum), .SDcols = P:Q, by = .(gcode, year)][]
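For readers without dplyr or data.table, the same count-and-compare logic can be sketched in base R with ave(). This is an illustrative equivalent, not part of the original answer; it rebuilds the question's data as a plain data frame named df and assumes each (gcode, code, year) combination appears at most once, which the count-based filters above also rely on.

```r
# The question's data, rebuilt as a plain data frame
df <- data.frame(
  gcode = rep(1:2, times = c(9, 8)),
  code  = c(101, 101, 102, 102, 102, 102, 103, 103, 103,
            104, 104, 105, 105, 105, 105, 106, 106),
  year  = c(2000, 2001, 2000, 2001, 2002, 2003, 1999, 2000, 2001,
            2000, 2001, 2001, 2002, 2003, 2004, 2000, 2001),
  P     = c(1, 2, 1, 4, 2, 6, 6, 4, 2, 1, 2, 4, 2, 6, 6, 4, 2),
  Q     = c(3, 4, 1, 5, 6, 5, 1, 2, 1, 3, 4, 5, 6, 5, 1, 2, 1)
)

# Rows per (gcode, year): how many codes report that year
n_rows <- ave(df$code, df$gcode, df$year, FUN = length)
# Distinct codes per gcode
n_codes <- ave(df$code, df$gcode, FUN = function(x) length(unique(x)))

# Keep the years reported by every code, then sum P and Q per gcode and year
res <- aggregate(cbind(P, Q) ~ gcode + year,
                 data = df[n_rows == n_codes, ], FUN = sum)
res <- res[order(res$gcode, res$year), ]
print(res)
```

If a code could list the same year twice, the row count would no longer equal the number of codes reporting that year, and an intersect-based filter would be the safer choice.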

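【Comments】:

Yes, I had the same logic as you, but I just could not get the output. Thank you very much!
One more question: if I want to export the output to Excel, how should I do that with your approach?
Do you mean that you want to save the result as an Excel file?
Glad to help. :)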