【Question title】: Sum up different columns for different groups
【Posted】: 2017-03-18 09:52:27
【Question】:

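I have a data frame containing information about the activities of certain organizations in different countries. The orga column contains the name of the organization, c1 to c4 are country columns holding the number of activities the organization carried out in that country, and home is the organization's home country. The values in home correspond to the numbers in the column names c1 to c4.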

orga <- c("AA", "AB", "AC", "BA", "BB", "BC", "BD")
c1 <- c(3,1,0,0,2,0,1)
c2 <- c(0,2,2,0,1,0,1)
c3 <- c(1,0,0,1,0,2,0)
c4 <- c(0,1,1,0,0,0,0)
home <- c(1,2,3,2,1,3,1)
df <- data.frame(orga, c1, c2, c3, c4, home)

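I now want to add an extra column, foreign, containing the total of each organization's foreign activities, that is, the sum over c1 to c4 excluding the column for the home country. So the function should not sum all the country columns, only those that are not the home country. For example, if home=1, c1 should be left out; if home=2, c2 should be left out, and so on.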

For the example data, foreign should look like this:

df$foreign <- c(1,2,3,1,1,0,1)

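Is there a way to sum columns per group while leaving out a different column for each group, and to add the sums to the data frame as a new column?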

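I have already looked at the group_by function from the dplyr package, as well as aggregate and tapply in base R, but could not come up with a solution. I would therefore greatly appreciate your help. Thanks!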

【Comments】:

Did your question get answered? If so, you can mark an answer as accepted.

Tags: r dataframe

【Solution 1】:

One approach using rowSums:

diag(as.matrix(rowSums(df[2:5])- df[2:5][df$home]))
#[1] 1 2 3 1 1 0 1
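
The one-liner above builds an n x n intermediate result and keeps only its diagonal, which can get expensive for a data frame with many rows. As a rough sketch (not part of the original answer), the same idea of "row total minus the home country's value" can be written with matrix indexing, so that only one value per row is looked up. The names country_cols, m and home_vals are made up for this illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  orga = c("AA", "AB", "AC", "BA", "BB", "BC", "BD"),
  c1   = c(3, 1, 0, 0, 2, 0, 1),
  c2   = c(0, 2, 2, 0, 1, 0, 1),
  c3   = c(1, 0, 0, 1, 0, 2, 0),
  c4   = c(0, 1, 1, 0, 0, 0, 0),
  home = c(1, 2, 3, 2, 1, 3, 1)
)

# Hypothetical helper names, used only in this sketch
country_cols <- c("c1", "c2", "c3", "c4")
m <- as.matrix(df[country_cols])

# A two-column index matrix (row, column) picks one cell per row:
# the activity count in each organization's home country
home_vals <- m[cbind(seq_len(nrow(m)), df$home)]

# Foreign activities = all activities minus home-country activities
df$foreign <- unname(rowSums(m)) - home_vals
df$foreign
#[1] 1 2 3 1 1 0 1
```

This assumes that home always holds a valid position within c1 to c4, as it does in the example data.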

【Comments】:

【Solution 2】:

Here is a solution using the dplyr and tidyr packages.

library(dplyr)
library(tidyr)

df2 <- df %>%
  # Convert the home column from a number to a label (c1, c2, c3, c4)
  # so that it matches the column names c1 to c4
  mutate(home = paste0("c", home)) %>%
  # Convert the data frame from wide format to long format
  # activity contains the columns names from c1 to c4 as labels
  # number is the original number for each
  gather(activity, number, -orga, -home) %>%
  # Remove rows when home and activity number are the same
  filter(home != activity) %>%
  # Group by the organization
  group_by(orga) %>%
  # Calculate the total number of activities, call it foreign
  summarise(foreign = sum(number)) %>%
  # Join the results back with df by organization
  left_join(df, by = "orga") %>%
  # Reorder the columns
  select(orga, c1:home, foreign)

Here is the final result. The information you want is in the foreign column of the data frame df2.

# A tibble: 7 × 7
    orga    c1    c2    c3    c4  home foreign
  <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1     AA     3     0     1     0     1       1
2     AB     1     2     0     1     2       2
3     AC     0     2     0     1     3       3
4     BA     0     0     1     0     2       1
5     BB     2     1     0     0     1       1
6     BC     0     0     2     0     3       0
7     BD     1     1     0     0     1       1
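
As a side note, gather() is marked as superseded in tidyr 1.0.0 and later, where pivot_longer() is the recommended replacement. Below is a minimal sketch of the same pipeline with pivot_longer(), assuming a recent tidyr is installed. It is a variation on the answer above, not the answerer's own code, and foreign_sums is a name made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  orga = c("AA", "AB", "AC", "BA", "BB", "BC", "BD"),
  c1   = c(3, 1, 0, 0, 2, 0, 1),
  c2   = c(0, 2, 2, 0, 1, 0, 1),
  c3   = c(1, 0, 0, 1, 0, 2, 0),
  c4   = c(0, 1, 1, 0, 0, 0, 0),
  home = c(1, 2, 3, 2, 1, 3, 1)
)

foreign_sums <- df %>%
  # Wide to long: one row per organization and country column
  pivot_longer(c1:c4, names_to = "activity", values_to = "number") %>%
  # Drop the row that belongs to the organization's home country
  filter(activity != paste0("c", home)) %>%
  # Sum the remaining (foreign) activities per organization
  group_by(orga) %>%
  summarise(foreign = sum(number))

# Attach the sums to the original data frame; left_join keeps the row order of df
df2 <- left_join(df, foreign_sums, by = "orga")
df2$foreign
#[1] 1 2 3 1 1 0 1
```

Unlike the version above, this sketch leaves the home column of df untouched and only builds the "c1" to "c4" label inside filter().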

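【Comments】:

This is great, because it seems to me to be the most flexible solution. Thanks for the great explanation!

【Solution 3】:

Here is another option using rowSums. Using row/column indexing, we replace each organization's home-country value with NA in a copy of the dataset, then use rowSums with na.rm=TRUE to get the row sums, which leaves out the home country.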

df1 <- df
df1[-1][cbind(1:nrow(df), df$home)] <- NA
df$foreign <- rowSums(df1[2:5],na.rm=TRUE) 
df$foreign
#[1] 1 2 3 1 1 0 1

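Or using apply:

# Select c1 to c4 and home by position so that the foreign column added above is not picked up
df$foreign <- apply(df[2:6], 1, function(x) sum(head(x, -1)[-x[5]]))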
df$foreign
#[1] 1 2 3 1 1 0 1

【Comments】: