Python or R - Concat/Appending rows like an inner join

【Question title】: Python or R - Concat/Appending rows like an inner join
【Posted】: 2022-01-08 05:08:59
【Question】:

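I have n data frames, each representing a weekly time period. I want to do something like an inner join (based on id and id2), but one that appends rows from all n data sets rather than appending columns (the columns are identical in every frame).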

They all look like this:

DF1:
id   id2    A      B     C     PERIOD
1     50   0.1    0.2    0.3     1
1    100   0.1    0.2    0.3     1
2     2    0.1    0.2    0.3     1

DF2:
id   id2    A      B     C     PERIOD
1     50   0.5    0.7    0.9     2
1    100   0.6    0.8    0.9     2
1    105   0.1    0.2    0.3     2
2     2    0.3    0.4    0.5     2
2     3    0.1    0.2    0.3     2

...and so on up to DFn

I want a data frame like this:

id   id2    A      B     C     PERIOD
1     50   0.1    0.2    0.3     1
1     50   0.5    0.7    0.9     2
...                              n

1    100   0.1    0.2    0.3     1
1    100   0.6    0.8    0.9     2
...                              n

2     2    0.1    0.2    0.3     1
2     2    0.3    0.4    0.5     2
...                              n

So it drops every id, id2 combination that does not appear in all n of my data sets. Is there a quick way to do this?

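I was thinking of first going through all n data frames and collecting each frame's set of id, id2 pairs, then intersecting all of those sets, then reducing each data frame with .isin, and finally concatenating the list of reduced data frames. That seems tedious, though.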
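The plan described above (intersect the key sets, reduce each frame with `.isin`, then concatenate) could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the sample frames are retyped from the question, `frames`, `keys`, `common` and `result` are names chosen here, and `pd.MultiIndex` is used to treat each (id, id2) pair as a single value.

```python
from functools import reduce

import pandas as pd

# Sample data retyped from the question (DF1 = period 1, DF2 = period 2).
DF1 = pd.DataFrame({
    "id": [1, 1, 2], "id2": [50, 100, 2],
    "A": [0.1, 0.1, 0.1], "B": [0.2, 0.2, 0.2], "C": [0.3, 0.3, 0.3],
    "PERIOD": [1, 1, 1],
})
DF2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2], "id2": [50, 100, 105, 2, 3],
    "A": [0.5, 0.6, 0.1, 0.3, 0.1], "B": [0.7, 0.8, 0.2, 0.4, 0.2],
    "C": [0.9, 0.9, 0.3, 0.5, 0.3], "PERIOD": [2, 2, 2, 2, 2],
})

keys = ["id", "id2"]   # assumed join keys
frames = [DF1, DF2]    # in practice, the list of all n frames

# Intersect the (id, id2) pairs of every frame.
common = reduce(
    lambda a, b: a.intersection(b),
    (pd.MultiIndex.from_frame(df[keys]) for df in frames),
)

# Keep only the rows of each frame whose pair is in the intersection.
reduced = [df[pd.MultiIndex.from_frame(df[keys]).isin(common)] for df in frames]

# Stack the reduced frames row-wise.
result = pd.concat(reduced, ignore_index=True)
print(result)
```

With the two sample frames, `result` should hold six rows: the pairs (1, 50), (1, 100) and (2, 2), once per period.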

【Comments】:

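I can do this in tidyverse or pandas, either is fine.
I suggest concatenating everything and then applying stackoverflow.com/questions/49735683/…. Group by id, id2 and set the count condition to n.
Clever! I forgot that I can remove duplicates first to make sure the n rows come from n different data frames. Thanks.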
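A pandas sketch of the idea in these comments (concatenate everything, group by id and id2, keep groups whose row count equals n) might look like the following. It assumes that PERIOD identifies the source frame, so dropping duplicate (id, id2, PERIOD) rows guarantees that n rows in a group come from n different frames; the variable names are illustrative.

```python
import pandas as pd

# Sample data retyped from the question.
DF1 = pd.DataFrame({
    "id": [1, 1, 2], "id2": [50, 100, 2],
    "A": [0.1, 0.1, 0.1], "B": [0.2, 0.2, 0.2], "C": [0.3, 0.3, 0.3],
    "PERIOD": [1, 1, 1],
})
DF2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2], "id2": [50, 100, 105, 2, 3],
    "A": [0.5, 0.6, 0.1, 0.3, 0.1], "B": [0.7, 0.8, 0.2, 0.4, 0.2],
    "C": [0.9, 0.9, 0.3, 0.5, 0.3], "PERIOD": [2, 2, 2, 2, 2],
})

frames = [DF1, DF2]
n = len(frames)

combined = pd.concat(frames, ignore_index=True)

# Assumption: PERIOD marks the source frame, so this removes pairs that
# are repeated within a single frame before counting.
combined = combined.drop_duplicates(subset=["id", "id2", "PERIOD"])

# Number of rows per (id, id2) pair, broadcast back onto every row.
counts = combined.groupby(["id", "id2"])["PERIOD"].transform("size")

# Keep pairs seen in all n frames and order them as in the desired output.
result = (
    combined[counts == n]
    .sort_values(["id", "id2", "PERIOD"])
    .reset_index(drop=True)
)
print(result)
```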

Tags: python r pandas dataframe

【Solution 1】:

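Reduce(function(a, b) {
  keys <- c("id", "id2")
  rbind(
    # rows accumulated so far whose (id, id2) pair also occurs in b
    merge(a, unique(b[, keys]), by = keys),
    # rows of b whose pair occurs in a; unique() keeps b's rows from being
    # duplicated once a holds several periods per pair (more than two frames)
    merge(b, unique(a[, keys]), by = keys)
  )
}, list(DF1, DF2))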
#   id id2   A   B   C PERIOD
# 1  1 100 0.1 0.2 0.3      1
# 2  1  50 0.1 0.2 0.3      1
# 3  2   2 0.1 0.2 0.3      1
# 4  1 100 0.6 0.8 0.9      2
# 5  1  50 0.5 0.7 0.9      2
# 6  2   2 0.3 0.4 0.5      2
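For readers working in pandas, a rough counterpart of this merge-based approach is sketched below. It is not a translation of the R code line by line: it first inner-merges only the key columns of all frames with `functools.reduce`, then semi-joins each frame against that key table. The names `keys`, `frames`, `common` and `result` are assumptions made for the example.

```python
from functools import reduce

import pandas as pd

# Sample data retyped from the question.
DF1 = pd.DataFrame({
    "id": [1, 1, 2], "id2": [50, 100, 2],
    "A": [0.1, 0.1, 0.1], "B": [0.2, 0.2, 0.2], "C": [0.3, 0.3, 0.3],
    "PERIOD": [1, 1, 1],
})
DF2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2], "id2": [50, 100, 105, 2, 3],
    "A": [0.5, 0.6, 0.1, 0.3, 0.1], "B": [0.7, 0.8, 0.2, 0.4, 0.2],
    "C": [0.9, 0.9, 0.3, 0.5, 0.3], "PERIOD": [2, 2, 2, 2, 2],
})

keys = ["id", "id2"]
frames = [DF1, DF2]

# Inner-merge the de-duplicated key columns of every frame; what is left
# is the table of (id, id2) pairs present in all frames.
common = reduce(
    lambda a, b: a.merge(b, on=keys),
    (df[keys].drop_duplicates() for df in frames),
)

# Semi-join: merging against a key-only table filters rows without
# adding any columns.
result = pd.concat(
    [df.merge(common, on=keys) for df in frames],
    ignore_index=True,
)
print(result)
```

Because `common` carries only the key columns, no `.x`/`.y` style suffixes appear and the column set of each frame is unchanged.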

If you would rather not merge twice, you can use:

Reduce(function(a, b) {
  tmp <- merge(a, b, by = c("id", "id2"), all = TRUE)
  tmp <- tmp[complete.cases(tmp),]
  tmpx <- tmp[,c("id", "id2", grep("\\.x$", colnames(tmp), value = TRUE))]
  colnames(tmpx) <- gsub("\\.x$", "", colnames(tmpx))
  tmpy <- tmp[,c("id", "id2", grep("\\.y$", colnames(tmp), value = TRUE))]
  colnames(tmpy) <- gsub("\\.y$", "", colnames(tmpy))
  rbind(tmpx, tmpy)
}, list(DF1, DF2))
#    id id2   A   B   C PERIOD
# 1   1  50 0.1 0.2 0.3      1
# 2   1 100 0.1 0.2 0.3      1
# 4   2   2 0.1 0.2 0.3      1
# 11  1  50 0.5 0.7 0.9      2
# 21  1 100 0.6 0.8 0.9      2
# 41  2   2 0.3 0.4 0.5      2

Implementing the logic suggested by MYousefi, here is a third option:

library(dplyr)

bind_rows(DF1, DF2) %>%
  group_by(id, id2) %>%
  filter(n() == 2L) %>%  # 2 is the number of frames joined
  ungroup()
# # A tibble: 6 x 6
#      id   id2     A     B     C PERIOD
#   <int> <int> <dbl> <dbl> <dbl>  <int>
# 1     1    50   0.1   0.2   0.3      1
# 2     1   100   0.1   0.2   0.3      1
# 3     2     2   0.1   0.2   0.3      1
# 4     1    50   0.5   0.7   0.9      2
# 5     1   100   0.6   0.8   0.9      2
# 6     2     2   0.3   0.4   0.5      2

【Discussion】:

【Solution 2】:

We can use group_split from the dplyr package in R, which returns a list:

library(dplyr)

bind_rows(DF1, DF2) %>% 
  group_split(id, id2)

[[1]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1    50   0.1   0.2   0.3      1
2     1    50   0.5   0.7   0.9      2

[[2]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1   100   0.1   0.2   0.3      1
2     1   100   0.6   0.8   0.9      2

[[3]]
# A tibble: 1 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1   105   0.1   0.2   0.3      2

[[4]]
# A tibble: 2 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     2     2   0.1   0.2   0.3      1
2     2     2   0.3   0.4   0.5      2

[[5]]
# A tibble: 1 x 6
     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     2     3   0.1   0.2   0.3      2
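Note that the list above still contains the pairs (1, 105) and (2, 3), which occur only in DF2. A possible pandas analogue of the split, with one extra step that keeps only the groups having one row per frame, is sketched here. The dictionary `groups` and the list `complete` are illustrative names, and the sketch assumes each pair appears at most once per frame.

```python
import pandas as pd

# Sample data retyped from the question.
DF1 = pd.DataFrame({
    "id": [1, 1, 2], "id2": [50, 100, 2],
    "A": [0.1, 0.1, 0.1], "B": [0.2, 0.2, 0.2], "C": [0.3, 0.3, 0.3],
    "PERIOD": [1, 1, 1],
})
DF2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2], "id2": [50, 100, 105, 2, 3],
    "A": [0.5, 0.6, 0.1, 0.3, 0.1], "B": [0.7, 0.8, 0.2, 0.4, 0.2],
    "C": [0.9, 0.9, 0.3, 0.5, 0.3], "PERIOD": [2, 2, 2, 2, 2],
})

keys = ["id", "id2"]
frames = [DF1, DF2]

combined = pd.concat(frames, ignore_index=True)

# One sub-frame per (id, id2) pair, keyed by the pair itself.
groups = {key: g for key, g in combined.groupby(keys)}

# Assumption: a pair occurs at most once per frame, so a group with
# len(frames) rows is present in every frame.
complete = [g for g in groups.values() if len(g) == len(frames)]

result = pd.concat(complete, ignore_index=True)
print(result)
```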

If you want a single data frame instead, you can do this:

library(dplyr)
bind_rows(DF1, DF2) %>% 
  group_split(id, id2) %>% 
  bind_rows()

Or simply:

library(dplyr)

bind_rows(DF1, DF2) %>% 
  arrange(id, id2)

     id   id2     A     B     C PERIOD
  <int> <int> <dbl> <dbl> <dbl>  <int>
1     1    50   0.1   0.2   0.3      1
2     1    50   0.5   0.7   0.9      2
3     1   100   0.1   0.2   0.3      1
4     1   100   0.6   0.8   0.9      2
5     1   105   0.1   0.2   0.3      2
6     2     2   0.1   0.2   0.3      1
7     2     2   0.3   0.4   0.5      2
8     2     3   0.1   0.2   0.3      2

【Discussion】: