Answers: sequence number for duplicate rows in R

【Question title】: Sequence number for duplicate rows in R
【Posted】: 2021-08-31 10:51:20
【Question description】:

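I have a data frame with numeric and character columns in which some rows are duplicated. To tell these rows apart, I want to add a new column (called "duplicateID" in my example) that numbers the rows within each "block" of duplicate rows from 1 to n.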

My dataset looks like this:

a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <-data.frame(a,b)

>df1
       a   b
1    one 3.5
2    one 3.5
3    one 3.5
4    one 2.5
5    two 3.5
6    two 3.5
7  three 1.0
8   four 2.2
9   four 7.0
10  four 7.0

The desired output is:

a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
duplicateID = c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
df2 <-data.frame(a,b,duplicateID)

>df2 
       a   b duplicateID
1    one 3.5           1
2    one 3.5           2
3    one 3.5           3
4    one 2.5           1
5    two 3.5           1
6    two 3.5           2
7  three 1.0           1
8   four 2.2           1
9   four 7.0           1
10  four 7.0           2

Thanks in advance, everyone!

【Comments】:

Tags: r dplyr lapply

【Solution 1】:

One way to achieve this with dplyr:

library(dplyr)

df1 %>% 
    # build grouping by combination of variables
    dplyr::group_by(a, b) %>%
    # add row number which works per group due to prior grouping
    dplyr::mutate(duplicateID = dplyr::row_number()) %>%
    # ungroup to prevent unexpected behaviour down stream
    dplyr::ungroup()

# A tibble: 10 x 3
   a         b duplicateID
   <chr> <dbl>       <int>
 1 one     3.5           1
 2 one     3.5           2
 3 one     3.5           3
 4 one     2.5           1
 5 two     3.5           1
 6 two     3.5           2
 7 three   1             1
 8 four    2.2           1
 9 four    7             1
10 four    7             2

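【Comments】:

Thank you for this simple solution! Good to know that dplyr can do this!
Just for information: we can also use across in this setup to group by every column: library(dplyr) df1 %>% group_by(across(everything())) %>% mutate(duplicateID = row_number())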

【Solution 2】:

Probably not as fast as dplyr (and data.table has an option for this too, of course), but in base R you can do it with the `ave` function combined with `seq_along`:

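a <- c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b <- c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)
# ave() splits its first argument by the grouping vectors (a, b), applies FUN to
# each piece, and returns the results in the original row order.
# seq_along(a) only serves as a placeholder vector of the right length.
df1$dupID <- with(df1, ave(seq_along(a), a, b, FUN = seq_along))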
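To make it clearer what `ave` is doing here, the same result can be spelled out by hand with `split`, `lapply` and `unsplit`. This is only a sketch of the idea, not how `ave` is literally implemented; the names `g`, `pieces`, `dupID` and `expected` are made up for the illustration, and `expected` is simply the duplicateID vector from the question.

```r
a <- c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b <- c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)

# One factor level per (a, b) combination that actually occurs in the data
g <- interaction(df1$a, df1$b, drop = TRUE)

# Split the row positions by group and number the rows inside each group
pieces <- lapply(split(seq_along(g), g), seq_along)

# Put the per-group counters back into the original row order
dupID <- unsplit(pieces, g)

# Compare with the one-liner and with the output requested in the question
expected <- c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
stopifnot(all(dupID == expected))
stopifnot(all(dupID == ave(seq_along(df1$a), df1$a, df1$b, FUN = seq_along)))

df1$dupID <- dupID
print(df1)
```

Both `stopifnot` checks pass silently, so the manual version and the `ave` one-liner agree with the desired output for this dataset.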

【Comments】:

【Solution 3】:

We can use `rowid` from data.table:

library(data.table)
setDT(df1)[, dupID := rowid(a, b)]

Output:

> df1
        a   b dupID
 1:   one 3.5     1
 2:   one 3.5     2
 3:   one 3.5     3
 4:   one 2.5     1
 5:   two 3.5     1
 6:   two 3.5     2
 7: three 1.0     1
 8:  four 2.2     1
 9:  four 7.0     1
10:  four 7.0     2
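One thing worth keeping in mind, since the question talks about "blocks" of duplicates: `rowid` counts occurrences of a value over the whole table, whether or not they sit next to each other. In the example data identical rows are always adjacent, so this makes no difference. If the counter should instead restart for every contiguous run, one possible sketch (assuming data.table is installed) is to wrap the columns in `rleid` first. The vector `x` below is made-up data for illustration only.

```r
library(data.table)

# Hypothetical vector in which "a" shows up again after a different value
x <- c("a", "a", "b", "a")

# Counts every occurrence of a value, adjacent or not
by_value <- rowid(x)

# rleid() gives each contiguous run its own id, so the counter restarts per run
by_run <- rowid(rleid(x))

print(by_value)
print(by_run)
```

For the data frame from the question this would correspond to `setDT(df1)[, dupID := rowid(rleid(a, b))]`, which gives the same result as plain `rowid(a, b)` there because no combination reappears later.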

【Comments】: