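【Question title】: Sequence number for duplicate rows in R
【Posted】: 2021-08-31 10:51:20
【Question description】:

I have a data frame with numeric and character columns in which some rows are duplicated. To tell these rows apart, I want to add a new column (called "duplicateID" in my example) that numbers the rows within each "block" of duplicates from 1 to n.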

My dataset looks like this:

a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <-data.frame(a,b)

>df1
       a   b
1    one 3.5
2    one 3.5
3    one 3.5
4    one 2.5
5    two 3.5
6    two 3.5
7  three 1.0
8   four 2.2
9   four 7.0
10  four 7.0

The desired output is:

a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
duplicateID = c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
df2 <-data.frame(a,b,duplicateID)

>df2 
       a   b duplicateID
1    one 3.5           1
2    one 3.5           2
3    one 3.5           3
4    one 2.5           1
5    two 3.5           1
6    two 3.5           2
7  three 1.0           1
8   four 2.2           1
9   four 7.0           1
10  four 7.0           2

Thanks in advance, everyone!

【Comments】:

    Tags: r dplyr lapply


    【Solution 1】:

    One way to do this with dplyr:

    library(dplyr)
    
    df1 %>% 
        # build grouping by combination of variables
        dplyr::group_by(a, b) %>%
        # add row number which works per group due to prior grouping
        dplyr::mutate(duplicateID = dplyr::row_number()) %>%
        # ungroup to prevent unexpected behaviour down stream
        dplyr::ungroup()
    
    # A tibble: 10 x 3
       a         b duplicateID
       <chr> <dbl>       <int>
     1 one     3.5           1
     2 one     3.5           2
     3 one     3.5           3
     4 one     2.5           1
     5 two     3.5           1
     6 two     3.5           2
     7 three   1             1
     8 four    2.2           1
     9 four    7             1
    10 four    7             2
    

    【Comments】:

    • Thanks for this simple solution! Good to know dplyr can do this!
    • Just for information: we can also use across in this setup: library(dplyr); df1 %>% group_by(across()) %>% mutate(duplicatedID = row_number())
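The across() idea from the comment above groups by every column at once, which is convenient when the data frame has more than two columns. Below is a minimal sketch, assuming dplyr 1.0.0 or later. It writes `across(everything())` rather than a bare `across()`, on the assumption that newer dplyr releases warn when no column selection is supplied; check this against your installed version. The name `res` is only illustrative.

```r
library(dplyr)

# Same example data as in the question
df1 <- data.frame(
  a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four"),
  b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
)

res <- df1 %>%
  # group by all columns, so fully identical rows share a group
  group_by(across(everything())) %>%
  # number the rows within each group, in their original order
  mutate(duplicateID = row_number()) %>%
  ungroup()

res$duplicateID
```

Because every column is part of the grouping, this only counts rows that are identical across the whole data frame. If an ID or timestamp column is present, listing the grouping columns explicitly, as in the answer, is the safer choice.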
    【Solution 2】:

    This is probably not as fast as dplyr (and data.table has options too, of course), but in base R you can do this with the `ave` function combined with `seq_along`:

    a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
    b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
    df1 <-data.frame(a,b)
    df1$dupID = NA
    df1$dupID = with(df1,ave(dupID,b,a,FUN = seq_along))
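The NA placeholder column in the snippet above only gives `ave` a vector of the right length to split. A variant that skips the placeholder might look like the sketch below; it passes `seq_along(df1$a)` as a dummy vector instead. This is an illustrative alternative, not part of the original answer.

```r
# Same example data as in the question
df1 <- data.frame(
  a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four"),
  b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
)

# ave() splits its first argument by the grouping vectors (a, b),
# applies FUN to each piece, and puts the results back in row order.
# The values of the first argument do not matter here, only its length,
# because seq_along() just counts the elements in each piece.
df1$dupID <- ave(seq_along(df1$a), df1$a, df1$b, FUN = seq_along)

df1$dupID
```

Starting from an integer vector also keeps the result an integer column without relying on coercion from logical NA.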
    

    【Comments】:

      【Solution 3】:

      We can use rowid:

      library(data.table)
      setDT(df1)[, dupID := rowid(a, b)]
      

      Output:

      > df1
              a   b dupID
       1:   one 3.5     1
       2:   one 3.5     2
       3:   one 3.5     3
       4:   one 2.5     1
       5:   two 3.5     1
       6:   two 3.5     2
       7: three 1.0     1
       8:  four 2.2     1
       9:  four 7.0     1
      10:  four 7.0     2
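Note that `setDT()` converts `df1` to a data.table by reference, so the original object itself changes. If the data frame should stay untouched, a sketch along these lines works on a copy instead, assuming a data.table version that provides `rowid()`. The name `dt` is only illustrative.

```r
library(data.table)

# Same example data as in the question
df1 <- data.frame(
  a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four"),
  b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
)

# as.data.table() makes a copy, so df1 remains a plain data.frame
dt <- as.data.table(df1)

# rowid() numbers the rows within each unique combination of a and b
dt[, dupID := rowid(a, b)]

dt$dupID
```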
      

      【Comments】:
