【Question title】: Removing duplicates with aggregated groups in R
【Posted】: 2019-09-18 10:38:29
【Question】:

Here is a sample of my data:

kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L, 
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L, 
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L, 
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L, 
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L, 
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989, 
0.00293255131964809, 0.00495049504950495, 0.00215982721382289, 
0.0120481927710843, 0.00561797752808989, 0.00293255131964809, 
0.00591715976331361, 0.00495049504950495), mash_score = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")

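What I want to achieve is the following:

For each group defined by the columns ID_WORKES, TABL_NOM, NAME, KOD_DOR and KOD_DEPO, I want exactly one row per value of ID_SP_NAR.

For example, there are six rows with ID_SP_NAR == 30, each with a different prop_violations value. In this case I want to collapse those six rows into one, so that the remaining prop_violations value equals the mean of the six rows.

The desired output looks like this: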

  ID_WORKES TABL_NOM NAME KOD_DOR KOD_DEPO ID_SP_NAR prop_violations mash_score
1  28029571     9716  Dim      28     9167        20     0.004500341          0
2  28029571     9716  Dim      28     9167        30     0.005604367          0

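One more thing: if, among the duplicated prop_violations rows for a given ID_SP_NAR, some rows have mash_score > 0, then instead of averaging, keep the last row with mash_score > 0.

For example: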

  ID_WORKES TABL_NOM NAME ID_SP_NAR KOD_DOR KOD_DEPO COLUMN_MASH prop_violations mash_score
1  28029571     9716  Dim        30      28     9167          13          0,0056          0
2  28029571     9716  Dim        30      28     9167          13     0,012048193          0
3  28029571     9716  Dim        30      28     9167          13     0,005617978          0
4  28029571     9716  Dim        30      28     9167          13     0,002932551          1
5  28029571     9716  Dim        30      28     9167          13      0,00591716          0
6  28029571     9716  Dim        30      28     9167          13     0,004950495          0

In this case, for ID_SP_NAR = 30, only the prop_violations value 0,002932551 should be kept, because that row has mash_score > 0. How can I implement this condition?

【Comments】:

    Tags: r dataframe dplyr data.table


    【Solution 1】:

    An option using data.table:

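    library(data.table)

    setDT(kod)  # convert the data.frame to a data.table by reference
    kod[, {
            if (any(mash_score > 0)) {
                # keep the last row whose mash_score is positive, as the question asks
                i <- tail(which(mash_score > 0), 1L)
                .(prop_violations = prop_violations[i], mash_score = mash_score[i])
            } else {
                # no positive mash_score in this group: average prop_violations
                .(prop_violations = mean(prop_violations), mash_score = mash_score[1L])
            }
        },
        by = .(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR)]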
    

    Output:

       ID_WORKES TABL_NOM NAME KOD_DOR KOD_DEPO ID_SP_NAR prop_violations mash_score
    1:  28029571     9716  Dim      28     9167        20     0.004500341          0
    2:  28029571     9716  Dim      28     9167        30     0.002932551          1
    

    Data (same as in the question, except that one row with ID_SP_NAR == 30 has mash_score = 1):

    kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L, 
    28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
    ), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 
    9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L, 
    20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L, 
    28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L, 
    9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L, 
    13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989, 
    0.00293255131964809, 0.00495049504950495, 0.00215982721382289, 
    0.0120481927710843, 0.00561797752808989, 0.00293255131964809, 
    0.00591715976331361, 0.00495049504950495), mash_score = c(0L, 
    0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")
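
For readers without data.table, the same rule (mean when no mash_score is positive, otherwise the last row with a positive mash_score) can be sketched in base R. This is only an illustration, not part of the original answer: the helper name collapse_group and the toy data frame are made up, and a single grouping column is used to keep it short.

```r
# Hypothetical helper (not from the answers): collapse one group to a single row.
collapse_group <- function(d) {
  pos <- which(d$mash_score > 0)        # positions of rows with a positive score
  out <- d[1L, , drop = FALSE]          # template row carrying the group keys
  if (length(pos) > 0L) {
    i <- pos[length(pos)]               # last row with mash_score > 0
    out$prop_violations <- d$prop_violations[i]
    out$mash_score <- d$mash_score[i]
  } else {
    out$prop_violations <- mean(d$prop_violations)  # no positive score: average
  }
  out
}

# Toy data, invented for illustration: group 20 has no positive mash_score,
# group 30 has exactly one.
toy <- data.frame(
  ID_SP_NAR       = c(20L, 20L, 30L, 30L, 30L),
  prop_violations = c(0.002, 0.004, 0.010, 0.003, 0.005),
  mash_score      = c(0L, 0L, 0L, 1L, 0L)
)

# Split by group, collapse each piece, and bind the rows back together.
pieces <- split(toy, toy$ID_SP_NAR)
res <- do.call(rbind, lapply(pieces, collapse_group))
rownames(res) <- NULL
print(res)
```

With several grouping columns, as in kod, a list of those columns can be passed as the second argument of split() together with drop = TRUE, so that empty combinations are skipped.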
    

    【Comments】:

      【Solution 2】:

      Here is a solution using the tidyverse packages:

      library(dplyr)

      kod %>%
        group_by(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR) %>%
        # mean if no mash_score is positive, else the last value with mash_score > 0
        summarise(prop_violations = if (all(mash_score == 0)) mean(prop_violations) else last(prop_violations[mash_score > 0]))
      

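      If all mash_score values in a given group are zero, the mean of prop_violations is returned (using mean). If at least one mash_score is greater than zero, the last prop_violations value among the rows with mash_score > 0 is returned (using dplyr::last).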
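
The desired output in the question also contains a mash_score column. A minimal sketch of how the same summarise() idea could carry that column along is shown below; the toy data frame is invented for illustration and uses a single grouping column, and this variant is an assumption about what the asker wants rather than part of the original answer.

```r
library(dplyr)

# Toy data, invented for illustration: group 20 has no positive mash_score,
# group 30 has exactly one.
toy <- data.frame(
  ID_SP_NAR       = c(20L, 20L, 30L, 30L, 30L),
  prop_violations = c(0.002, 0.004, 0.010, 0.003, 0.005),
  mash_score      = c(0L, 0L, 0L, 1L, 0L)
)

res <- toy %>%
  group_by(ID_SP_NAR) %>%
  summarise(
    # prop_violations is computed first, while mash_score still refers to the
    # original per-group column
    prop_violations = if (all(mash_score == 0)) {
      mean(prop_violations)
    } else {
      last(prop_violations[mash_score > 0])
    },
    # keep the matching mash_score: the last positive one, or the first value
    # (zero) when the group has no positive score
    mash_score = if (all(mash_score == 0)) {
      first(mash_score)
    } else {
      last(mash_score[mash_score > 0])
    }
  ) %>%
  ungroup()

print(res)
```

The order of the two expressions matters: once mash_score has been redefined inside summarise(), later expressions see the summarised value, so prop_violations has to be computed before it.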

      【Comments】:
