【Question title】: R - identify duplicates based on two columns, find values, and delete cases with specific value
【Posted】: 2020-01-31 20:22:15
【Question description】:

I am analysing a database that contains ecological data. Here is an example:

df <- data.frame(observationID = c("06a4dcc1-a2c1-f1a9-3964-4374c3a26e2a","b8431c2b-fa18-42bf-b2c9-3dc23d308b44","b8431c2b-fa18-42bf-b2c9-3dc23d308b44","ff8a8b93-f307-4695-ad95-1915c2c46c60","ff8a8b93-f307-4695-ad95-1915c2c46c60","c240564d-a100-4cdb-8a81-8ac197a45e8b","c240564d-a100-4cdb-8a81-8ac197a45e8b","f0a18902-fd16-4d82-bc3a-10bd47454dff","f0a18902-fd16-4d82-bc3a-10bd47454dff","f0a18902-fd16-4d82-bc3a-10bd47454dff"),
               animalVernacularName = c("wild boar","Horse","Horse","Horse","Horse","Common Buzzard","Common Buzzard","wild boar","wild boar","Fox"),
               behav = c("1","1","2","1","2","1","1","1","1","2"),
               value = c("Passing","Interest","Intraspecific interaction","Interest","Intraspecific interaction","Interest","Intraspecific interaction","Eating","Intraspecific interaction","Eating"))

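I want to identify duplicates based on two variables ("observationID" and "behav"), then find the "observationID" values of those duplicates, and delete all cases that have one of those "observationID" values. Not just one of the two duplicated rows, but every case with that "observationID" (there can be more cases than just the duplicates). I need to delete all cases with such an "observationID" because the entire observation (which consists of multiple cases) was entered incorrectly.

Identifying the duplicates alone is not the problem; I also need R to give me the "observationID" values of those duplicates.

There are some simple ways to find duplicates across two columns. For example, I tried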

dupe <- duplicated(df[c("observationID","behav")])

This identifies the duplicates, but I don't see how to get the corresponding "observationID" values.

Doing this (pivot_wider() is from tidyr)

test <- pivot_wider(df, names_from = behav, values_from = value, names_prefix = "behav", values_fn = list(value = length))

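I do find the duplicates and can see the corresponding "observationID", but I can't find a way to make R return these values so that I can delete the observations.

I am looking for a way to have R return a list of "observationID" values, i.e. the IDs of the duplicates found based on the "observationID" and "behav" columns. In this example, I am looking for a way to delete all cases with these "observationID" values: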

"c240564d-a100-4cdb-8a81-8ac197a45e8b"
"f0a18902-fd16-4d82-bc3a-10bd47454dff"

I could then use this list in a filter() on my dataset.

So in the end, I would like to get the following result.

df_result <- data.frame(observationID = c("06a4dcc1-a2c1-f1a9-3964-4374c3a26e2a","b8431c2b-fa18-42bf-b2c9-3dc23d308b44","b8431c2b-fa18-42bf-b2c9-3dc23d308b44","ff8a8b93-f307-4695-ad95-1915c2c46c60","ff8a8b93-f307-4695-ad95-1915c2c46c60"),
             animalVernacularName = c("wild boar","Horse","Horse","Horse","Horse"),
             behav = c("1","1","2","1","2"),
             value = c("Passing","Interest","Intraspecific interaction","Interest","Intraspecific interaction"))

【Question comments】:

  • Are you looking for df$observationID[dupe]?

Tags: r filter duplicates
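
The expression suggested in the comment can be checked with a minimal base R sketch. It rebuilds only the two key columns of the question's data (using `rep()` as a shorthand, which is an assumption of this sketch and not part of the original post) and introduces a hypothetical variable name, `bad_ids`, for the result. `unique()` is added so that an ID is listed once even if it has several duplicated rows.

```r
# The five distinct observation IDs from the question's example data
ids <- c("06a4dcc1-a2c1-f1a9-3964-4374c3a26e2a",
         "b8431c2b-fa18-42bf-b2c9-3dc23d308b44",
         "ff8a8b93-f307-4695-ad95-1915c2c46c60",
         "c240564d-a100-4cdb-8a81-8ac197a45e8b",
         "f0a18902-fd16-4d82-bc3a-10bd47454dff")

# Same observationID / behav columns as in the question (other columns omitted)
df <- data.frame(
  observationID = rep(ids, times = c(1, 2, 2, 2, 3)),
  behav = c("1", "1", "2", "1", "2", "1", "1", "1", "1", "2"),
  stringsAsFactors = FALSE
)

# TRUE for every row that repeats an earlier observationID/behav combination
dupe <- duplicated(df[c("observationID", "behav")])

# IDs of the observations that contain at least one duplicated combination
bad_ids <- unique(df$observationID[dupe])
print(bad_ids)
# [1] "c240564d-a100-4cdb-8a81-8ac197a45e8b" "f0a18902-fd16-4d82-bc3a-10bd47454dff"
```

Note that `duplicated()` flags only the second and later occurrences, which is enough here because only the ID is needed, not every duplicated row.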


【Solution 1】:

Another option is dplyr's filter(), reusing the dupe vector computed in the question:

library(dplyr)
df %>%
     filter(!observationID %in% observationID[dupe])
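
For readers who want to run this in isolation, here is a self-contained sketch of the same idea. The data frame uses short stand-in IDs ("obs1" to "obs3") that are hypothetical and not from the question, and `df_clean` is an illustrative name. It assumes `dupe` is computed on the very same data frame that is being filtered, so that both have the same row order and length.

```r
library(dplyr)

# Hypothetical stand-in data: obs3 contains a repeated observationID/behav pair
df <- data.frame(
  observationID = c("obs1", "obs2", "obs2", "obs3", "obs3", "obs3"),
  behav         = c("1",    "1",    "2",    "1",    "1",    "2"),
  stringsAsFactors = FALSE
)

# Flag rows that repeat an earlier observationID/behav combination
dupe <- duplicated(df[c("observationID", "behav")])

# Drop every row whose observationID appears among the flagged rows
df_clean <- df %>%
  filter(!observationID %in% observationID[dupe])

print(df_clean)
```

All three rows of "obs3" are removed, including the one with behav "2" that was not itself a duplicate, which matches what the question asks for.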

【Comments】:

    【Solution 2】:

    A couple of options.

    df[ ! df$observationID %in% df$observationID[dupe], ]
    #                          observationID animalVernacularName behav
    # 1 06a4dcc1-a2c1-f1a9-3964-4374c3a26e2a            wild boar     1
    # 2 b8431c2b-fa18-42bf-b2c9-3dc23d308b44                Horse     1
    # 3 b8431c2b-fa18-42bf-b2c9-3dc23d308b44                Horse     2
    # 4 ff8a8b93-f307-4695-ad95-1915c2c46c60                Horse     1
    # 5 ff8a8b93-f307-4695-ad95-1915c2c46c60                Horse     2
    #                       value
    # 1                   Passing
    # 2                  Interest
    # 3 Intraspecific interaction
    # 4                  Interest
    # 5 Intraspecific interaction
    
    ### or
    dplyr::anti_join(df, df[dupe,"observationID",drop=FALSE], by = "observationID")
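
A possible variation, not taken from the answers above, avoids the separate `dupe` vector by working group-wise in dplyr. The sketch below uses hypothetical stand-in IDs ("obs1" to "obs3") and illustrative names (`df_clean`, `bad_ids`); it assumes the duplicate check is on `behav` within each `observationID`, as in the question.

```r
library(dplyr)

# Hypothetical stand-in data: obs3 contains a repeated observationID/behav pair
df <- data.frame(
  observationID = c("obs1", "obs2", "obs2", "obs3", "obs3", "obs3"),
  behav         = c("1",    "1",    "2",    "1",    "1",    "2"),
  stringsAsFactors = FALSE
)

# Keep only observations in which no behav value occurs more than once
df_clean <- df %>%
  group_by(observationID) %>%
  filter(!any(duplicated(behav))) %>%
  ungroup()

# If the offending IDs themselves are wanted, count each combination
bad_ids <- df %>%
  count(observationID, behav) %>%
  filter(n > 1) %>%
  pull(observationID) %>%
  unique()

print(df_clean)
print(bad_ids)
```

The group-wise form keeps the rule ("no repeated behav inside one observation") in a single pipeline, at the cost of returning a tibble instead of a plain data frame.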
    

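    【Comments】:

    • Thanks for the quick replies. The answers are short, simple, and clear. I studied them and I understand how they work. I have learned something new again.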