【Question title】: R: grouping character values and leaving only one value from the vector by condition [duplicate]
【Posted】: 2020-04-29 20:37:30
【Question】:

For example, I have the following dataset (my real dataset has more than 100,000 rows and 70 variables):

Country   Year   Flag
Norway    2018   drop: reason1
Norway    2018   drop: reason2
Sweden    2016   drop: reason3
France    2011   drop: reason2
France    2011   drop: reason3
France    2011   drop: reason4

First, I want to group the Flag values by the variables Country and Year, so that I get a table like this:

Country   Year   Flag
Norway    2018   drop: reason1, drop: reason2
Sweden    2016   drop: reason3
France    2011   drop: reason2, drop: reason3, drop: reason4

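Second, if the Flag column has several values, I want to keep only one, using the following logic: if drop: reason1 is present, keep it and delete the rest. If there is no drop: reason1 but there are drop: reason2 and drop: reason3, then keep only drop: reason2.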

In the end, my dataset should look like this:

Country   Year   Flag
Norway    2018   drop: reason1
Sweden    2016   drop: reason3
France    2011   drop: reason2

I would like to do this with data.table or base R.

I would appreciate any help, at least with the first part of the question!

【Comments on the question】:

    Tags: r data.table character aggregate grouping


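    【Solution 1】:

    We can order the data by Country and Flag, then select the first value of Flag for each Country and Year. This works here because the flag labels sort alphabetically in the required priority order.

    This can be done in base R: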

    aggregate(Flag~Country+Year, df[with(df, order(Country, Flag)), ], head, 1)
    
    #  Country Year         Flag
    #1  France 2011 drop:reason2
    #2  Sweden 2016 drop:reason3
    #3  Norway 2018 drop:reason1
    

    dplyr

    library(dplyr)
    
    df %>%
      arrange(Country, Flag) %>%
      group_by(Country, Year) %>%
      summarise(Flag = first(Flag))
    

    data.table

    library(data.table)
    setDT(df)
    # .() builds a named result column; without the dot the column would be called V1
    df[order(Country, Flag), .(Flag = first(Flag)), .(Country, Year)]
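
If the real flag labels do not happen to sort in priority order, the ranking could be spelled out explicitly instead of relying on `order()`. The following is only a sketch under assumptions: the `priority` vector and the rebuilt toy table `dt` (with character rather than factor columns) are hypothetical names introduced for illustration, not part of the original answer.

```r
library(data.table)

# Toy data mirroring the example in this answer (character columns)
dt <- data.table(
  Country = c("Norway", "Norway", "Sweden", "France", "France", "France"),
  Year    = c(2018L, 2018L, 2016L, 2011L, 2011L, 2011L),
  Flag    = c("drop:reason1", "drop:reason2", "drop:reason3",
              "drop:reason2", "drop:reason3", "drop:reason4")
)

# Hypothetical priority vector: earlier entries win over later ones
priority <- c("drop:reason1", "drop:reason2", "drop:reason3", "drop:reason4")

# For each Country/Year group, keep the flag with the smallest priority rank
res <- dt[, .(Flag = Flag[which.min(match(Flag, priority))]),
          by = .(Country, Year)]
res
```

Flags that are missing from `priority` get rank `NA` and are skipped by `which.min()`, so every label that can occur should be listed in the vector.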
    

    Data

    df <- structure(list(Country = structure(c(2L, 2L, 3L, 1L, 1L, 1L),
    .Label = c("France","Norway", "Sweden"), class = "factor"), Year = c(2018L, 2018L, 
    2016L, 2011L, 2011L, 2011L), Flag = structure(c(1L, 2L, 3L, 2L, 
    3L, 4L), .Label = c("drop:reason1", "drop:reason2", "drop:reason3", 
    "drop:reason4"), class = "factor")), class = "data.frame", row.names = c(NA, -6L))
    

    【Comments】:

    • Thank you so much!!! And how can I get the first table from my question (with all possible unique Flag values)? Sorry, I'm new to R :((
    • @Hilary You can do aggregate(Flag~Country+Year, df, function(x) toString(unique(x)))
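
Since the question asks for data.table or base R, the `aggregate()` call in the comment above could also be written with data.table. This is a minimal sketch, assuming the same toy data rebuilt as a hypothetical table `dt` with character columns; it is not taken from the original thread.

```r
library(data.table)

# Toy data mirroring the example in this answer (character columns)
dt <- data.table(
  Country = c("Norway", "Norway", "Sweden", "France", "France", "France"),
  Year    = c(2018L, 2018L, 2016L, 2011L, 2011L, 2011L),
  Flag    = c("drop:reason1", "drop:reason2", "drop:reason3",
              "drop:reason2", "drop:reason3", "drop:reason4")
)

# Collapse the distinct flags of each Country/Year group into one
# comma-separated string
collapsed <- dt[, .(Flag = toString(unique(Flag))), by = .(Country, Year)]
collapsed
```

Grouping with `by` keeps the groups in order of first appearance, so the rows come out as Norway, Sweden, France rather than sorted.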