【Question title】:R: How to summarize categories?
【Posted】:2020-04-13 15:01:20
【Question】:

I have 60 categories (called CAT) of life forms (bears, tigers, whales, trees, etc.), and I want to assign them to 10 summary categories (called THEME).

> dt <- fread("lifeforms.csv")
> head(dt)
      CAT COUNT
1:  bears    10
2: tigers     3
3: whales     9

If there weren't so many, I would simply use:

dt$THEME[dt$CAT=="tigers" | dt$CAT=="bears"]="Mammals"

But with my 60 distinct CAT values and 10 themes, that would take too long and be too messy. I have a "lookup" table in another data.table:

> catthemes <- fread("catthemes.csv")
> catthemes
      CAT   THEME
1:  bears Mammals
2: tigers Mammals

How can I do this?

【Comments】:

    Tags: r categories


    【Solution 1】:
    CAT <- c("bears", "tigers", "whales", "lizards")
    COUNT <- c(10, 3, 9, 15)
    THEME <- c("Mammals", "Mammals", "Mammals", "Reptiles")
    
    lifeforms <- data.frame(CAT, COUNT)
    catthemes <- data.frame(CAT, THEME)
    
    
    # Join the lookup table onto the data by the shared CAT column
    new_lifeforms <- merge(lifeforms, catthemes, by="CAT")
    new_lifeforms
    
    #       CAT COUNT    THEME
    # 1   bears    10  Mammals
    # 2 lizards    15 Reptiles
    # 3  tigers     3  Mammals
    # 4  whales     9  Mammals
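
    As a rough sketch of two base R variants of the same idea: `merge()` with `all.x = TRUE` keeps categories that have no entry in the lookup table, and `match()` adds the column without reordering rows. The "sharks" row is made up here purely to show an unmatched category.

```r
# Toy data mirroring the answer above; "sharks" is a hypothetical category with no theme.
lifeforms <- data.frame(CAT = c("bears", "tigers", "whales", "sharks"),
                        COUNT = c(10, 3, 9, 4))
catthemes <- data.frame(CAT = c("bears", "tigers", "whales"),
                        THEME = c("Mammals", "Mammals", "Mammals"))

# Left join: unmatched categories are kept, with THEME set to NA.
# Note that merge() sorts the result by the key column by default.
merged <- merge(lifeforms, catthemes, by = "CAT", all.x = TRUE)

# Lookup via match(): adds the column in place and preserves the row order.
# If a CAT appears more than once in catthemes, match() uses its first row.
lifeforms$THEME <- catthemes$THEME[match(lifeforms$CAT, catthemes$CAT)]
```

    The `match()` form can never add rows, which may be preferable when the lookup table is not guaranteed to have unique CAT values.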
    

    【Comments】:

    • Suggestion: add all.x=TRUE
    • It works fine on a subset, but on my full dataset I get an error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 6771794 rows; more than 6733580 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
    【Solution 2】:
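
    The error message points at duplicate key values, so one likely cause (an assumption, since the full data is not shown) is that some CAT values appear more than once in the lookup table. A minimal data.table sketch that checks for such duplicates and then adds THEME with an update join, which cannot increase the row count of `dt`; the duplicated "bears" row and the names `dups` and `lookup` are made up for illustration:

```r
library(data.table)

dt <- data.table(CAT = c("bears", "tigers", "whales", "bears"),
                 COUNT = c(10, 3, 9, 2))
# Hypothetical lookup table with an accidental duplicate row for "bears".
catthemes <- data.table(CAT = c("bears", "bears", "tigers", "whales"),
                        THEME = c("Mammals", "Mammals", "Mammals", "Mammals"))

# List any CAT values that occur more than once in the lookup table.
dups <- catthemes[, .N, by = CAT][N > 1]

# Keep one row per CAT (the first one), then add THEME to dt by reference.
lookup <- unique(catthemes, by = "CAT")
dt[lookup, on = "CAT", THEME := i.THEME]
```

    If `dups` is not empty, it is worth checking whether the repeated rows agree on THEME before dropping them.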

    An option using inner_join:

    library(dplyr)
    inner_join(lifeforms, catthemes, by = 'CAT')
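
    Note that `inner_join` drops any CAT that is missing from the lookup table. A hedged sketch of the `left_join` alternative, followed by a per-theme total; the "sharks" row and the column name `TOTAL` are made up for illustration:

```r
library(dplyr)

lifeforms <- data.frame(CAT = c("bears", "tigers", "whales", "lizards", "sharks"),
                        COUNT = c(10, 3, 9, 15, 4))
catthemes <- data.frame(CAT = c("bears", "tigers", "whales", "lizards"),
                        THEME = c("Mammals", "Mammals", "Mammals", "Reptiles"))

# left_join keeps every row of lifeforms; unmatched CATs get THEME = NA.
themed <- left_join(lifeforms, catthemes, by = "CAT")

# Sum COUNT within each theme (unmatched rows end up in an NA group).
theme_totals <- themed %>%
  group_by(THEME) %>%
  summarise(TOTAL = sum(COUNT))
```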
    

    【Comments】:

    • This works well, and it preserves the original row and column order.