R: How to summarize categories? Answers

【Question title】: R: How to summarize categories?
【Posted】: 2020-04-13 15:01:20
【Question description】:

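I have 60 categories (called CAT) of life forms (bears, tigers, whales, trees, etc.), and I want to assign each of them to one of 10 summary categories (called THEME).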

> dt <- fread("lifeforms.csv")
> head(dt)
      CAT COUNT
1:  bears    10
2: tigers     3
3: whales     9

If there were only a few, I would simply use:

dt$THEME[dt$CAT=="tigers" | dt$CAT=="bears"]="Mammals"

But with 60 distinct CAT values and 10 themes, this would take too long and get too messy. I have the "lookup" table in another data.table:

> catthemes <- fread("catthemes.csv")
> catthemes
      CAT   THEME
1:  bears Mammals
2: tigers Mammals

How can I do this?

【Question discussion】:

Tags: r categories

【Solution 1】:

CAT <- c("bears", "tigers", "whales", "lizards")
COUNT <- c(10, 3, 9, 15)
THEME <- c("Mammals", "Mammals", "Mammals", "Reptiles")

lifeforms <- data.frame(CAT, COUNT)
catthemes <- data.frame(CAT, THEME)


# Attach the theme to each row via the shared CAT column
new_lifeforms <- merge(lifeforms, catthemes, by="CAT")
new_lifeforms

      CAT COUNT    THEME
1   bears    10  Mammals
2 lizards    15 Reptiles
3  tigers     3  Mammals
4  whales     9  Mammals
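
Note that merge() sorted the result by CAT, so the rows no longer follow the original order. If the order matters, one alternative (not what the answer uses) is a base R lookup with match(). A minimal sketch on the same toy data:

```r
# Same toy data as in the answer above.
lifeforms <- data.frame(CAT = c("bears", "tigers", "whales", "lizards"),
                        COUNT = c(10, 3, 9, 15))
catthemes <- data.frame(CAT = c("bears", "tigers", "whales", "lizards"),
                        THEME = c("Mammals", "Mammals", "Mammals", "Reptiles"))

# For each lifeforms$CAT, match() gives the position of the first
# equal value in catthemes$CAT (NA when there is none).
idx <- match(lifeforms$CAT, catthemes$CAT)

# Index the THEME column with those positions; row order is untouched.
lifeforms$THEME <- catthemes$THEME[idx]
print(lifeforms)
```

Because match() only ever returns the first hit, this cannot multiply rows even if the lookup table contains repeated CAT values.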

【Discussion】:

Suggestion: add all.x=TRUE so that rows of lifeforms without a match in catthemes are kept.
It works fine on a subset, but on my full dataset I get an error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 6771794 rows; more than 6733580 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
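
As the message itself hints, this error usually means some CAT values appear more than once in the lookup table, so each matching data row is repeated. A minimal sketch of one way to handle it, assuming the lookup is meant to hold exactly one THEME per CAT (the toy tables and the duplicated "bears" row are made up for illustration):

```r
library(data.table)

# Toy data for illustration only; the real tables come from the CSV files.
dt <- data.table(CAT = c("bears", "tigers", "whales", "oaks"),
                 COUNT = c(10, 3, 9, 4))
# Hypothetical lookup with an accidental duplicate row for "bears".
catthemes <- data.table(CAT = c("bears", "bears", "tigers", "whales"),
                        THEME = c("Mammals", "Mammals", "Mammals", "Mammals"))

# List the lookup keys that occur more than once.
dups <- catthemes[, .N, by = CAT][N > 1]
print(dups)

# Keep the first row per CAT so that every key is unique.
lookup <- unique(catthemes, by = "CAT")

# Update join: adds THEME to dt by reference. The row count and row
# order of dt stay the same; unmatched categories get NA.
dt[lookup, on = "CAT", THEME := i.THEME]
print(dt)
```

If the duplicated keys carry different themes, dropping rows with unique() hides a data problem, so it is worth inspecting `dups` before deciding which row to keep.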

【Solution 2】:

An option using inner_join:

library(dplyr)
inner_join(lifeforms, catthemes, by = 'CAT')

【Discussion】:

This works well, and it preserves the original row and column order.
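
One caveat: inner_join drops any row whose CAT is missing from the lookup table. If those rows should be kept, left_join is the usual alternative. A minimal sketch, assuming toy data in which "oaks" has deliberately been left out of the lookup:

```r
library(dplyr)

# Toy data for illustration; "oaks" has no entry in the lookup table.
lifeforms <- data.frame(CAT = c("whales", "bears", "oaks", "tigers"),
                        COUNT = c(9, 10, 4, 3))
catthemes <- data.frame(CAT = c("bears", "tigers", "whales"),
                        THEME = c("Mammals", "Mammals", "Mammals"))

# left_join keeps every row of lifeforms in its original order;
# categories without a match get THEME = NA.
themed <- left_join(lifeforms, catthemes, by = "CAT")
print(themed)

# anti_join lists the categories that still need a theme assigned.
unmatched <- anti_join(lifeforms, catthemes, by = "CAT")
print(unmatched)
```

Running anti_join first is a cheap way to confirm that all 60 categories are covered before relying on the joined result.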