Recoding variables with R: answers

【Question title】: Recoding variables with R
【Posted】: 2011-03-21 01:13:32
【Question】:

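Recoding variables in R seems to be my biggest headache. Which functions, packages, and processes do you use to ensure the best result?

I have found very few useful examples on the Internet that offer a one-size-fits-all solution to recoding, and I am interested to see what you are all using.

Note: this may be a community wiki topic.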

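【Question comments】:

Recoding factors, numerics, binning continuous variables into discrete categories, all of the above (and more)?
@Chase, the question is intentionally broad, because I would like to collect as many solutions to this common problem as possible.
Brandon Bertelsen, I have only ever heard "recoding" used in the usual sense of "rename categorical labels / reorder categorical levels / swap level labels". Never for "convert a continuous variable into discrete categories"; that is binning, not recoding. Nor for changing cut thresholds or quantiles. You need to state some specific use cases and show some example code or data. Otherwise this is a) too vague and b) a terrible canonical. By the way, neither Google nor Wikipedia is aware of this meaning of "recoding".
@smci You are welcome to suggest edits to this 7-year-old question.

Tags: r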

【Solution 1】:

Recoding can mean many things, and it is fundamentally complicated.

The levels of a factor can be changed with the levels function:

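> library(survival)  # provides the veteran data set
> #change the levels of a factor (assigned by position, in the existing level order)
> levels(veteran$celltype) <- c("s","sc","a","l")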
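Because `levels<-` assigns by position, it can help to rename by name or to merge levels with a named list instead. The following is only a minimal sketch on a made-up factor `f` (the data and the new labels are hypothetical, not taken from the veteran data set):

```r
# Hypothetical example factor; levels are sorted alphabetically by default
f <- factor(c("squamous", "smallcell", "adeno", "large", "adeno"))

# Rename a single level by name rather than by position
levels(f)[levels(f) == "smallcell"] <- "small"

# Merge levels with a named list of the form new_level = c(old_levels)
levels(f) <- list(small = "small", nonsmall = c("adeno", "large", "squamous"))

table(f)
```

The named-list form makes the old-to-new mapping explicit, so it does not silently break if the level order changes.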

Transforming a continuous variable simply involves applying a vectorized function:

> mtcars$mpg.log <- log(mtcars$mpg)

To bin continuous data, look at cut and cut2 (in the Hmisc package). For example:

> library(Hmisc)  # provides cut2
> #make 4 groups with equal sample sizes
> mtcars[['mpg.tr']] <- cut2(mtcars[['mpg']], g=4)
> #make 4 groups with equal bin width
> mtcars[['mpg.tr2']] <- cut(mtcars[['mpg']],4, include.lowest=TRUE)
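If you would rather choose the cut points and labels yourself, base R's cut accepts explicit breaks. This is a sketch only; the thresholds 15 and 25 and the label names are arbitrary assumptions for illustration. The second call shows one base-R way to get roughly equal-sized groups without Hmisc, by cutting at the quartiles:

```r
mpg <- mtcars$mpg

# Explicit breaks and labels; right = FALSE makes the bins [-Inf,15), [15,25), [25,Inf)
mpg_band <- cut(mpg,
                breaks = c(-Inf, 15, 25, Inf),
                labels = c("low", "mid", "high"),
                right = FALSE)

# Quartile-based bins; include.lowest = TRUE keeps the minimum value in the first bin
mpg_quart <- cut(mpg,
                 breaks = quantile(mpg, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE)

table(mpg_band)
table(mpg_quart)
```

Note that the quantile approach fails if two quantiles coincide, because cut requires unique breaks.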

To recode a continuous or factor variable into a categorical variable, there is recode in the car package and recode.variables in the Deducer package:

> mtcars[c("mpg.tr2")] <- recode.variables(mtcars[c("mpg")] , "Lo:14 -> 'low';14:24 -> 'mid';else -> 'high';")

If you are looking for a GUI, Deducer implements recoding through its Transform and Recode dialogs:

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RecodeVariables

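【Comments】:

I also like the recode function in the car package. It can also be used to map one set of categories onto another (for example, when you want to collapse a bunch of small categories into an "other" category).
When recoding the levels of a factor, I often use dput(levels(var)), then paste and edit the output, and then assign it back with levels(var) <-. I find this convenient.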
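The dput workflow described in the comment above might look roughly like this. The factor `f` and the edited labels are made up for illustration:

```r
# Hypothetical factor to relabel
f <- factor(c("apple", "pear", "apple", "kiwi"))

# Print the current levels as R code that can be pasted back into a script
dput(levels(f))
# prints: c("apple", "kiwi", "pear")

# Paste the printed vector, edit the labels in place, and assign it back.
# The order must stay the same, because levels<- matches by position.
levels(f) <- c("Apple", "Kiwi", "Pear")
```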

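【Solution 2】:

I find mapvalues from the plyr package very handy. The package also contains the function revalue, which is similar to car::recode.

The following example "recodes" the letters r, o, m, a, n to upper case: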

> library(plyr)
> mapvalues(letters, from = c("r", "o", "m", "a", "n"), to = c("R", "O", "M", "A", "N"))
 [1] "A" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "M" "N" "O" "p" "q" "R" "s" "t" "u" "v" "w" "x" "y" "z"
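If you would rather not depend on plyr, the same from/to mapping can be approximated in base R with match. This is a sketch of an equivalent, not how mapvalues is implemented internally; the variable names are my own:

```r
from <- c("r", "o", "m", "a", "n")
to   <- c("R", "O", "M", "A", "N")
x    <- letters

# Position of each element of x in `from` (NA where there is no match)
idx <- match(x, from)

# Replace matched elements, leave everything else unchanged
out <- ifelse(is.na(idx), x, to[idx])
out
```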

【Comments】:

【Solution 3】:

I find this very convenient when several values need to be transformed (it is like doing recodes in Stata):

# load the package and generate some data
library(car)
x <- 1:10

# do the recoding
x
## [1]   1   2   3   4   5   6   7   8   9  10

recode(x,"10=1; 9=2; 1:4=-99")
## [1] -99 -99 -99 -99   5   6   7   8   2   1

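【Comments】:

【Solution 4】:

I find that it can sometimes be easier to convert non-numeric factors to character before trying to change them, for example: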

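# stringsAsFactors = TRUE makes the column a factor (this was the default before R 4.0.0)
df <- data.frame(example = letters[1:26], stringsAsFactors = TRUE)
# convert to character so that values can be overwritten freely
example <- as.character(df$example)
example[example %in% letters[1:20]] <- "a"
example[example %in% letters[21:26]] <- "b"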

Also, when importing data, it can be useful to make sure that numbers are actually numeric before trying to convert them:

df <- data.frame(example=1:100)
example <- as.numeric(df$example)
example[example < 20] <- 1
example[example >= 20 & example < 80] <- 2
example[example >= 80] <- 3
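The three threshold assignments above can also be written as a single call to base R's findInterval. This is an alternative sketch, not part of the original answer; it assumes the same cut points, 20 and 80:

```r
example <- 1:100

# findInterval returns 0 for values below 20, 1 for [20, 80), and 2 for >= 80;
# adding 1 gives the group codes 1, 2, 3 used above
grp <- findInterval(example, c(20, 80)) + 1L

table(grp)
# grp
#  1  2  3 
# 19 60 21 
```

A side benefit is that the original vector is never overwritten, so there is no risk of an already-recoded value matching a later condition.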

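【Comments】:

【Solution 5】:

When you want to recode the levels of a factor, forcats may come in handy. You can read the factors chapter of R for Data Science for a thorough tutorial, but here is the gist of it.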

library(tidyverse)
library(forcats)
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                           "Republican, strong"    = "Strong republican",
                           "Republican, weak"      = "Not str republican",
                           "Independent, near rep" = "Ind,near rep",
                           "Independent, near dem" = "Ind,near dem",
                           "Democrat, weak"        = "Not str democrat",
                           "Democrat, strong"      = "Strong democrat",
                           "Other"                 = "No answer",
                           "Other"                 = "Don't know",
                           "Other"                 = "Other party"
  )) %>%
  count(partyid)
#> # A tibble: 8 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1                 Other   548
#> 2    Republican, strong  2314
#> 3      Republican, weak  3032
#> 4 Independent, near rep  1791
#> 5           Independent  4119
#> 6 Independent, near dem  2499
#> # ... with 2 more rows

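You can even let R decide which categories (factor levels) to merge together.

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That is the job of fct_lump(). [...] The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.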

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%  # default behaviour, no n given
  count(relig, sort = TRUE) %>%
  print(n = Inf)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1 Protestant 10846
#> 2      Other 10637
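For readers without forcats, a much cruder version of lumping can be sketched in base R by collapsing every level whose count falls below a fixed threshold. This is not the algorithm fct_lump() uses; the factor `f` and the threshold of 2 are assumptions for illustration:

```r
# Hypothetical factor with two rare levels
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

counts <- table(f)
rare <- names(counts)[counts < 2]   # levels seen fewer than 2 times

# Assigning the same label to several levels merges them into one
levels(f)[levels(f) %in% rare] <- "Other"

table(f)
```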

【Comments】:

【Solution 6】:

Consider this sample data.

df <- data.frame(a = 1:5, b = 5:1)
df
#  a b
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1

Here are two options -

1. case_when:

For a single column -

library(dplyr)

df %>%
  mutate(a = case_when(a == 1 ~ 'a', 
                       a == 2 ~ 'b', 
                       a == 3 ~ 'c', 
                       a == 4 ~ 'd', 
                       a == 5 ~ 'e'))

#  a b
#1 a 5
#2 b 4
#3 c 3
#4 d 2
#5 e 1

For multiple columns -

df %>%
  mutate(across(c(a, b), ~case_when(. == 1 ~ 'a', 
                                    . == 2 ~ 'b', 
                                    . == 3 ~ 'c', 
                                    . == 4 ~ 'd', 
                                    . == 5 ~ 'e')))

#  a b
#1 a e
#2 b d
#3 c c
#4 d b
#5 e a
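One thing worth knowing about case_when is that values matching none of the conditions become NA. A catch-all `TRUE ~ ...` condition at the end acts as a default. A small sketch on the same sample data, where the label 'other' is my own choice:

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 5:1)

out <- df %>%
  mutate(a = case_when(a == 1 ~ 'a',
                       a == 2 ~ 'b',
                       TRUE   ~ 'other'))  # fallback for everything else

out$a
```

Conditions are evaluated in order, so the catch-all has to come last.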

2. dplyr::recode:

For a single column -

df %>%
  mutate(a = recode(a, '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e'))

For multiple columns -

df %>%
  mutate(across(c(a, b), 
         ~recode(., '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e')))

【Comments】:

【Solution 7】:

Create a lookup vector with setNames, then match on the names:

# iris as an example data
table(iris$Species)
# setosa versicolor  virginica 
#     50         50         50

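x <- setNames(c("x","y","z"), c("setosa","versicolor","virginica"))
# convert the factor to character first: indexing with a factor uses its
# integer codes (positions), not its labels
iris$Species <- unname(x[ as.character(iris$Species) ])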

table(iris$Species)
#  x  y  z 
# 50 50 50
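Two details of the lookup-vector approach are easy to trip over: a factor used as an index is treated as integer codes rather than names, and values missing from the lookup come back as NA. A small sketch with made-up inputs (the names `lookup`, `f`, `species` and the fallback label "other" are illustrative):

```r
lookup <- c(setosa = "x", versicolor = "y", virginica = "z")

# Pitfall: this factor has levels "setosa", "virginica", so its codes are 2, 1
f <- factor(c("virginica", "setosa"))
by_code <- unname(lookup[f])                 # positions 2, 1: "y" "x" (wrong)
by_name <- unname(lookup[as.character(f)])   # names: "z" "x" (intended)

# Values absent from the lookup become NA; replace them with a fallback
species <- c("setosa", "virginica", "unknown")
recoded <- unname(lookup[species])
recoded[is.na(recoded)] <- "other"
recoded
```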

【Comments】: