【Question title】:How to replace all values in a column based on an ordered vector in R
【Posted】:2019-11-28 17:17:06
【Question】:

I am trying to replace all numeric values in a data frame column with ordered categories. Here is a dummy data frame:

df <- data.frame(a = c(1:100), b = sample(0:20, size = 100, replace = TRUE), c = c(1:100))

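Note that the actual data frame is a data file imported with haven::read_dta(). The actual data can be found on the GSS site. I am working with the 2018 file and want to replace all values in b (i.e. 0 to 20) with a set of categories, as follows: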

educ_vec <- c("No formal schooling", "1st grade", "2nd grade", "3rd grade", "4th grade", "5th grade", "6th grade", "7th grade", "8th grade", "9th grade", "10th grade", "11th grade", "12th grade", "1 year of college", "2 years of college", "3 years of college", "4 years of college", "5 years of college", "6 years of college", "7 years of college", "8 years of college")
educ_fac <- factor(educ_vec, ordered = TRUE, levels = educ_vec)

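If I use mutate() with ifelse() for each individual category, the process is too long and does not preserve the order in educ_fac. I tried several ways to do it in one step, without success. One attempt was: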

gss_df %>% 
  mutate(educ = fct_recode(educ, 
                           "No formal schooling" = 0, 
                           "1st grade" = 1, 
                           "2nd grade" = 2, 
                           "3rd grade" = 3, 
                           "4th grade" = 4, 
                           "5th grade" = 5, 
                           "6th grade" = 6, 
                           "7th grade" = 7, 
                           "8th grade" = 8, 
                           "9th grade" = 9, 
                           "10th grade" = 10, 
                           "11th grade" = 11, 
                           "12th grade" = 12, 
                           "1 year of college" = 13, 
                           "2 years of college" = 14, 
                           "3 years of college" = 15, 
                           "4 years of college" = 16, 
                           "5 years of college" = 17, 
                           "6 years of college" = 18, 
                           "7 years of college" = 19, 
                           "8 years of college" = 20))

Error: `f` must be a factor (or character vector or numeric vector).

Two other attempts were similar and also failed:

gss_df %>% 
  mutate(educ = fct_recode(educ, educ_fac))

Error: `f` must be a factor (or character vector or numeric vector).

gss_df %>% 
  mutate(educ = recode_factor(educ, educ_vec, ordered = TRUE))

Error in UseMethod("recode") : no applicable method for 'recode' applied to an object of class "haven_labelled"

Can anyone solve this?

【Comments】:

    Tags: r


    【Solution 1】:

    For some reason I could not read the dta file, so below I simulate data to show what I suggest. You start with your educ_vec vector.

    educ_vec <- c("No formal schooling", "1st grade", 
    "2nd grade", "3rd grade", "4th grade", "5th grade", 
    "6th grade", "7th grade", "8th grade", "9th grade", 
    "10th grade", "11th grade", "12th grade", "1 year of college", 
    "2 years of college", "3 years of college", "4 years of college", 
    "5 years of college", "6 years of college", "7 years of college", 
    "8 years of college")
    

    If you look at educ_vec, it is already in the order you need:

    # this is meant for 0
    educ_vec[1]
    [1] "No formal schooling"
    # this is meant for 20
    educ_vec[21]
    [1] "8 years of college"
    

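    If your score is i, the new category is educ_vec[i+1], because R vectors are indexed from 1 while the scores start at 0. So we can use it as follows: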

    library(dplyr)

    set.seed(100)
    gss_df <- data.frame(educ = sample(0:20, 30, replace = TRUE))
    # index educ_vec by score + 1, then fix the level order with educ_vec
    gss_df %>% 
      mutate(new = factor(educ_vec[educ + 1], ordered = TRUE, levels = educ_vec))
    
       educ                new
    1     9          9th grade
    2     5          5th grade
    3    15 3 years of college
    4    18 6 years of college
    5    13  1 year of college
    6    11         11th grade
    7     5          5th grade
    8     3          3rd grade
    9     5          5th grade
    10    1          1st grade
    11    6          6th grade
    12    6          6th grade
    13   10         10th grade
    14   17 5 years of college
    15   11         11th grade
    16    2          2nd grade
    17   18 6 years of college
    18    7          7th grade
    19   17 5 years of college
    20    1          1st grade
    21   18 6 years of college
    22    3          3rd grade
    23    3          3rd grade
    24   19 7 years of college
    25   15 3 years of college
    26   20 8 years of college
    27    6          6th grade
    28   15 3 years of college
    29   10         10th grade
    30   19 7 years of college
    

    Yes, it still works if some levels do not appear in the data:

    gss_df <- data.frame(educ=0:5)%>%
    mutate(new=factor(educ_vec[educ+1],ordered = TRUE, levels = educ_vec))
    
      educ                 new
    1    0 No formal schooling
    2    1           1st grade
    3    2           2nd grade
    4    3           3rd grade
    5    4           4th grade
    6    5           5th grade
    

    You can see that the new column is an ordered factor with the expected levels.

    str(gss_df)
    'data.frame':   6 obs. of  2 variables:
     $ educ: int  0 1 2 3 4 5
     $ new : Ord.factor w/ 21 levels "No formal schooling"<..: 1 2 3 4 5 6
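
    With the real file, educ comes back from haven::read_dta() as a haven_labelled vector rather than a plain integer, which is what the errors in the question point to. A minimal sketch of one way around that, assuming the labels can simply be dropped with haven::zap_labels() before indexing. The small educ vector here is made up for illustration and I have not run this against the GSS file itself:

```r
library(haven)

# same 21 categories as educ_vec above, built compactly
educ_vec <- c("No formal schooling",
              "1st grade", "2nd grade", "3rd grade",
              paste0(4:12, "th grade"),
              "1 year of college",
              paste(2:8, "years of college"))

# hypothetical stand-in for a column imported with read_dta()
educ <- labelled(c(0, 12, 16, 20), labels = c("no formal schooling" = 0))

# drop the haven labels so the values can be used as plain numbers
educ_num <- as.numeric(zap_labels(educ))

# same indexing trick as above: score i maps to educ_vec[i + 1]
new <- factor(educ_vec[educ_num + 1], levels = educ_vec, ordered = TRUE)
new
```

    After that conversion the rest of the answer applies unchanged.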
    

    If your scores fall outside 0-20, for example -1, -2 or 21, 22, then I suggest the following:

    names(educ_vec) = 0:20
    gss_df <- data.frame(educ=c(-1,0,20,21))
    # you can also use mutate
    gss_df$new <- educ_vec[match(gss_df$educ,names(educ_vec))]
    gss_df
    
      educ                 new
    1   -1                <NA>
    2    0 No formal schooling
    3   20  8 years of college
    4   21                <NA>
    

    match() returns NA when no corresponding name is found in educ_vec.
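
    Note that the match() version returns a plain character column. If you also want the ordered factor in that case, a possible alternative (a sketch, not something from the question) is to pass the codes as levels and the category names as labels to base factor(), which gives NA for any value outside 0-20:

```r
# same 21 categories as educ_vec above, built compactly
educ_vec <- c("No formal schooling",
              "1st grade", "2nd grade", "3rd grade",
              paste0(4:12, "th grade"),
              "1 year of college",
              paste(2:8, "years of college"))

educ <- c(-1, 0, 20, 21)

# levels = the codes found in the data, labels = the names they should get;
# values that are not among the levels become NA
new <- factor(educ, levels = 0:20, labels = educ_vec, ordered = TRUE)
new
```

    Here -1 and 21 come out as NA, and the result still carries all 21 levels in the order of educ_vec.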

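    【Comments】:

    • It works, even when I replace new with educ to avoid adding a new column. May I know what this part, educ_vec[educ+1], means? Also, does it still work if some levels in educ_vec are not found in the column?
    • Hi @EricAtani, OK, I added more explanation in the updated answer. Is it clear now?
    • I see. So because the values in educ run from 0 to 20, we need to add 1 to match the positions in educ_vec, right?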
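    【Solution 2】:

    Another way to solve this is to use a named vector and order the factor afterwards. Once the .dta file is read into the workspace, there are several ways to approach it.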

    set.seed(777)
    library(tidyverse)
    df <- data.frame(a = c(1:100), b = sample(c(0:20), size = 100, replace = TRUE), c = c(1:100))
    
    # -------------------------------------------------------------------------
    head(df)
    #   a  b c
    # 1 1  0 1
    # 2 2 18 2
    # 3 3 11 3
    # 4 4  9 4
    # 5 5 11 5
    # 6 6  8 6
    
    # -------------------------------------------------------------------------
    
    # this will be used as names instead
    educ_vec <- c("No formal schooling", "1st grade", "2nd grade", "3rd grade", "4th grade", "5th grade", "6th grade", "7th grade", "8th grade", "9th grade", "10th grade", "11th grade", "12th grade", "1 year of college", "2 years of college", "3 years of college", "4 years of college", "5 years of college", "6 years of college", "7 years of college", "8 years of college")
    
    # values as character, from 0 to 20
    value_vec <- as.character(seq(21)-1)
    
    # assign educ_vec as names 
    names(value_vec) <- educ_vec
    
    # fct_recode b
    df$educ <- fct_recode(factor(df$b), !!!value_vec)
    
    # set educ as ordered factor using educ_vec as levels
    df$educ <- factor(df$educ, ordered = TRUE, levels = educ_vec)
    
    # -------------------------------------------------------------------------
    head(df)
    #   a  b c                educ
    # 1 1  0 1 No formal schooling
    # 2 2 18 2  6 years of college
    # 3 3 11 3          11th grade
    # 4 4  9 4           9th grade
    # 5 5 11 5          11th grade
    # 6 6  8 6           8th grade
    
    # -------------------------------------------------------------------------
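
    To make the !!! step less mysterious: it splices the named vector into the call, so each name = value pair becomes its own argument. A small sketch with only three categories (the short f and value_vec here are cut down for illustration) showing that the spliced call matches the one written out by hand:

```r
library(forcats)

f <- factor(c(0, 2, 1, 2))

# names are the new labels, values are the existing levels (as character)
value_vec <- c("No formal schooling" = "0", "1st grade" = "1", "2nd grade" = "2")

# spliced: every element of value_vec is passed as a separate named argument
spliced <- fct_recode(f, !!!value_vec)

# the same call spelled out by hand
spelled_out <- fct_recode(f,
                          "No formal schooling" = "0",
                          "1st grade" = "1",
                          "2nd grade" = "2")

identical(spliced, spelled_out)
```

    This is also why the values have to be character strings: fct_recode() matches them against the existing level names of the factor, which is where the first attempt in the question went wrong.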
    
    
    

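    【Comments】:

    • How do the second and third steps work? I thought value_vec was a character vector, so how do the numbers change? And what does !!! mean here? (Sorry, I am new to R)
    • You are right, value_vec is a named character vector. Printing the output at each step may help to see what each step does. educ_vec is used as the names of value_vec, which you can check with names(value_vec). As for the triple bang, it is worth reading "Unquoting many arguments".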