【Question title】:How to collapse/recode a variable in R
【Posted】:2015-12-23 21:54:12
【Question】:

I'm only taking an intro course in R, so this may be very basic.

I'm working with the Outlook on Life data set and am interested in income. Respondents had to choose one of the following 19 options:

Less than $5,000     
$5,000 to $7,499     
$7,500 to $9,999     
$10,000 to $12,499   
$12,500 to $14,999   
$15,000 to $19,999   
$20,000to $24,999    
$25,000 to $29,999   
$30,000 to $34,999  
$35,000 to $39,999   
$40,000 to $49,999   
$50,000 to $59,999   
$60,000 to $74,999   
$75,000 to $84,999   
$85,000 to $99,999   
$100,000 to $124,999
$125,000 to $149,999 
$150,000 to $174,999
$175,000 or more 

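To make the plot easier to read, I'd like to collapse and simplify this into the following:

  1. Under poverty line ($0 - 24,999),
  2. Working class ($25,000 - 34,999),
  3. Lower middle class ($35,000 - 60,000),
  4. Middle class ($60,000 - 100,000),
  5. Upper middle class ($100,000 - 150,000),
  6. Top 5% ($150,000 and above).

How would I go about recoding this?

Thanks!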

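【Comments】:

  • Try the cut function.
  • Your intervals are problematic. If someone made 22,000, they would choose group 7 (20k - 24,999). You'd want them under the poverty line. But someone making 24k would also choose group 7, yet they are working class. How would you tell the difference?
  • Yes, there is a problem. I can massage my desired groupings so that they fit the preset intervals better. So I can have Under Poverty Line go up to 24,999, and then Working Class up to 34,999.
  • @Katherine: Edit your code/question so that it asks a question with a sensible answer. Comments are not the right place to amend a question.

Tags: r collapse recode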


【Solution 1】:

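The easiest way to recode a factor is to assign a named list to levels(): each name becomes a new level, and the values under it are the old levels that get folded into it.

I'm assuming your data is already a factor (since you say "respondents had to choose one of the following 19 options"), which means the cut function doesn't really make sense here; cut is meant for binning numeric values.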
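For contrast, here is a minimal sketch of what cut would look like if income had been recorded as a raw dollar amount instead of a category. This is only an illustration: income_usd, group_breaks, and group_labels are made-up names, and the numbers are invented, not taken from the survey.

```r
# Hypothetical raw incomes in dollars (NOT from the survey data).
income_usd <- c(4200, 26000, 58000, 99999, 100000, 240000)

# Lower bound of each group, plus Inf as the open upper end.
group_breaks <- c(0, 25000, 35000, 60000, 100000, 150000, Inf)
group_labels <- c("Under poverty line", "Working class",
                  "Lower middle class", "Middle class",
                  "Upper middle class", "Top 5 percent")

# right = FALSE makes each interval [lower, upper), so 100000
# falls into "Upper middle class" rather than "Middle class".
income_group <- cut(income_usd, breaks = group_breaks,
                    labels = group_labels, right = FALSE)
print(income_group)
```

Since the survey only stores the bracket a respondent picked, there is no number to bin, which is why remapping the levels is the better fit.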

Here is a simple example of the levels approach:

z <- gl(3, 2, 12) # [1] 1 1 2 2 3 3 1 1 2 2 3 3, Levels: 1 2 3
levels(z) <- list(A = c(1,3), B = 2)
z #  [1] A A B B A A A A B B A A, Levels: A B

As the example above shows, levels 1 and 3 were recoded into group A and level 2 into group B. Your problem can be handled the same way:

groups <- as.factor(sample(c("Less than $5,000",
"$5,000 to $7,499",
"$7,500 to $9,999",
"$10,000 to $12,499",
"$12,500 to $14,999",
"$15,000 to $19,999",
"$20,000to $24,999",
"$25,000 to $29,999",
"$30,000 to $34,999",
"$35,000 to $39,999",
"$40,000 to $49,999",
"$50,000 to $59,999",
"$60,000 to $74,999",
"$75,000 to $84,999",
"$85,000 to $99,999",
"$100,000 to $124,999",
"$125,000 to $149,999",
"$150,000 to $174,999",
"$175,000 or more"), size = 100, replace = TRUE))

levels(groups) <- list(
  "Under poverty line"=c("Less than $5,000",
        "$5,000 to $7,499",
        "$7,500 to $9,999",
        "$10,000 to $12,499",
        "$12,500 to $14,999",
        "$15,000 to $19,999",
        "$20,000to $24,999"),
  "Working class"=c("$25,000 to $29,999",
                    "$30,000 to $34,999"),
  "Lower middle class"=c("$35,000 to $39,999",
                         "$40,000 to $49,999",
                         "$50,000 to $59,999"), 
  "Middle class"=c("$60,000 to $74,999",
                   "$75,000 to $84,999",
                   "$85,000 to $99,999"),
  "Upper middle class"=c("$100,000 to $124,999",
                         "$125,000 to $149,999"),
  "Top 5 percent"=c("$150,000 to $174,999",
                    "$175,000 or more")
  )
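To confirm that every original category landed in the intended group, one option is to keep a copy of the original factor and cross-tabulate it against the recoded one. The sketch below is an illustration rather than part of the original answer: it uses a shortened set of categories, and the names income_levels, income_map, original, and recoded are made up for the example. As far as I know, a level that is left out of the list ends up as NA, so the anyNA check is worth running.

```r
# Hypothetical, shortened example: five of the survey categories.
income_levels <- c("Less than $5,000", "$5,000 to $7,499",
                   "$25,000 to $29,999", "$30,000 to $34,999",
                   "$175,000 or more")

set.seed(1)  # make the random sample reproducible
original <- factor(sample(income_levels, size = 50, replace = TRUE),
                   levels = income_levels)

# New level name = old levels that collapse into it.
income_map <- list(
  "Under poverty line" = c("Less than $5,000", "$5,000 to $7,499"),
  "Working class"      = c("$25,000 to $29,999", "$30,000 to $34,999"),
  "Top 5 percent"      = "$175,000 or more"
)

recoded <- original            # keep the original for comparison
levels(recoded) <- income_map  # collapse via the named list

# Each row (original category) should have counts in at most one column.
print(table(original, recoded))

# Stop with an error if any observation was left unmapped.
stopifnot(!anyNA(recoded))

# The new levels follow the order of the list.
print(levels(recoded))
```

Because the levels come out in list order, table(recoded) can be passed straight to barplot and the bars will run from the lowest income group to the highest.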
