【Question title】:Multiple if else statements to classify existing values in a new column in R
【Posted】:2021-05-05 20:49:01
【Question】:

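I have already tried to replicate the approach described in this post (Trying to create a new column using multiple if else statements in R).

I want to classify the severity of patients' blood test results. My goal is to assign each patient's existing blood work value a specific score (e.g. 0, 1, 2, 3) and then save these scores in a new column. The cutoffs are: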

if value is >=150000, score = 0
if value is <150000, score = 1
if value is <100000, score = 2
if value is <50000, score = 3
if value is <20000, score = 4

The input is:

> dput (platelets_v1)
structure(list(ID = c(13055908, 13059026, 13154920, 13201107, 
13207119, 13207948, 13234892, 13261022, 13082943, 13193903, 13259391, 
13283776, 13262499, 13154288, 13207315, 13269178, 13135316, 13055690, 
13207670, 13220627, 13233898, 13055009, 13044947, 13181075, 13261607, 
13186960, 13240091, 13060589, 13201616, 13260671, 13302375, 13021555, 
13054278, 13062360, 13035346, 13077712, 13128769, 13267480, 13160156, 
13040172, 13160971, 13239318, 12977871, 13090190, 13321288, 13040530, 
13100979, 13124511, 13192142, 13289317, 13315577, 13154966, 13044653, 
13079694, 13128639, 13165362, 13207352, 13049409, 12999835, 13210994, 
13283675, 13223721, 13064865, 13104602, 13036280, 13040507, 12964437, 
13029805, 13029001, 12993036, 13072516, 13060586, 13119819, 13040632
), platelets = c("469.000", "NA", "NA", "243.000", "NA", "NA", 
"NA", "334.000", "522.000", "NA", "NA", "NA", "NA", "312.000", 
"421.000", "NA", "321.000", "NA", "NA", "NA", "298.000", "263.000", 
"109.000", "280.000", "NA", "NA", "430.000", "288.000", "159.000", 
"528.000", "NA", "163.000", "NA", "439.000", "NA", "477.000", 
"NA", "473.000", "NA", "459.000", "183.000", "343.000", "285.000", 
"459.000", "253.000", "NA", "227.000", "NA", "569.000", "NA", 
"NA", "NA", "239.000", "382.000", "270.000", "NA", "362.000", 
"NA", "146.000", "367.000", "NA", "531.000", "NA", "363000", 
"NA", "257000", "158000", "56000", "417", "NA", "171000", "NA", 
"NA", "NA")), row.names = c(NA, -74L), class = c("tbl_df", "tbl", 
"data.frame"))

I tried the following:

> labels <- c('0', '1', '2','3', '4')
> breaks <- c(500000, 150000, 100000, 50000, 20000)
> teste01 <- platelets_v1 %>% mutate(platelets_v1 = cut(platelets_v1, breaks = breaks, labels = labels, include.lowest = TRUE))

Desired result:

ID platelets score
13055908 469000 0
13059026 NA NA
13154920 NA NA
13201107 243000 0

and so on.

Any pointers would be greatly appreciated.

【Comments】:

  • Use case_when or cut().
  • Since the output is numeric, findInterval is the best choice.
  • The platelets column in the data frame appears to be of character type.
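
The character-type observation matters before any binning. A minimal sketch, assuming the "." in values such as "469.000" is a thousands separator (the desired result in the question shows 469000 for that row, but this is an assumption about the data, not something the thread confirms). The vector `raw` is a hypothetical excerpt of the column:

```r
# Hypothetical excerpt of the platelets column (character strings).
raw <- c("469.000", "NA", "363000", "56000")

# Plain conversion reads "469.000" as the decimal number 469.
plain <- as.numeric(raw)

# Assumption: "." is a thousands separator, so drop it before converting.
cleaned <- as.numeric(gsub(".", "", raw, fixed = TRUE))

data.frame(raw, plain, cleaned)
```

Under that assumption, "469.000" becomes 469000 rather than 469, which changes the score it receives. A bare value such as "417" stays ambiguous and would need checking against the source data.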

Tags: r dataframe if-statement dplyr


【Solution 1】:
platelets_v1 %>%
  mutate(
    platelets = suppressWarnings(as.numeric(platelets)),
    bin = 5L - cut(platelets,
                   c(0, 20000, 50000, 100000, 150000, Inf), labels = FALSE)
  ) %>%
  slice(c(1:5, n() - 0:4))
# # A tibble: 10 x 3
#          ID platelets   bin
#       <dbl> <chr>     <int>
#  1 13055908 469.000       4
#  2 13059026 NA           NA
#  3 13154920 NA           NA
#  4 13201107 243.000       4
#  5 13207119 NA           NA
#  6 13040632 NA           NA
#  7 13119819 NA           NA
#  8 13060586 NA           NA
#  9 13072516 171000        0
# 10 12993036 NA           NA

platelets_v1 %>%
  mutate(
    platelets = suppressWarnings(as.numeric(platelets)),
    bin = 5L - findInterval(platelets,
                            c(0, 20000, 50000, 100000, 150000, Inf))
  )
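
The two snippets above differ at the cutoffs themselves. The following sketch uses base R only and a few hypothetical test values placed on or next to the breaks; it is meant to illustrate the interval conventions, not to replace the code above:

```r
# Same breaks as in the answer above.
breaks <- c(0, 20000, 50000, 100000, 150000, Inf)

# Hypothetical test values sitting on or next to the cutoffs.
x <- c(0, 19999, 20000, 149999, 150000, NA)

# cut() defaults to right-closed intervals (a, b], so 20000 and 150000
# land in the lower interval and 0 falls outside the first one.
score_cut <- 5L - cut(x, breaks, labels = FALSE)

# right = FALSE switches cut() to left-closed intervals [a, b).
score_cut_left <- 5L - cut(x, breaks, labels = FALSE, right = FALSE)

# findInterval() is left-closed by default.
score_fi <- 5L - findInterval(x, breaks)

data.frame(x, score_cut, score_cut_left, score_fi)
```

With the question's rules (">= 150000 is 0", "< 20000 is 4"), the left-closed variants are the ones that match: a value of exactly 150000 scores 0 with `findInterval()` but 1 with the default `cut()`.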

However, if you want the general ability to handle ranges that may not align perfectly (left-closed, right-open), then:

platelets_v1 %>%
  mutate(
    platelets = suppressWarnings(as.numeric(platelets)),
    bin = case_when(
      platelets < 20000 ~ 4, 
      platelets < 50000 ~ 3, 
      platelets < 100000 ~ 2, 
      platelets < 150000 ~ 1, 
      platelets >= 150000 ~ 0)
    )

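Order matters here: if you reverse the conditions, every result will be 0 or 1 (or NA). Also, you may be tempted to use between(), but be aware that it is closed on both sides, so between(platelets, 20000, 50000) is equivalent to 20000 <= platelets & platelets <= 50000, whereas your logic suggests you would prefer ... & platelets < 50000.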
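
To see the ordering effect in isolation, here is a small sketch on a hypothetical vector `x` (not the question's data), assuming dplyr is installed:

```r
library(dplyr)

x <- c(15000, 75000, 200000, NA)  # hypothetical values

# Narrowest condition first: each value keeps the first branch that is TRUE.
ordered <- case_when(
  x < 20000   ~ 4,
  x < 50000   ~ 3,
  x < 100000  ~ 2,
  x < 150000  ~ 1,
  x >= 150000 ~ 0
)

# Reversed: `x < 150000` already matches every smaller value,
# so the branches for 2, 3 and 4 are never reached.
reversed <- case_when(
  x >= 150000 ~ 0,
  x < 150000  ~ 1,
  x < 100000  ~ 2,
  x < 50000   ~ 3,
  x < 20000   ~ 4
)

# between() includes both endpoints.
both_ends <- between(c(20000, 50000), 20000, 50000)
```

Here `ordered` is 4, 2, 0, NA, while `reversed` collapses to 1, 1, 0, NA, and `both_ends` is TRUE for both endpoints.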

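In addition, one might be tempted to replace platelets >= 150000 ~ 0 with TRUE ~ 0, on the assumption that all remaining values must belong to that category. Since your data contains NA, I suggest you don't do that, and instead keep the default of NA for conditions that are not met.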
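
A short sketch of why the catch-all is risky with missing values, again on a hypothetical vector and assuming dplyr is installed:

```r
library(dplyr)

x <- c(15000, 200000, NA)  # hypothetical values

# Explicit final condition: NA matches no branch and stays NA.
explicit <- case_when(
  x < 20000   ~ 4,
  x < 50000   ~ 3,
  x < 100000  ~ 2,
  x < 150000  ~ 1,
  x >= 150000 ~ 0
)

# Catch-all TRUE: the NA also falls through to the last branch
# and is silently scored as 0.
fallback <- case_when(
  x < 20000  ~ 4,
  x < 50000  ~ 3,
  x < 100000 ~ 2,
  x < 150000 ~ 1,
  TRUE       ~ 0
)
```

`explicit` gives 4, 0, NA, whereas `fallback` gives 4, 0, 0, so a patient with no measurement would be scored as normal.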

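【Comments】:

  • This is exactly what I needed. Thank you so much. I won't replace platelets >= 150000 ~ 0 with TRUE ~ 0, because I need the NAs to remain visible. I find the case_when solution very elegant and easy to understand. One question: why does the order matter, giving only 0 and 1? Does this mean that whenever I need to classify a set of values, I have to start with the narrowest range and the highest score (e.g. platelets < 20000 ~ 4)?
  • A premise of case_when is that the first condition that is true is the one kept. If you have < 100000 followed by < 1000, the second will never be used, because every number less than 1000 is also less than 100000 (obviously).
  • An easy way to verify this is to reverse the order of the conditions in the case_when code block above and re-run it ... you will see that all of your data contains only NA, 0, and 1.
  • I have reversed the order of the conditions, and it worked well in the specific case where I needed to classify another blood test. Thank you so much. If possible, please let me know how I can buy you a coffee as a thank-you.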