【Question title】: Create New Column in R - Extract regular characters from other column
【Posted】: 2019-12-04 07:44:25
【Question】:

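I have a task very similar to the original poster's in this thread: Create new column in dataframe based on partial string matching other column

However, I have 10 different conditions under TEST. The original thread includes a suggestion for how to code more than 3 conditions, but I can't work out how to apply it to my data.

I want to create a column called DISTANCE that extracts the distance from the test name. So for any test with "0.10m" in its name, I want "0-10m" in the DISTANCE column. If the name contains "0.20m", I want "0-20m" in the DISTANCE column, and so on.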

PLAYER      SEX     TEST        VALUE             
Player 1    Female    ICE_0.10m    2.100000
Player 1    Female    ICE_0.20m    3.475000
Player 1    Female    ICE_10.20m    1.375000
Player 1    Female    ICE_20.30m    1.246000
Player 1    Female    ICE_0.30m    4.721000
Player 1    Female    ICE_Vel_0.10m    4.761905
Player 1    Female    ICE_Vel_0.20m    5.755396
Player 1    Female    ICE_Vel_10.20m    7.272727
Player 1    Female    ICE_Vel_20.30m    8.025682
Player 1    Female    ICE_Vel_0.30m    6.354586
Player 1    Female    OFF_0.10m    1.983000
Player 1    Female    OFF_0.20m    3.380000
Player 1    Female    OFF_10.20m    1.397000
Player 1    Female    OFF_20.30m    1.380000
Player 1    Female    OFF_0.30m    4.760000
Player 1    Female    OFF_Vel_0.10m    5.042864
Player 1    Female    OFF_Vel_0.20m    5.917160
Player 1    Female    OFF_Vel_10.20m    7.158196
Player 1    Female    OFF_Vel_20.30m    7.246377
Player 1    Female    OFF_Vel_0.30m    6.302521

I tried this, but it did not work:

SpeedLong$Distance <- ifelse(grepl("0.10m", SpeedLong$Tag, ignore.case = T), "0-10m",
ifelse(grepl("0.20m", SpeedLong$Tag, ignore.case = T), "0-20m",
ifelse(grepl("0.30m", SpeedLong$Tag, ignore.case = T), "0-30m",
ifelse(grepl("0.10m", SpeedLong$Tag, ignore.case = T), "0-10m", "20-30m"))

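With that code I get no error message, but the console shows the code ending with a + sign, which I guess means the code is incomplete? I don't know whether ifelse and grepl are the best way to solve this, so other suggestions are welcome!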

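【Comments】:

  • The + in the console appears because the parentheses are nested incorrectly or not closed. I think your example should end with four of them: )))). Consider dplyr::case_when() as an alternative to multiple ifelse calls.
  • Adding the four )))) also worked. Thanks! I will also try to find a way with case_when(). The solution below works, but it is always good to have another way of doing something.

Tags: r subset multiple-columns multiple-conditions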
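The case_when() suggestion above could look roughly like the sketch below. It assumes a data frame named df1 with the labels in a TEST column, as in the sample data (the question's own code uses SpeedLong$Tag). One caveat about the patterns in the question: the . is unescaped, and "0.20m" is also a substring of "10.20m", so a plain grepl("0.20m", ...) matches both. The anchored, escaped patterns used here are an addition for illustration, not part of the original attempt.

```r
library(dplyr)

# Small stand-in for the question's data (only the column needed here).
df1 <- data.frame(
  TEST = c("ICE_0.10m", "ICE_10.20m", "OFF_Vel_20.30m", "OFF_0.30m", "ICE_Vel_0.20m"),
  stringsAsFactors = FALSE
)

# Each pattern is anchored on the preceding "_" and the end of the string,
# and the dot is escaped so it only matches a literal ".".
df1 <- df1 %>%
  mutate(DISTANCE = case_when(
    grepl("_0\\.10m$",  TEST) ~ "0-10m",
    grepl("_0\\.20m$",  TEST) ~ "0-20m",
    grepl("_0\\.30m$",  TEST) ~ "0-30m",
    grepl("_10\\.20m$", TEST) ~ "10-20m",
    grepl("_20\\.30m$", TEST) ~ "20-30m",
    TRUE ~ NA_character_   # anything unexpected is left as NA
  ))

df1
```

Unlike a chain of nested ifelse() calls, each condition sits on its own line, so a missing closing parenthesis is much easier to spot.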


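【Solution 1】:

Instead of a nested ifelse, a better option is to extract the matching substring and change the . to - with a regular expression. The pattern matches any characters (.*) up to the _ that precedes the digits, captures the first run of digits ([0-9]+) as a group ((...)), then matches a literal dot (\\. ; the dot is a metacharacter that matches any character, so it is escaped with \\ to get its literal value), and captures the second run of digits in another group. The replacement refers back to the two captured groups with backreferences (\\1 and \\2), joined by a hyphen.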

library(dplyr)
library(stringr)
df1 %>% 
    mutate(DISTANCE = str_replace(TEST, ".*_([0-9]+)\\.([0-9]+)", "\\1-\\2"))
#     PLAYER    SEX           TEST    VALUE DISTANCE
#1  Player 1 Female      ICE_0.10m 2.100000    0-10m
#2  Player 1 Female      ICE_0.20m 3.475000    0-20m
#3  Player 1 Female     ICE_10.20m 1.375000   10-20m
#4  Player 1 Female     ICE_20.30m 1.246000   20-30m
#5  Player 1 Female      ICE_0.30m 4.721000    0-30m
#6  Player 1 Female  ICE_Vel_0.10m 4.761905    0-10m
#7  Player 1 Female  ICE_Vel_0.20m 5.755396    0-20m
#8  Player 1 Female ICE_Vel_10.20m 7.272727   10-20m
#9  Player 1 Female ICE_Vel_20.30m 8.025682   20-30m
#10 Player 1 Female  ICE_Vel_0.30m 6.354586    0-30m
#11 Player 1 Female      OFF_0.10m 1.983000    0-10m
#12 Player 1 Female      OFF_0.20m 3.380000    0-20m
#13 Player 1 Female     OFF_10.20m 1.397000   10-20m
#14 Player 1 Female     OFF_20.30m 1.380000   20-30m
#15 Player 1 Female      OFF_0.30m 4.760000    0-30m
#16 Player 1 Female  OFF_Vel_0.10m 5.042864    0-10m
#17 Player 1 Female  OFF_Vel_0.20m 5.917160    0-20m
#18 Player 1 Female OFF_Vel_10.20m 7.158196   10-20m
#19 Player 1 Female OFF_Vel_20.30m 7.246377   20-30m
#20 Player 1 Female  OFF_Vel_0.30m 6.302521    0-30m

Or using base R:

df1$DISTANCE <- sub(".*_([0-9]+)\\.([0-9]+)", "\\1-\\2", df1$TEST)
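To see what the two capture groups pick up, and what happens to labels that do not fit the pattern, here is a small base R sketch. The vector x and the label "no_distance_here" are made up for illustration. Note that sub() returns non-matching strings unchanged, so the last step (an optional addition, not part of the answer above) turns those into NA.

```r
# Hypothetical example labels; the third one has no distance in it.
x <- c("ICE_0.10m", "OFF_Vel_10.20m", "no_distance_here")
pattern <- ".*_([0-9]+)\\.([0-9]+)"

# regexec() reports the full match followed by each capture group.
regmatches(x[2], regexec(pattern, x[2]))[[1]]

# sub() leaves strings that do not match the pattern as they are ...
sub(pattern, "\\1-\\2", x)

# ... so one option is to set those entries to NA explicitly.
distance <- ifelse(grepl(pattern, x), sub(pattern, "\\1-\\2", x), NA_character_)
distance
```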

Data

df1 <- structure(list(PLAYER = c("Player 1", "Player 1", "Player 1", 
"Player 1", "Player 1", "Player 1", "Player 1", "Player 1", "Player 1", 
"Player 1", "Player 1", "Player 1", "Player 1", "Player 1", "Player 1", 
"Player 1", "Player 1", "Player 1", "Player 1", "Player 1"), 
    SEX = c("Female", "Female", "Female", "Female", "Female", 
    "Female", "Female", "Female", "Female", "Female", "Female", 
    "Female", "Female", "Female", "Female", "Female", "Female", 
    "Female", "Female", "Female"), TEST = c("ICE_0.10m", "ICE_0.20m", 
    "ICE_10.20m", "ICE_20.30m", "ICE_0.30m", "ICE_Vel_0.10m", 
    "ICE_Vel_0.20m", "ICE_Vel_10.20m", "ICE_Vel_20.30m", "ICE_Vel_0.30m", 
    "OFF_0.10m", "OFF_0.20m", "OFF_10.20m", "OFF_20.30m", "OFF_0.30m", 
    "OFF_Vel_0.10m", "OFF_Vel_0.20m", "OFF_Vel_10.20m", "OFF_Vel_20.30m", 
    "OFF_Vel_0.30m"), VALUE = c(2.1, 3.475, 1.375, 1.246, 4.721, 
    4.761905, 5.755396, 7.272727, 8.025682, 6.354586, 1.983, 
    3.38, 1.397, 1.38, 4.76, 5.042864, 5.91716, 7.158196, 7.246377, 
    6.302521)), class = "data.frame", row.names = c(NA, -20L))

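【Comments】:

  • The second piece of code worked! I entered the first piece like this: SpeedLong %>% mutate(SpeedLong$Distance = str_replace(SpeedLong$Tag, ".*_([0-9]+)\\.([0-9]+)" , "\\1-\\2")) and got an error message: Error: Column ==... must be length 2180 (the number of rows) or one, not 0
  • @VickiB. With the first version, nothing is updated unless you assign the result, i.e. df1 <- df1 %>% mutate(..
  • Thank you so much for your help! I'm new to R and don't know what any of that syntax means, since I've only used basic functions. Are there any resources you would recommend for learning what that code means?
  • @VickiB If you mean the regular expressions, regular-expressions.info/tutorial.html covers the syntax (it is fairly general, since regex is used in many languages). For the tidyverse, the vignettes can help you understand the syntax.
  • Thanks @akrun!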
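The mutate() problem raised in the comments above can be sketched as follows. The two-row SpeedLong data frame is a hypothetical stand-in for the commenter's data, which is not shown in the thread. Inside mutate() the new column and the source column are referred to by bare name, without the SpeedLong$ prefix, and the result is assigned back because mutate() returns a modified copy rather than changing the data frame in place.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the commenter's SpeedLong data frame.
SpeedLong <- data.frame(
  Tag = c("ICE_0.10m", "OFF_Vel_20.30m"),
  stringsAsFactors = FALSE
)

# Name the new column directly and assign the result back to SpeedLong.
SpeedLong <- SpeedLong %>%
  mutate(Distance = str_replace(Tag, ".*_([0-9]+)\\.([0-9]+)", "\\1-\\2"))

SpeedLong$Distance
```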