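【Question title】: How do I remove part of my character variable in R?
【Posted】: 2021-12-30 12:13:24
【Question description】:

I have a dataset in which I created separate columns for cocktail ingredients, so that each column holds one ingredient. I now have variables like this: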

ingredients <- c("1 1/2 oz Plymouth gin", "1 oz egg white", "3/4 oz lemon juice", "2 oz rye (50% abv)", "2 oz white rum (40% abv)", "3/4 oz lime juice", "3/4 oz honey syrup")

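and so on.

I need to clean it by removing all quantities (e.g. 1/2 oz, 2 dashes) and alcohol-content indicators (e.g. 47.3% abv). I tried doing it one step at a time (removing digits, then removing "1/2" and "3/4", then removing "oz", "dashes", "()", "%" and " abv"),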

library(dplyr)
library(stringr)

df %>%
  mutate(ingredient1 = str_remove(ingredient1, "[[:digit:]]+")) %>%
  mutate(ingredient1 = str_remove(ingredient1, "oz"))

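but that is a lot of work, and I am fairly sure there is a more elegant and efficient solution.

I am looking for a solution where I can tell R to remove everything up to and including "oz" or "dashes", and to remove everything from "(" onward.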

【Question discussion】:

Please provide a sample of your data: stackoverflow.com/help/minimal-reproducible-example
Added, thanks!

Tags: r regex

【Solution 1】:

Here is a starting point for your task:

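library(dplyr)
library(stringr)

df %>%
  # Drop everything up to and including the unit, in either case ("oz " / "OZ ")
  mutate(across(everything(), ~ sub(".*oz ", "", .))) %>%
  mutate(across(everything(), ~ sub(".*OZ ", "", .))) %>%
  # Drop a parenthesised note such as " (40% abv)"
  mutate(across(everything(), ~ str_replace(., " \\s*\\([^\\)]+\\)", "")))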

   ingredient1          ingredient2      ingredient3     
   <chr>                <chr>            <chr>           
 1 pisco                egg white        lime juice      
 2 Plymouth gin         egg white        lemon juice     
 3 Plymouth gin         egg white        Dolin dry vermo 
 4 rye                  simple syrup     lemon juice     
 5 white rum            lime juice       simple syrup    
 6 white rum            lime juice       honey syrup     
 7 white rum            lime juice       simple syrup    
 8 Scotch               Cherry Herring   sweet vermouth  
 9 Cognac               heavy cream      Demerara syrup  
10 white rum            lime juice       grapefruit juice
11 bourbon              grapefruit juice honey syrup     
12 Absolut Citron vodka Cointreau        cranberry juice 
13 bourbon              lemon juice      honey syrup
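
As a rough sketch of the same idea in fewer passes, the snippet below uses base R `sub()` with `ignore.case = TRUE` so that `oz` and `OZ` are handled together, and it also treats `dashes` as a unit, as the question asks. The `ingredients` vector is a hypothetical sample (the bitters entry does not appear in the original data), and the patterns are an assumption about how the quantities are written, so treat this as an illustration and not as a drop-in replacement for the code above.

```r
# Hypothetical sample input mixing the units mentioned in the question.
ingredients <- c("1 1/2 oz Plymouth gin", "2 OZ bourbon (47% abv)",
                 "2 dashes Angostura bitters", "0.875 oz lime juice")

# Step 1: drop everything up to and including the first unit ("oz" or "dashes"),
# matching case-insensitively; perl = TRUE enables the lazy quantifier and \b.
no_quantity <- sub("^.*?\\b(oz|dashes)\\s+", "", ingredients,
                   ignore.case = TRUE, perl = TRUE)

# Step 2: drop everything from the first "(" onward, plus the space before it.
cleaned <- sub("\\s*\\(.*$", "", no_quantity)

cleaned
# "Plymouth gin" "bourbon" "Angostura bitters" "lime juice"
```

Strings that contain neither unit pass through step 1 unchanged, because `sub()` returns the input as-is when the pattern does not match.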

Data:

structure(list(ingredient1 = c("2 oz pisco (40% abv)", "1 1/2 oz Plymouth gin", 
"2 oz Plymouth gin", "2 oz rye (50% abv)", "2 oz white rum (40% abv)", 
"2 oz white rum (40% abv)", "2 oz white rum (40% abv)", "1 oz Scotch (43% abv)", 
"2 oz Cognac (41% abv)", "2 oz white rum (40% abv)", "2 oz bourbon (45% abv)", 
"1 1/2 oz Absolut Citron vodka", "2 OZ bourbon (47% abv)"), ingredient2 = c("1 oz egg white", 
"1 oz egg white", "1 oz egg white", "3/4 oz simple syrup", "0.875 oz lime juice", 
"3/4 oz lime juice", "3/4 oz lime juice", "3/4 oz Cherry Herring", 
"1 oz heavy cream", "3/4 oz lime juice", "1 oz grapefruit juice", 
"3/4 oz Cointreau", "3/4 oz lemon juice"), ingredient3 = c("3/4 oz lime juice", 
"3/4 oz lemon juice", "1/2 oz Dolin dry vermo", "0.625 oz lemon juice", 
"3/4 oz simple syrup", "3/4 oz honey syrup", "3/4 oz simple syrup", 
"3/4 oz sweet vermouth", "1/4 oz Demerara syrup", "1/2 oz grapefruit juice", 
"1/2 oz honey syrup", "3/4 oz cranberry juice", "3/4 oz honey syrup"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))

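【Discussion】:

Great! And thank you for a good first experience in this community :) I appreciate it.

【Solution 2】:

Why do it in several lines when you can do it in one? With str_extract and lookaround expressions you can key on the delimiters of the target text: oz on the left, and either ( or the end of the string on the right.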

library(stringr)
str_extract(ingredients, "(?<=oz\\s).*?(?=\\s\\(|$)")
[1] "Plymouth gin" "egg white"    "lemon juice"  "rye"          "white rum"    "lime juice"  
[7] "honey syrup"
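
To apply the same pattern to every ingredient column of a data frame, one possible sketch combines it with `across()` from dplyr. The two-row `df` below is a hypothetical excerpt of the question's data, and `pattern` and `cleaned` are illustrative names only.

```r
library(dplyr)
library(stringr)

# Hypothetical two-row excerpt of the question's data.
df <- tibble(
  ingredient1 = c("2 oz pisco (40% abv)", "1 1/2 oz Plymouth gin"),
  ingredient2 = c("1 oz egg white", "3/4 oz simple syrup")
)

# Lookbehind: the match starts right after "oz" and one whitespace character.
# Lookahead: the match stops before " (" or at the end of the string.
pattern <- "(?<=oz\\s).*?(?=\\s\\(|$)"

# Extract the ingredient name from every column.
cleaned <- df %>%
  mutate(across(everything(), ~ str_extract(.x, pattern)))

cleaned
```

Here `ingredient1` becomes "pisco" and "Plymouth gin", and `ingredient2` becomes "egg white" and "simple syrup". As written, the pattern is case-sensitive, so an entry spelled with "OZ" would yield `NA`.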

【Discussion】:

This only works if the ingredient sits in exactly that position, which is likely but not guaranteed.
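
To illustrate that caveat: `str_extract()` returns `NA` when the string contains no "oz " delimiter. One way to guard against it, sketched here under the assumption that unmatched entries should simply be kept as they are, is to fall back to the original string. The last two entries of the sample vector are hypothetical.

```r
library(stringr)

# Hypothetical inputs: only the first one follows the "<amount> oz <name>" layout.
ingredients <- c("2 oz rye (50% abv)", "2 dashes Angostura bitters", "mint sprig")

extracted <- str_extract(ingredients, "(?<=oz\\s).*?(?=\\s\\(|$)")
extracted
# "rye" NA NA

# Fall back to the untouched input wherever the pattern did not match.
cleaned <- ifelse(is.na(extracted), ingredients, extracted)
cleaned
# "rye" "2 dashes Angostura bitters" "mint sprig"
```

Whether keeping the raw string is the right fallback depends on the data; flagging the `NA` rows for manual review would be another reasonable choice.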