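【Question title】: How to delete the first word and the last word in a column?
【Posted】: 2021-03-18 05:38:27
【Question】:

I am trying to remove the first word and the last word in the CCGName column in R, using only tidyverse. The CCGName column contains the word "NHS", then a city name, followed by "CCG". I want to get rid of the words "NHS" and "CCG". Is there a way to do this only with tidyverse?

Here is a sample of my data: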

structure(list(SiteType = c(111, 111, 111, 111, 111, 111, 111, 
111, 111, 111), `Call Date` = c("18/03/2020", "18/03/2020", "18/03/2020", 
"18/03/2020", "18/03/2020", "18/03/2020", "18/03/2020", "18/03/2020", 
"18/03/2020", "18/03/2020"), Gender = c("Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female"
), AgeBand = c("0-18 years", "0-18 years", "0-18 years", "0-18 years", 
"0-18 years", "0-18 years", "0-18 years", "0-18 years", "0-18 years", 
"0-18 years"), CCGCode = c("E38000004", "E38000009", "E38000020", 
"E38000023", "E38000029", "E38000010", "E38000030", "E38000035", 
"E38000008", "E38000025"), CCGName = c("NHS Barking and Dagenham CCG", 
"NHS Bath and North East Somerset CCG", "NHS Brent CCG", "NHS Bromley CCG", 
"NHS Canterbury and Coastal CCG", "NHS Bedfordshire CCG", "NHS Castle Point and Rochford CCG", 
"NHS City and Hackney CCG", "NHS Bassetlaw CCG", "NHS Calderdale CCG"
), `April20 mapped CCGCode` = c("E38000004", "E38000231", "E38000020", 
"E38000244", "E38000237", "E38000010", "E38000030", "E38000035", 
"E38000008", "E38000025"), `April20 mapped CCGName` = c("NHS Barking and Dagenham CCG", 
"NHS Bath and North East Somerset, Swindon and Wiltshire CCG", 
"NHS Brent CCG", "NHS South East London CCG", "NHS Kent and Medway CCG", 
"NHS Bedfordshire CCG", "NHS Castle Point and Rochford CCG", 
"NHS City and Hackney CCG", "NHS Bassetlaw CCG", "NHS Calderdale CCG"
), TriageCount = c(35, 9, 21, 11, 11, 27, 12, 12, 6, 9)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

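【Comments】:

  • This question states tidyverse very clearly ...
  • I think "non-tidyverse" is debatable. The (previously) accepted answer does use base-R functions (trimws, gsub), but the overall framework is tidyverse-ish (mutate + %>%).
  • What about this: "^\\w+\\s+(.*)\\s+\\w+", '\\1'? Isn't this base R as well? I just find the voting on this a bit puzzling.
  • @BenBolker I see your point. My request is to change the question title to avoid confusion, since "only tidyverse" means using the functions provided by those packages. Also, the OP asked "Is there a way to do this only with tidyverse?"
  • @GaB That is the regex you showed. It is used in str_replace; it could also go in gsub/sub, but stringr is a tidyverse package.

Tags: r string tidyverse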


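【Solution 1】:

We can use str_replace: match the first word and the whitespace after it, capture the characters that follow (up to the whitespace before the last word) as a group, and replace the whole match with a backreference to that captured group.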

library(dplyr)
library(stringr)
df2 <- df %>% 
      mutate(CCGName = str_replace(CCGName, "^\\w+\\s+(.*)\\s+\\w+", '\\1'))
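To see what each part of the pattern does, here is a minimal sketch on a toy vector. The values in `x` are made up for illustration and only mimic the "NHS <name> CCG" shape; the trailing `$` is an addition that pins the last word to the end of the string.

```r
library(stringr)

# Toy input (hypothetical values, same shape as CCGName)
x <- c("NHS Brent CCG", "NHS City and Hackney CCG", "NHS CCG")

# ^\\w+\\s+   first word plus the whitespace after it
# (.*)        everything in between, captured as group 1
# \\s+\\w+$   whitespace plus the last word, anchored at the end
res <- str_replace(x, "^\\w+\\s+(.*)\\s+\\w+$", "\\1")
print(res)
```

Note that a two-word value such as "NHS CCG" does not match the pattern (it needs two separate runs of whitespace), so it is returned unchanged.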

Or use trimws from base R:

trimws(df$CCGName, whitespace = "\\s*(NHS|CCG)\\s*")
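A small sketch of how the `whitespace` argument behaves, on made-up values. As far as I know this argument requires R 3.6.0 or later, so treat that as an assumption to verify against your installation.

```r
# Toy input (hypothetical values)
x <- c("NHS Brent CCG", "NHS Bromley CCG", "Kent and Medway")

# trimws() applies the pattern only at the start and at the end of each
# string, so a leading or trailing NHS/CCG (with adjacent spaces) is
# stripped and everything else is left alone
res <- trimws(x, whitespace = "\\s*(NHS|CCG)\\s*")
print(res)
```

Unlike the str_replace pattern, this is tied to the literal words NHS and CCG, so a value without them passes through unchanged.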

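Note: the str_replace version uses only tidyverse functions, as the OP asked for in the post. It is also a general solution: it removes the first and the last word, whatever they are.

-output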

df2$CCGName
#[1] "Barking and Dagenham"         "Bath and North East Somerset" "Brent"                        "Bromley"                     
#[5] "Canterbury and Coastal"       "Bedfordshire"                 "Castle Point and Rochford"    "City and Hackney"            
#[9] "Bassetlaw"                    "Calderdale"

【Comments】:

    【Solution 2】:

    You can also try:

    library(dplyr)
    #Code
    df <- df %>% mutate(CCGName=trimws(gsub('NHS|CCG','',CCGName)))
    

    Output:

    df$CCGName
     [1] "Barking and Dagenham"         "Bath and North East Somerset"
     [3] "Brent"                        "Bromley"                     
     [5] "Canterbury and Coastal"       "Bedfordshire"                
     [7] "Castle Point and Rochford"    "City and Hackney"            
     [9] "Bassetlaw"                    "Calderdale"  
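One thing worth checking with an unanchored pattern like 'NHS|CCG' is that it matches anywhere in the string, not only at the ends. The sketch below compares it with an anchored variant; the second value in `x` is hypothetical and does not occur in the sample data.

```r
# Toy input; the second value is made up and has "NHS" in the middle
x <- c("NHS Brent CCG", "NHS North NHS Trust CCG")

# Unanchored: every NHS/CCG is removed, then the outer spaces are trimmed
unanchored <- trimws(gsub("NHS|CCG", "", x))

# Anchored: only a leading "NHS " and a trailing " CCG" are removed
anchored <- gsub("^NHS\\s+|\\s+CCG$", "", x)

print(unanchored)
print(anchored)
```

For the sample data in the question both give the same result, since NHS and CCG appear only at the ends.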
    

    You can also get the same output with the following code (many thanks and credit to @BenBolker):

    #Code 2
    library(stringr)
    # str_remove() drops only the first match; str_remove_all() strips both ends
    df <- df %>% mutate(CCGName = str_remove_all(CCGName, "^NHS\\s+|\\s+CCG$"))
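A quick check, on a toy vector, of how str_remove and str_remove_all behave with an alternation pattern like this one. The values in `x` are made up for illustration.

```r
library(stringr)

# Toy input (hypothetical values, same shape as CCGName)
x <- c("NHS Brent CCG", "NHS City and Hackney CCG")
pattern <- "^NHS\\s+|\\s+CCG$"

# str_remove() removes only the first match in each string
first_only <- str_remove(x, pattern)

# str_remove_all() removes every match, so both ends are stripped
both_ends <- str_remove_all(x, pattern)

print(first_only)
print(both_ends)
```

With an alternation that is meant to hit both the start and the end of the string, the `_all` variant is the one that clears both.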
    

    【Comments】:

    • This is really elegant, Duck. Thanks
    • @GaB Always happy to help!
    • @BenBolker Very valid point, Dr. Bolker! I will add it to the solution.