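【Question title】: Remove specific words conditionally in R
【Posted】: 2023-03-30 19:39:02
【Question description】:

I am trying to remove a list of words from sentences, depending on specific conditions.

Suppose we have this data frame: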

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)

> df
     questions                           responses          
[1,] "The highest mountain in the world" "The Himalaya"     
[2,] "A cold war serie from 2013"        "The Americans"    
[3,] "A kiwi which is not a fruit"       "A bird"           
[4,] "Widest liquid area on earth"       "The Pacific ocean"

And the following lists of specific words:

articles <- c("The","A")
geowords <- c("mountain","liquid area")

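I want to do two things:

  1. Remove the article in first position in the responses column when it is adjacent to a word starting with a lowercase letter

  2. Remove the article in first position in the responses column when it is adjacent to a word starting with an uppercase letter and a geoword appears in the corresponding question

The expected result would be: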

     questions                           responses      
[1,] "The highest mountain in the world" "Himalaya"     
[2,] "A cold war serie from 2013"        "The Americans"
[3,] "A kiwi which is not a fruit"       "bird"         
[4,] "Widest liquid area on earth"       "Pacific ocean"

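I tried gsub without success, as I am not at all familiar with regular expressions... I searched Stack Overflow but did not find a really similar question. If an R and regex all-star could help me, I would be very grateful!

【Question discussion】:

  • How do you end up keeping "The" in "The Americans"?

Tags: r regex dataframe text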


【Solution 1】:

As you described it: build two logical columns, then use ifelse to check them and gsub to do the replacement.

library(stringr)  # provides str_split()

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- data.frame(cbind(questions,responses), stringsAsFactors = FALSE)

df

articles <- c("The ","A ")
geowords <- c("mountain","liquid area")

# TRUE when the second word of the response starts with an uppercase letter
df$f_caps <- unlist(lapply(df$responses, function(x) {grepl('[A-Z]',str_split(str_split(x,' ', simplify = TRUE)[2],'',simplify = TRUE)[1])}))

# TRUE when the question contains one of the geowords
df$geoword_flag <- grepl(paste(geowords,collapse='|'),df[,1])

# Strip the leading article when the next word is lowercase,
# or when it is uppercase and the question contains a geoword
df$new_responses <- ifelse((df$f_caps & df$geoword_flag) | !df$f_caps,
                     gsub(paste0('^(?:', paste(articles,collapse='|'), ')'),'', df$responses),
                     df$responses)

df$new_responses


> df$new_responses
[1] "Himalaya"      "The Americans" "bird"          "Pacific ocean"
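
The `f_caps` check above can probably be written without splitting the strings at all. A minimal base-R sketch, assuming each response is a sequence of space-separated words; `second_word_caps` is a hypothetical name that does not appear in the answer, and `[[:upper:]]` is used in place of `[A-Z]` to stay independent of the locale:

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")

# "^\\S+\\s+" skips the first word and the whitespace after it,
# then "[[:upper:]]" requires the next character to be an uppercase letter.
second_word_caps <- grepl("^\\S+\\s+[[:upper:]]", responses)

print(second_word_caps)
```

Because `grepl` is vectorised, this avoids the `lapply`/`unlist` round trip and the dependency on stringr for that step.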

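【Discussion】:

  • Thanks amrrs, you are a master (I was far from getting the code to work). Just one question: I don't really understand how stringsAsFactors = F works. If I don't specify stringsAsFactors = F, why does "The Americans" become "2"?
  • Without stringsAsFactors = F the column is a factor, so ifelse returns the integer codes of the factor levels rather than the actual values. That is why keeping the column as character returns the correct text.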
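
The "2" mentioned above can be reproduced in isolation. This is a small sketch, not part of the original answer: `stripped`, `strip_it`, `as_text` and `as_codes` are hypothetical names, and the factor is built explicitly with `factor()` because `data.frame()` defaults to `stringsAsFactors = FALSE` since R 4.0.0.

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
stripped  <- sub("^(The|A) ", "", responses)
strip_it  <- c(TRUE, FALSE, TRUE, TRUE)  # same decision the answer reaches

# Character input: the untouched response comes back as text
as_text <- ifelse(strip_it, stripped, responses)

# Factor input: ifelse() does not keep the factor class, so the untouched
# response comes back as its integer level code, coerced to a string
as_codes <- ifelse(strip_it, stripped, factor(responses))

print(as_text)
print(as_codes)
```

"The Americans" is the second level once the four responses are sorted, which is where the "2" comes from.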
【Solution 2】:

I taught myself some R today. I used a function to get the same result.

#!/usr/bin/env Rscript

# References
# https://stackoverflow.com/questions/1699046/for-each-row-in-an-r-dataframe

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)

articles <- c("The","A")
geowords <- c("mountain","liquid area")

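# Leading article followed by a space, e.g. "^(?:The|A) "
common_pattern <- paste0("^(?:", paste(articles, collapse = "|"), ") ")
pattern1 <- paste0(common_pattern, "([a-z])")  # article before a lowercase word
pattern2 <- paste0(common_pattern, "([A-Z])")  # article before an uppercase word
geo_pattern <- paste(geowords, collapse = "|")

f <- function(x) {
  q <- x[1]  # question
  r <- x[2]  # response
  # Always drop the article before a lowercase word
  a1 <- gsub(pattern1, "\\1", r, perl = TRUE)
  # Drop it before an uppercase word only if the question contains a geoword
  if (grepl(geo_pattern, q)) {
    a1 <- gsub(pattern2, "\\1", a1, perl = TRUE)
  }
  a1
}

apply(df, 1, f)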

Running it:

Rscript stacko.R
[1] "Himalaya"      "The Americans" "bird"          "Pacific ocean"

【Discussion】:

【Solution 3】:

You can also use a simple regex with grepl and gsub, as follows:

    # cbind() returns a matrix, so convert it to a data frame; stringsAsFactors = FALSE keeps the columns as character
    df <- data.frame(cbind(questions,responses), stringsAsFactors = FALSE)
    regx <- paste0(geowords, collapse="|")        # "or" pattern over the geowords
    articlegrep <- paste0(articles, collapse="|") # "or" pattern over the articles
    # Strip the first word when the question has a geoword, or when an article precedes a lowercase word
    df$responses <- ifelse(grepl(regx, df$questions)|grepl(paste0("(",articlegrep,")","\\s[a-z]"), df$responses),
           gsub("\\w+ (.*)","\\1",df$responses),df$responses)
    
    > print(df)
                              questions     responses
    #1 The highest mountain in the world      Himalaya
    #2        A cold war serie from 2013 The Americans
    #3       A kiwi which is not a fruit          bird
    #4       Widest liquid area on earth Pacific ocean
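
One thing to keep in mind with the approach above: `\\w+ (.*)` removes whatever the first word is, whether or not it is an article. The sketch below contrasts it with a pattern restricted to the listed articles. "Mount Everest" is a made-up response used only for illustration, and `article_pattern`, `generic` and `strict` are hypothetical names.

```r
articles  <- c("The", "A")
responses <- c("The Himalaya", "Mount Everest", "A bird")

# Generic first-word pattern from the answer: also removes "Mount"
generic <- gsub("\\w+ (.*)", "\\1", responses)

# Restricted pattern: only a listed article at the very start, followed by a space
article_pattern <- paste0("^(?:", paste(articles, collapse = "|"), ") ")
strict <- sub(article_pattern, "", responses, perl = TRUE)

print(generic)
print(strict)
```

For the data in the question both patterns give the same result, so the difference only matters if responses can start with a non-article word.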
    

【Discussion】:

【Solution 4】:

For fun, here is a tidyverse solution:

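      library(dplyr)    # mutate(), if_else(), %>%
      library(stringr)  # str_detect(), str_replace(), regex()
      library(tibble)   # as_tibble()

      df2 <-
        df %>%
        as_tibble() %>%
        mutate(responses =
              # Does the question contain any geoword? (one collapsed pattern, so no recycling)
              if_else(str_detect(questions, paste(geowords, collapse = "|")),
                      # Yes: drop the first word when the next one starts with an uppercase letter
                      str_replace(string = responses,
                                  pattern = regex("\\w+\\b\\s(?=[A-Z])"),
                                  replacement = ""),
                      # No: drop the first word only when the next one starts with a lowercase letter
                      str_replace(string = responses,
                                  pattern = regex("\\w+\\b\\s(?=[a-z])"),
                                  replacement = ""))
              )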
      

Edit: a version without the "first word" regex, inspired by @Calvin Taylor

      # Define articles
      articles <- c("The", "A")
      
      # Make it a regex alternation
      art_or <- paste0(articles, collapse = "|")
      
      # Before a lowercase / uppercase
      art_upper <- paste0("(?:", art_or, ")", "\\s", "(?=[A-Z])")
      art_lower <- paste0("(?:", art_or, ")", "\\s", "(?=[a-z])")
      
      # Work on df
      df4 <-
        df %>%
        as_tibble() %>%
        mutate(responses =
              # Collapse geowords into one pattern so every question is tested against all of them
              if_else(str_detect(questions, paste(geowords, collapse = "|")),
                      str_replace_all(string = responses,
                                      pattern = regex(art_upper),
                                      replacement = ""),
                      str_replace_all(string = responses,
                                      pattern = regex(art_lower),
                                      replacement = "")
                      )
              )
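
The lookahead idea also works in base R when `perl = TRUE` is set. The following is only a sketch under the assumptions of the question (same `questions`, `responses`, `articles` and `geowords`); `before_lower`, `before_upper`, `has_geo` and `out` are hypothetical names, and the patterns are anchored with `^` so that only an article in first position is touched.

```r
questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
articles  <- c("The", "A")
geowords  <- c("mountain", "liquid area")

art_or  <- paste(articles, collapse = "|")
has_geo <- grepl(paste(geowords, collapse = "|"), questions)

# (?=...) inspects the next character without consuming it
before_lower <- paste0("^(?:", art_or, ")\\s(?=[a-z])")
before_upper <- paste0("^(?:", art_or, ")\\s(?=[A-Z])")

# Rule 1: always strip the article before a lowercase word
out <- sub(before_lower, "", responses, perl = TRUE)
# Rule 2: strip it before an uppercase word only when the question has a geoword
out <- ifelse(has_geo, sub(before_upper, "", out, perl = TRUE), out)

print(out)
```

Applying rule 1 unconditionally and rule 2 only where `has_geo` is TRUE follows the two conditions as stated in the question, which differs slightly from the either/or branching of `if_else` above.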
      

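【Discussion】:

• By the way, I wonder whether referring to the list of articles would be more effective than the "first word" regex. My point is that the latter does not work for another language (such as French), where the article may be attached to the second word (without a space), e.g. "L'inspecteur Clouzot" => in this case the "L'" won't be removed, because the third word is treated as the second...
• I solved this by changing the code from amrrs: str_split(x,"[' ]+", simplify = T), but I don't know how to do it the tidyverse way...
• Thanks meriops, also a very nice solution. I think you just need to replace "\\s*" with "\\s" in the regex definition, otherwise the "A" in "Americans" will be removed...
• @Tau: I made the edit, but with the "positive lookahead" the "A" should not be eaten, even with "\s*" (i.e. 0 or more whitespace characters).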
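
For the French case raised in the comments, one possible sketch is to store each article together with whatever separates it from the next word (an apostrophe or a space). The article list and the sample responses below are assumptions made for illustration, not taken from the thread:

```r
responses <- c("L'inspecteur Clouzot", "Le chat", "The Americans")

# Hypothetical article list: elided articles carry their apostrophe,
# the others carry their trailing space
articles <- c("L'", "Le ", "La ", "The ", "A ")

# Leading article, followed (without consuming it) by a lowercase letter
pattern <- paste0("^(?:", paste(articles, collapse = "|"), ")(?=[a-z])")
out <- sub(pattern, "", responses, perl = TRUE)

print(out)
```

This only covers the lowercase rule; the uppercase-plus-geoword rule would need a second pattern built the same way.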