Use regular expressions inside only the end portion of strings: answers

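【Question title】: Use regular expressions inside only the end portion of strings
【Posted】: 2020-08-10 21:12:40
【Question description】:

I am preprocessing a data frame of more than 100,000 blog URLs, many of which include words from the blog title. The grep function has let me drop many of those URLs because they relate to archives, feeds, images, attachments, or various other reasons. One of those reasons is that they contain "atom".

For example,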

string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one"
df <- data.frame(row, string)
df$string <- as.character(df$string)
df[-grep("atom", df$string), ]

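My problem is that the pattern "atom" may also appear in the blog title, which is important content, and I do not want to drop those URLs.

How can I restrict grep to only the last 20 characters (or some similar number), which would greatly reduce the risk of matching the pattern in title content rather than in a trailing element? The question "Regular Expressions _# at end of string" uses $ at the end but does not use R; besides, I do not know how to extend $ backwards by 20 characters.

Assume the pattern does not always have a forward slash at one or both ends, as it does in /atom/.

The substr function can isolate the end portion of a string, but I do not know how to grep only within that portion. The pseudocode below uses the %in% operator to try to illustrate what I want to do.

substr(df$string, nchar(df$string)-20, nchar(df$string)) # extract the tail: from nchar - 20 through the end (21 characters)

But what is the next step?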

string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]

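Thank you for any guidance.

【Question comments】:

Can you just search for "/atom/"?
As I wrote, Matthew, there is not always a forward slash before or after.
Can't you filter out all the archives first?

Tags: r regex string

【Solution 1】: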

lastpart <- substr(df$string, nchar(df$string) - 20, nchar(df$string))
if (length(grep("atom", lastpart)) > 0) {
  # "atom" was found in the tail
} else {
  # "atom" was not found in the tail
}

You can also do this without the intermediate lastpart variable:

if(length(grep("atom",substr(df$string, nchar(df$string)-20, nchar(df$string))))>0){
  # atom was in there
} else {
  # atom was not in there
}

But that becomes harder to read (though it may perform slightly better).
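
To filter the rows of a data frame rather than test a single value, the same tail-only check can be vectorised with grepl(), which returns one TRUE/FALSE per URL. This is a minimal sketch, not part of the original answer: the helper name tail_has and the second example URL are made up for illustration.

```r
# Hypothetical helper: TRUE where `pattern` occurs in the last `n` characters of each string.
tail_has <- function(x, pattern, n = 20) {
  len <- nchar(x)
  # substr() treats a start below 1 as 1, so strings shorter than n are searched whole
  tail_part <- substr(x, len - n + 1, len)
  grepl(pattern, tail_part, fixed = TRUE)
}

# Made-up example data: one feed URL and one article URL
df <- data.frame(
  row = c("one", "two"),
  string = c(
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows whose tail does NOT contain "atom"
kept <- df[!tail_has(df$string, "atom"), ]
```

Indexing with a logical vector (!tail_has(...)) rather than -grep(...) also avoids returning zero rows when nothing matches, because -integer(0) selects nothing.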

【Comments】:

【Solution 2】:

You could try a URL component depth approach (i.e. only return the rows of df in which the word "atom" first appears after 5 slashes):

find_first_match <- function(string, pattern) {
  # Split the URL into its "/"-separated components
  components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
  matches <- grepl(pattern = pattern, x = components)
  if (any(matches)) {
    # which.max() on a logical vector returns the index of the first TRUE
    first.match <- which.max(matches)
  } else {
    first.match <- NA_integer_
  }
  return(first.match)
}

It can be used as follows:

# Add the index of the first component matching "atom" in each URL
df$first.match <- sapply(df$string, find_first_match, pattern = "atom", USE.NAMES = FALSE)

# Return rows in which "atom" first appears only after the first 5 components
df[which(df$first.match >= 6), ]

#   row                                                                                 string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/           6

This lets you control which URLs are returned based on the depth at which "atom" appears.
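
To make "depth" concrete, the sketch below splits the example URL from the question and lists every component position at which "atom" occurs. It is only an illustration; the variable names are arbitrary.

```r
url <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"

# Split on "/": the empty string after "http:" counts as a component,
# while the trailing slash does not produce one
components <- strsplit(url, split = "/", fixed = TRUE)[[1]]

# Positions of every component that contains "atom"
depths <- which(grepl("atom", components, fixed = TRUE))
depths
# 6 8  (6 is the blog-title slug, 8 is the feed element)

first_depth <- min(depths)  # the depth a first-match search reports
last_depth <- max(depths)   # the depth of the trailing "atom" element
```

For this URL the first match is the title slug at depth 6, and the feed element sits at depth 8, so whether to compare against the first or the last match depends on which of the two you want to act on.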

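【Comments】:

【Solution 3】:

I chose the substr() answer because it is easier to understand, and because with the component-depth answer I cannot predict how many forward slashes a given URL will contain.

Read from the innermost function outwards, the substr() answer says in plain English: use the substr() function to define the last 20 characters of the string, i.e. your substring;

then use the grep() function to find whether the pattern "atom" is in that substring;

then check whether "atom" was found in the substring at least once, so that length is greater than zero, in which case the row will be omitted;

finally, if the pattern does not match, i.e. "atom" is not found in the last 20 characters, the row is kept. All of this is handled by the if ... else construct.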
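
The same tail-only test can also be written as a single anchored regular expression, which speaks to the original question of how to extend $ backwards. The sketch below is an assumption-laden illustration rather than anything from the answers: the helper name tail_regex is made up, the second URL is invented, and the search word is assumed to contain no regex metacharacters.

```r
# Hypothetical helper: build a regex that matches a literal `word` lying
# entirely within the last `n` characters of a string.
# Assumes `word` contains no regex metacharacters.
tail_regex <- function(word, n = 20) {
  # After the word, allow at most n - nchar(word) further characters before the end
  paste0(word, ".{0,", n - nchar(word), "}$")
}

pattern <- tail_regex("atom")  # "atom.{0,16}$"

urls <- c(
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/"
)

# One TRUE/FALSE per URL: is "atom" within the last 20 characters?
in_tail <- grepl(pattern, urls)

# Keep the URLs whose tail does not contain "atom"
kept_urls <- urls[!in_tail]
```

Because grepl() is vectorised, no if ... else is needed: the logical vector can index the URLs (or the rows of a data frame) directly.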

【Comments】: