Filter out substrings in a string vector: answers

【Question title】: Filter out substrings in a string vector
【Posted】: 2018-09-28 17:33:19
【Question】:

I have a string vector like this:

"I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples." , "Mangoes and Apples and Honey"

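I want a string vector with every element removed that occurs, in full, as a substring of any other element of the input vector. That is, the result would be: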

"Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples and Honey"

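Order does not matter. Here the first two entries are removed because they are substrings of the third-to-last entry. The second-to-last entry is removed because it is also a substring of an earlier entry.

Any help would be much appreciated. This is part of a phrase detection task I am running on a corpus.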

【Comments on the question】:

Tags: python r regex

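【Solution 1】:

You can use grepl with word boundaries so that each element is matched as an exact string against every element of the vector. Elements with more than one match (one match being the element itself) are the ones to discard, i.e.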

R solution

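# x is the input vector from the question.
# Column i counts how many elements of x match x[i]; keep those with at most one match.
v1 = colSums(sapply(x, function(i) grepl(paste0('\\b', i, '\\b'), x))) <= 1
names(v1)[v1]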
#[1] "Apples are good for health"  "I live in America" "I love Mangoes and Apples and Strawberries."
#[4] "Mangoes and Apples and Honey"

Python solution

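import re
from itertools import compress

# x is the input list from the question
x = ["I love Mangoes.", "I love Mangoes and Apples.", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries.",
     "Mangoes and Apples.", "Mangoes and Apples and Honey"]

# Flag a string as kept only if it matches exactly one element of x (itself).
# Each string is used as a regex pattern, so '.' matches any character.
v2 = []
for i in x:
    i1 = sum([re.search(i, a) is not None for a in x]) == 1
    v2.append(i1)

list(compress(x, v2))
#['Apples are good for health', 'I live in America', 'I love Mangoes and Apples and Strawberries.', 'Mangoes and Apples and Honey']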
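
One thing the Python snippet above relies on: each string is passed to re.search as a pattern, so the trailing "." matches any character. That is why "I love Mangoes." counts as occurring inside "I love Mangoes and Apples.". Below is a minimal sketch of the difference between pattern and literal matching on the question's data. The helper name keep_unique and its literal flag are hypothetical, made up for this illustration and not part of the answer.

```python
import re

x = [
    "I love Mangoes.",
    "I love Mangoes and Apples.",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries.",
    "Mangoes and Apples.",
    "Mangoes and Apples and Honey",
]


def keep_unique(strings, literal):
    """Keep strings that match exactly one element of `strings` (themselves).

    Hypothetical helper for illustration only.
    """
    kept = []
    for s in strings:
        # re.escape turns regex metacharacters such as '.' into literals.
        pattern = re.escape(s) if literal else s
        hits = sum(re.search(pattern, other) is not None for other in strings)
        if hits == 1:
            kept.append(s)
    return kept


as_regex = keep_unique(x, literal=False)    # '.' acts as a wildcard
as_literal = keep_unique(x, literal=True)   # '.' must match a real period
print(as_regex)
print(as_literal)
```

With pattern matching the four strings from the question survive. With literal matching only "Mangoes and Apples." is dropped, because it is the only entry that appears character for character inside another one. If literal matching is what you want, stripping trailing punctuation first may be a reasonable preprocessing step.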

【Comments】:

A slight variation: s[colSums(sapply(s, function(x) grepl(x, setdiff(s, x)))) < 1] (based on this)

【Solution 2】:

You could do it like this:

vec <- c("I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , 
         "I live in America" , "I love Mangoes and Apples and Strawberries." , 
         "Mangoes and Apples." , "Mangoes and Apples and Honey")

vec <- vec[order(nchar(vec))] #sort by string length

vec[!c(sapply(2:length(vec), #iterate from shortest to longest
              function(i) any(grepl(vec[i-1], vec[i:length(vec)]))), #check whether shorter is included in any longer
       FALSE)] #add value for final (longest) entry

[1] "I live in America"                           "Apples are good for health"                 
[3] "Mangoes and Apples and Honey"                "I love Mangoes and Apples and Strawberries."
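
For readers working in Python, here is a rough sketch of the same sort-by-length idea. It assumes the question's data and uses re.search to mirror grepl without fixed = TRUE, so each string is again treated as a pattern and "." matches any character. The variable names are my own, not from the answer.

```python
import re

vec = [
    "I love Mangoes.",
    "I love Mangoes and Apples.",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries.",
    "Mangoes and Apples.",
    "Mangoes and Apples and Honey",
]

# Sort by length so a string only needs to be checked against the
# entries that come after it (sorted() is stable, so ties keep input order).
by_length = sorted(vec, key=len)

kept = []
for idx, short in enumerate(by_length):
    longer = by_length[idx + 1:]
    # `short` is used as a regex pattern, as grepl does by default.
    if not any(re.search(short, other) for other in longer):
        kept.append(short)

print(kept)
```

The last (longest) entry is always kept because there is nothing after it to check against, which matches the FALSE appended in the R version.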

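【Comments】:

【Solution 3】:

We can also use combn to enumerate all pairwise string comparisons, then apply grepl to every pair to remove the strings that match inside another string.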

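# Each column of df is one pair (s[i], s[j]) with i < j.
df <- as.data.frame(combn(s, 2))
# Collect the first member of every pair that matches inside the second member.
# Only this direction is tested, so a string is caught only if a string containing it comes later in s.
rmv <- unique(unname(unlist(df[1, sapply(df, function(x) grepl(x[1], x[2]))])))
s[!(s %in% rmv)]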
#[1] "Apples are good for health"
#[2] "I live in America"
#[3] "I love Mangoes and Apples and Strawberries"
#[4] "Mangoes and Apples and Honey"
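
A possible Python counterpart of the pairwise idea, as a sketch rather than a translation of the R code. It differs in two ways: it uses itertools.permutations instead of combinations so that both directions of each pair are checked, and it uses the plain in operator (literal substring test) instead of a regex. The data mirrors this answer's punctuation-free sample vector s.

```python
from itertools import permutations

s = [
    "I love Mangoes",
    "I love Mangoes and Apples",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries",
    "Mangoes and Apples",
    "Mangoes and Apples and Honey",
]

# permutations(s, 2) yields every ordered pair (a, b) taken from two
# different positions in s. Collect each a that occurs inside some b.
contained = {a for a, b in permutations(s, 2) if a in b}

# Keep the original order, dropping everything found inside another string.
result = [item for item in s if item not in contained]
print(result)
```

Note that exact duplicates would remove each other in this sketch, since each copy is a substring of the other. Deduplicate first if that matters for your data.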

Sample data

s <- c(
    "I love Mangoes" ,
    "I love Mangoes and Apples" ,
    "Apples are good for health" ,
    "I live in America" ,
    "I love Mangoes and Apples and Strawberries" ,
    "Mangoes and Apples" ,
    "Mangoes and Apples and Honey")

【Comments】: