[Question title]: Filter out substrings in a string vector
[Posted]: 2018-09-28 17:33:19
[Question]:

I have a vector of strings like this:

"I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples." , "Mangoes and Apples and Honey"

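I want a vector of strings in which every element that is a complete substring match of any other element of the input vector has been filtered out. That is, the result would look like this: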

"Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples and Honey"

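Order does not matter. Here, the first two entries are removed because they are substrings of the third-to-last entry. The second-to-last entry is removed because it is also a substring of an earlier entry.

Any help would be much appreciated. This is part of a phrase detection I am running on a corpus.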

 

[Question discussion]:

    Tags: python r regex


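    [Solution 1]:

    You can use grepl with word boundaries to match each element exactly against every element of the vector. Elements with more than one match (one match being the element itself) are the ones to discard, i.e.

    R solution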

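    # x is the character vector from the question
    # Column i counts how many elements of x match element i wrapped in word boundaries
    v1 = colSums(sapply(x, function(i) grepl(paste0('\\b', i, '\\b'), x))) <= 1
    # Keep the elements with at most one match (sapply names the columns after x)
    names(v1)[v1]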
    #[1] "Apples are good for health"  "I live in America" "I love Mangoes and Apples and Strawberries."
    #[4] "Mangoes and Apples and Honey" 
    

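    Python solution

    import re
    from itertools import compress
    
    # x is the list of strings from the question
    v2 = []
    for i in x:
        # Each string is used as a regex pattern, so "." matches any character;
        # exactly one match means the string only matched itself
        i1 = sum([re.search(i, a) is not None for a in x]) == 1
        v2.append(i1)
    
    # Keep the strings whose flag is True
    list(compress(x, v2))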
    #['Apples are good for health', 'I live in America', 'I love Mangoes and Apples and Strawberries.', 'Mangoes and Apples and Honey']
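
The Python version above passes each string to re.search as a raw pattern and does not use the word boundaries from the R version. A minimal sketch of a boundary-aware variant follows; it is an illustration rather than the answerer's code. The function name drop_contained_phrases is hypothetical, the strings are matched literally via re.escape, and the data is the period-free sample from Solution 3 (with literal matching, the trailing periods in the question's data would no longer match a following space).

```python
import re

def drop_contained_phrases(phrases):
    """Keep the phrases that do not occur, as whole words, inside another phrase."""
    kept = []
    for i, phrase in enumerate(phrases):
        # re.escape makes characters such as "." match literally;
        # \b requires a word boundary on both sides of the phrase.
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
        # Compare against every other position, skipping the phrase itself.
        found_elsewhere = any(
            pattern.search(other)
            for j, other in enumerate(phrases)
            if j != i
        )
        if not found_elsewhere:
            kept.append(phrase)
    return kept

# Period-free sample data (assumption: same strings as in Solution 3).
s = [
    "I love Mangoes",
    "I love Mangoes and Apples",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries",
    "Mangoes and Apples",
    "Mangoes and Apples and Honey",
]
print(drop_contained_phrases(s))
```

Because of the boundaries, "Apple" would not count as contained in "Apples are good", whereas a plain substring test would treat it as contained.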
    

    [Discussion]:

    • Slightly different: s[colSums(sapply(s, function(x) grepl(x, setdiff(s, x)))) < 1] (based on this)

    [Solution 2]:

    You could do it like this...

    vec <- c("I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , 
             "I live in America" , "I love Mangoes and Apples and Strawberries." , 
             "Mangoes and Apples." , "Mangoes and Apples and Honey")
    
    vec <- vec[order(nchar(vec))] #sort by string length
    
    vec[!c(sapply(2:length(vec), #iterate from shortest to longest
                  function(i) any(grepl(vec[i-1], vec[i:length(vec)]))), #check whether shorter is included in any longer
           FALSE)] #add value for final (longest) entry
    
    [1] "I live in America"                           "Apples are good for health"                 
    [3] "Mangoes and Apples and Honey"                "I love Mangoes and Apples and Strawberries."
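
The sort-by-length idea in this answer can also be sketched in Python. This is an illustration under stated assumptions, not the answerer's code: the function name drop_substrings_by_length is hypothetical, containment is tested literally with the in operator instead of a regex, and the data is the period-free sample from Solution 3.

```python
def drop_substrings_by_length(strings):
    """Drop every string that is literally contained in a later, longer-or-equal string."""
    # Sort from shortest to longest; a string can only be contained
    # in a string that is at least as long as itself.
    ordered = sorted(strings, key=len)
    kept = []
    for i, candidate in enumerate(ordered):
        # Only the strings after position i need to be checked.
        if not any(candidate in other for other in ordered[i + 1:]):
            kept.append(candidate)
    return kept

# Period-free sample data (assumption: same strings as in Solution 3).
s = [
    "I love Mangoes",
    "I love Mangoes and Apples",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries",
    "Mangoes and Apples",
    "Mangoes and Apples and Honey",
]
print(drop_substrings_by_length(s))
```

As in the R version, the result comes back ordered by string length rather than in input order.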
    

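    [Discussion]:

      [Solution 3]:

      We can also use combn to enumerate all pairwise string comparisons, then apply grepl to every pair to remove the strings that match inside another string.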

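      # combn() yields each pair once, in the order the strings appear in s,
      # so a string is only tested against the strings that come after it
      df <- as.data.frame(combn(s, 2))
      # Collect the first string of every pair that matches inside the second
      rmv <- unique(unname(unlist(df[1, sapply(df, function(x) grepl(x[1], x[2]))])))
      s[!(s %in% rmv)]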
      #[1] "Apples are good for health"
      #[2] "I live in America"
      #[3] "I love Mangoes and Apples and Strawberries"
      #[4] "Mangoes and Apples and Honey"
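
The pairwise idea can be sketched in Python as well. This is an illustration, not the answerer's code: the function name drop_by_pairwise_check is hypothetical, it uses itertools.permutations so that each pair is tested in both directions (combn tests each pair in one direction only), and containment is tested literally with the in operator.

```python
from itertools import permutations

def drop_by_pairwise_check(strings):
    """Remove every string that is literally contained in another string of the list."""
    # permutations(strings, 2) yields every ordered pair of distinct positions,
    # so the result does not depend on the order of the input.
    contained = {
        inner
        for inner, outer in permutations(strings, 2)
        if inner in outer
    }
    # Note: exact duplicates contain each other, so all copies would be removed.
    return [text for text in strings if text not in contained]

# Same strings as the sample data of this answer.
s = [
    "I love Mangoes",
    "I love Mangoes and Apples",
    "Apples are good for health",
    "I live in America",
    "I love Mangoes and Apples and Strawberries",
    "Mangoes and Apples",
    "Mangoes and Apples and Honey",
]
print(drop_by_pairwise_check(s))
```

Like combn, this compares all pairs, so the work grows quadratically with the number of strings.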
      

      Sample data

      s <- c(
          "I love Mangoes" ,
          "I love Mangoes and Apples" ,
          "Apples are good for health" ,
          "I live in America" ,
          "I love Mangoes and Apples and Strawberries" ,
          "Mangoes and Apples" ,
          "Mangoes and Apples and Honey")
      

      [Discussion]:
