Merge/Discard overlapping words: answers

【Question title】: Merge/Discard overlapping words
【Posted】: 2013-06-05 11:13:50
【Question】:

I want to merge similar strings (words), that is, strings that are contained in other strings.

 word
 wor 
 words
 wormhole
 hole

should produce:

words
wormhole

wor overlaps with word, words and wormhole, so wor is discarded;
word overlaps with words, so word is discarded;
hole overlaps with wormhole, so hole is discarded;
but words and wormhole do not overlap, so they stay.
How can I do this?

Edit
My solution is:

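while IFS= read -r a
do
   # -F: treat the word as a fixed string, not a regex; quote it to avoid word splitting
   grep -F -- "$a" FILE |
   awk 'length > m { m = length; a = $0 } END { print a }'
done < FILE |
sort -u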

But I don't know whether it will cause trouble on large data sets.

【Question comments】:

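What is your question?
This is very vague.
This specific example is easy to solve: sort the list of all words in ascending order. If a word is a substring of the next word, discard the current word; otherwise continue. However, this only covers words that start with one another. For a general solution where word A contains word B anywhere, you will probably have to go quadratic.
I edited my question.
@Poe You should probably remove holes to reduce ambiguity.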

Tags: python ruby perl bash

【Solution 1】:

Use a list comprehension with any/all:

>>> lis = ['word','wor', 'words', 'wormhole']
#all
>>> [x for x in lis if all(x not in y for y in lis if y != x)]
['words', 'wormhole']
#any
>>> [x for x in lis if not any(x in y for y in lis if y != x)]
['words', 'wormhole']

You can also use marisa_trie here (note that this only detects words that are prefixes of other words):

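>>> import marisa_trie
>>> lis = ['word','wor', 'words', 'wormhole', 'hole', 'holes']
>>> def trie(lis):
...     t = marisa_trie.Trie(lis)
...     # keep x only if x itself is the only key starting with x
...     # (Python 3: str is already unicode, so no unicode() call is needed)
...     return [x for x in lis if len(t.keys(x)) == 1]
...
>>> trie(lis)
['words', 'wormhole', 'holes']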

【Comments】:

【Solution 2】:

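Amon's suggestion...

Sort the list of all words in ascending order. If a word is a substring of the next word, discard the current word; otherwise continue.

...requires a sort, which is O(n log n). I'm not sure about the time complexity of Ashwini's solution, but it looks worse than O(n log n).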
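For reference, the sort-based suggestion quoted above could look roughly like this in Python. This is only a sketch: the function name `drop_prefixes` is made up for illustration, and it handles prefix overlaps only. It relies on the fact that, in sorted order, every word starting with a given word sits directly after it.

```python
def drop_prefixes(words):
    """Drop every word that is a prefix of another word (prefix overlaps only)."""
    ordered = sorted(set(words))  # set() also removes exact duplicates
    kept = []
    for index, current in enumerate(ordered):
        has_next = index + 1 < len(ordered)
        # In sorted order, any word that starts with `current` directly follows it.
        if has_next and ordered[index + 1].startswith(current):
            continue
        kept.append(current)
    return kept


print(drop_prefixes(['word', 'wor', 'words', 'wormhole']))
# ['words', 'wormhole']
```

The result comes back in sorted order, not in input order.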

I think this is an O(n) solution...

from collections import defaultdict

words = ['word', 'wor', 'words', 'wormhole']

infinite_defaultdict = lambda: defaultdict(infinite_defaultdict)

mydict = infinite_defaultdict()
for word in words:
    d = mydict
    for char in word:
        d = d[char]

result = []
for word in words:
    d = mydict
    for char in word:
        d = d[char]
    if not d:
        result.append(word)

print(result)

...which prints...

['words', 'wormhole']

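Update

But I don't know whether it will cause trouble on large data sets.

For comparison, using 10,000 words from /usr/share/dict/words, this takes about 70 ms of CPU time, whereas Ashwini's takes about 11 seconds.

Update 2

OK. The original question read as though words could only overlap at the start, but if they can overlap anywhere, this code won't work. I think any algorithm that can do that will have a worst-case complexity of O(n²).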
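To make the limitation from Update 2 concrete, here is the same nested-dict idea wrapped in a function and run on a list that includes hole. The name `prefix_filter` is made up for this sketch. Since hole is not a prefix of anything, the prefix tree has no reason to drop it.

```python
from collections import defaultdict


def make_node():
    # Each node is a dict of child nodes, created on first access.
    return defaultdict(make_node)


def prefix_filter(words):
    """Keep the words whose path in the prefix tree ends at a leaf."""
    root = make_node()
    for word in words:
        node = root
        for char in word:
            node = node[char]
    kept = []
    for word in words:
        node = root
        for char in word:
            node = node[char]
        if not node:  # no children: no other word starts with this one
            kept.append(word)
    return kept


print(prefix_filter(['word', 'wor', 'words', 'wormhole', 'hole']))
# ['words', 'wormhole', 'hole']  <- hole survives although wormhole contains it
```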

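【Comments】:

【Solution 3】:

I understand your question as:

Given a list of words, we want to remove all words that are substrings of other words.

Here is a general Perl solution: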

sub weed_out {
  my @out;
  WORD:
  while (my $current = shift) {
    for (@_) {
      # skip $current word if it's a substring of any other word
      next WORD if -1 != index $_, $current;
    }
    push @out, $current;
  }
  return @out;
}

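Note that we shift from the @_ argument array, so the inner loop gets shorter each time.

If, while running the inner loop, we come across a word that is a substring of the $current word, we can actually remove it with splice: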

  WORD:
  while (my $current = shift) {
    for (my $i = 0; ; $i++) {
      last unless $i <= $#_; # loop condition must be here
      # remove the other word if it's a substring of $current
      splice(@_, $i, 1), redo if -1 != index $current, $_[$i];
      # skip $current word if it's a substring of any other word
      next WORD if -1 != index $_[$i], $current;
    }
    push @out, $current;
  }

But I would rather benchmark that "optimization" first.

If needed, this can easily be embedded in a shell script:

$ perl - <<'END' FILE
my @words = <>;
chomp(@words);
WORD: while (my $current = shift @words) {
  for (@words) {
    # skip $current word if it's a substring of any other word
    next WORD if -1 != index $_, $current;
  }
  print "$current\n";
}
END

【Comments】:

Your first solution fails on ("AB", "A"); the second one generates warnings, and I'm not sure what you are trying to do in it.
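The ("AB", "A") failure from the comment above can be reproduced with a small Python rendering of the same comparison pattern. Both function names are made up for this sketch. Once a word has been shifted off the list, later words are never compared against it, so A is not checked against AB.

```python
def weed_out_later_only(words):
    """Same pattern as the first loop: compare each word only with later words."""
    out = []
    for i, current in enumerate(words):
        if any(current in other for other in words[i + 1:]):
            continue
        out.append(current)
    return out


def weed_out_all_others(words):
    """Compare each word with every other distinct word (quadratic)."""
    unique = list(dict.fromkeys(words))  # drop exact duplicates, keep order
    return [w for w in unique
            if not any(w in other for other in unique if other != w)]


print(weed_out_later_only(["AB", "A"]))  # ['AB', 'A']  <- A should have been dropped
print(weed_out_all_others(["AB", "A"]))  # ['AB']
```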

【Solution 4】:

In Ruby:

list = %w[word wor words wormhole]

list.uniq
.tap{|a| a.reverse_each{|e| a.delete(e) if (a - [e]).any?{|x| x.include?(e)}}}

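【Comments】:

【Solution 5】:

It seems to me that if we sort the words from longest to shortest, we only need to walk the sorted list once, matching each word only against the words kept so far. I'm not good at algorithm analysis, but this makes sense to me and I think the performance would be good. It also seems to work, assuming the order of the kept words doesn't matter: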

words = ['word', 'wor', 'words', 'wormhole', 'hole']
keepers = []

words.sort_by(&:length).reverse.each do |word|
  keepers.push(word) if ! keepers.any?{|keeper| keeper.include?(word)}
end

keepers
# => ["wormhole", "words"]

If the order of the kept words does matter, it would be easy to modify this to account for that. One option is:

words & keepers
# => ["words", "wormhole"]
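For readers following the Python answers, the same longest-first idea might be written like this. It is a sketch, and `keep_longest_first` is a made-up name. A word can only be contained in a word at least as long as itself, so checking against the keepers collected so far is enough.

```python
def keep_longest_first(words):
    """Walk the words from longest to shortest, keeping those not yet covered."""
    keepers = []
    for word in sorted(words, key=len, reverse=True):
        if not any(word in keeper for keeper in keepers):
            keepers.append(word)
    return keepers


words = ['word', 'wor', 'words', 'wormhole', 'hole']
keepers = keep_longest_first(words)
print(keepers)  # ['wormhole', 'words']

# Restore the original order, as `words & keepers` does in Ruby.
kept_set = set(keepers)
print([w for w in words if w in kept_set])  # ['words', 'wormhole']
```

Exact duplicates are dropped as well, because a repeated word is contained in the copy that is already among the keepers.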

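【Comments】:

【Solution 6】:

You can use a hash to count the substrings of the word list (since chop shortens from the end, these are the leading substrings, i.e. the prefixes, of each word):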

use strict;
use warnings;
use feature 'say';

my %seen;                   # seen substrings
my @words;                  # original list
while (<DATA>) {            # read a new substring
    chomp;
    push @words, $_;        # store the original
    while (length) {        # while a substring remains
            $seen{$_}++;    # increase its counter
            chop;           # shorten the substring
    }
}

# All original words with count == 1 are the merged list
my @merged = grep $seen{$_} == 1, @words;

say for @merged;

__DATA__
w
word
wor
words
wormhole
hole
holes

Output:

words
wormhole
holes

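Of course, you will need to compensate for case, punctuation and whitespace, because hash keys are exact and the key Foo is different from the key foo.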
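A rough Python counterpart of the counting idea, with one possible normalization step for the caveat above. This is a sketch under assumptions: `merge_by_prefix_count` is a made-up name, and lowercasing, trimming whitespace and dropping duplicates are choices made here, not part of the Perl answer. Punctuation is left alone.

```python
from collections import Counter


def merge_by_prefix_count(words):
    """Keep the words whose full text was counted exactly once as a prefix."""
    # Assumed normalization: trim whitespace, lowercase, skip empties and duplicates.
    normalized = list(dict.fromkeys(
        w.strip().lower() for w in words if w.strip()
    ))
    seen = Counter()
    for word in normalized:
        for end in range(1, len(word) + 1):
            seen[word[:end]] += 1  # count every prefix, like the chop loop
    return [w for w in normalized if seen[w] == 1]


data = ['w', 'Word', 'wor', 'words', 'wormhole', 'hole', 'holes ']
print(merge_by_prefix_count(data))
# ['words', 'wormhole', 'holes']
```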

【Comments】:

【Solution 7】:

If the word list is long enough, any nested loop over the words will be very slow. This is how I would do it:

use strict;
use warnings;

use File::Slurp 'read_file';
chomp( my @words = read_file('/usr/share/dict/words') );

my %overlapped;
for my $word (@words) {
    $word =~ /(.*)(?{++$overlapped{$1}})(*FAIL)/;
    --$overlapped{$word};
}

print "$_\n" for grep ! $overlapped{$_}, @words;

Darshan Computing's suggestion of processing the words from longest to shortest might improve it.
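The Perl above relies on regex backtracking to enumerate pieces of each word. Assuming the intent is to count every substring, the same idea can be sketched in Python without nesting a loop over the word list. `drop_contained` is a made-up name, and this variant counts how many distinct words contain each substring instead of decrementing the word's own count.

```python
from collections import Counter


def drop_contained(words):
    """Keep the words that occur as a substring of no word but themselves."""
    unique = list(dict.fromkeys(words))  # drop exact duplicates, keep order
    containing = Counter()               # substring -> number of words containing it
    for word in unique:
        substrings = {
            word[i:j]
            for i in range(len(word))
            for j in range(i + 1, len(word) + 1)
        }
        containing.update(substrings)    # each word counts once per substring
    return [w for w in unique if containing[w] == 1]


print(drop_contained(['word', 'wor', 'words', 'wormhole', 'hole']))
# ['words', 'wormhole']
```

The trade-off is memory: each word contributes a number of substrings that grows quadratically with its length, so this fits dictionary-sized words better than long strings.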

【Comments】:

【Solution 8】:

Using awk:

awk '
NR==FNR {
    a[$1]++
    next
} 
{
    for (x in a) { 
        if (index ($1,x) == 0) { 
            a[x] 
        } 
        else { 
            delete a[x]
            a[$1] 
        } 
    }
}
END {
    for (x in a) {
        print x 
    }
}' inputFile inputFile

Test:

inputFile of:

word
wormholes
wor
words
wormhole
hole

Returns:

words
wormholes

【Comments】:

【Solution 9】:

A bash solution:

#!/bin/bash
dict="word wor words wormhole hole "
uniq=()

sort_by_length() {
    for word; do
        printf "%d %s\n" ${#word} "$word"
    done | sort -n | cut -d " " -f2-
}
set -- $(sort_by_length $dict)

while [[ $# -gt 0 ]]; do
    word=$1
    shift
    found=false
    for w;  do
        if [[ $w == *"$word"* ]]; then
            found=true
            break
        fi
    done
    if ! $found; then
        uniq+=($word)
    fi
done

echo "${uniq[@]}"

【Comments】:

【Solution 10】:

A lengthy Perl one-liner:

perl -nE 'chomp;($l,$p)=($_,0); @w=grep{ $p=1 if /$l/; $p|| $l!~/$_/} @w; $p or push @w,$l}{say for @w' file
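Since the one-liner is dense, here is one reading of it as a Python sketch: it keeps a running list of surviving words while consuming the input line by line. The name `streaming_filter` is made up, and this version uses plain substring tests, whereas the Perl interpolates each word into a regex, so regex metacharacters in the input would be treated as pattern syntax there.

```python
def streaming_filter(lines):
    """Process words one at a time, maintaining the list of survivors."""
    kept = []
    for raw in lines:
        line = raw.rstrip("\n")  # like chomp
        # Already covered by a surviving word (or a duplicate of one): skip it.
        if any(line in survivor for survivor in kept):
            continue
        # Otherwise evict every survivor that the new word contains, then keep it.
        kept = [survivor for survivor in kept if survivor not in line]
        kept.append(line)
    return kept


print(streaming_filter(['word\n', 'wor\n', 'words\n', 'wormhole\n', 'hole\n']))
# ['words', 'wormhole']
```

To run it on a file, pass the open file object, for example `streaming_filter(open('file'))`, where `file` is the same placeholder input name as in the one-liner.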

【Comments】: