Read all possible sequential substrings in Python: answers

【Question title】: Read all possible sequential substrings in Python
【Posted】: 2015-01-03 15:43:49
【Question】:

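If I have a list of letters, such as:
word = ['W','I','N','E']
and need to get every possible sequence of substrings, each of length 3 or less, e.g.:
W I N E, WI N E, WI NE, W IN E, WIN E, etc.
what is the most efficient way to go about this?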

Right now, I have:

word = ['W','I','N','E']
for idx,phon in enumerate(word):
    phon_seq = ""
    for p_len in range(3):
        if idx-p_len >= 0:
            phon_seq = " ".join(word[idx-(p_len):idx+1])
            print(phon_seq)

This only gives me the following, rather than the sub-sequences:

W
I
W I
N
I N
W I N
E
N E
I N E

I just can't figure out how to create every possible sequence.

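【Comments】:

Do you need permutations? Or just substrings?
Just substrings, since they need to be consecutive.
Isn't what you're looking for just "WINE" with every possible placement of spaces?
@Reut - the output would be very long. There are some examples in the question.
IIUC, you want something like this, but with at least one split, so that you don't get "WINE" as an output. Right?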

Tags: python

【Solution 1】:

Try this recursive algorithm:

def segment(word):
  def sub(w):
    # an empty remainder has exactly one segmentation: the empty one
    if len(w) == 0:
      yield []
    # take a first chunk of 1 to 3 letters, then segment what is left
    for i in range(1, min(4, len(w) + 1)):
      for s in sub(w[i:]):
        yield [''.join(w[:i])] + s
  return list(sub(word))

# And if you want a list of strings:
def str_segment(word):
  return [' '.join(w) for w in segment(word)]

Output:

>>> segment(word)
[['W', 'I', 'N', 'E'], ['W', 'I', 'NE'], ['W', 'IN', 'E'], ['W', 'INE'], ['WI', 'N', 'E'], ['WI', 'NE'], ['WIN', 'E']]

>>> str_segment(word)
['W I N E', 'W I NE', 'W IN E', 'W INE', 'WI N E', 'WI NE', 'WIN E']
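
For reference, the same recursion can be written as a plain generator with the length cap passed in rather than hard-coded. This is only a sketch under my own naming: `segmentations` and its `max_len` parameter do not appear in the answer above.

```python
def segmentations(letters, max_len=3):
    """Yield every way to cut `letters` into consecutive chunks of at most `max_len` items."""
    if not letters:
        # Nothing left to cut: exactly one (empty) segmentation.
        yield []
        return
    # The first chunk takes 1..max_len items; the rest is segmented recursively.
    for i in range(1, min(max_len, len(letters)) + 1):
        head = ''.join(letters[:i])
        for rest in segmentations(letters[i:], max_len):
            yield [head] + rest


word = ['W', 'I', 'N', 'E']
result = [' '.join(parts) for parts in segmentations(word)]
print(result)
```

For `word = ['W', 'I', 'N', 'E']` this should reproduce the seven strings listed above, in the same order. Because it is a generator, the results can also be consumed one at a time instead of being built into a list first.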

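【Comments】:

@Adam_G I improved the answer; you should use this version instead!
The previous version made an unnecessary recursive call to segment instead of sub; it was practically a typo.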

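【Solution 2】:

Since at each of the three positions (after W, after I and after N) there either is a space or there isn't, you can think of this as the bits of a binary number being 1 or 0, with the number ranging from 1 to 2^3 - 1.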

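input_word = "WINE"
# Each number from 1 to 2 ** (n - 1) - 1 encodes one way of placing spaces:
# bit `position` set means "put a space after the letter at `position`".
for variation_number in range(1, 2 ** (len(input_word) - 1)):
    output = ''
    for position, letter in enumerate(input_word):
        output += letter
        if (variation_number >> position) & 1:
            output += ' '
    print(output)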

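Edit: to include only variations whose pieces are 3 characters or shorter (in the general case, input_word may be longer than 4 characters), we can exclude the cases where the binary representation contains three consecutive zeros. (We also start the range at a larger number, to exclude the cases whose representation begins with 000.)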

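# Start at 2 ** (n - 4) so that the three highest bits are never all zero;
# max(..., 0) keeps the exponent non-negative for words shorter than 4 letters.
for variation_number in range(2 ** max(len(input_word) - 4, 0), 2 ** (len(input_word) - 1)):
    # Three consecutive zero bits mean four letters with no space between them.
    if '000' not in bin(variation_number):
        output = ''
        for position, letter in enumerate(input_word):
            output += letter
            if (variation_number >> position) & 1:
                output += ' '
        print(output)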
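
An alternative to testing the binary string is to decode each bit pattern into its pieces and check their lengths directly. The sketch below is an illustration only: the helper names `split_by_mask` and `masked_segmentations` and the `max_len` parameter are hypothetical and not part of this answer.

```python
def split_by_mask(text, mask):
    """Cut `text` after position p whenever bit p of `mask` is set."""
    parts, start = [], 0
    for pos in range(len(text) - 1):
        if (mask >> pos) & 1:
            parts.append(text[start:pos + 1])
            start = pos + 1
    parts.append(text[start:])
    return parts


def masked_segmentations(text, max_len=3):
    """Try every bit pattern and keep those whose pieces are all short enough."""
    gaps = max(len(text) - 1, 0)
    for mask in range(2 ** gaps):
        parts = split_by_mask(text, mask)
        if all(len(p) <= max_len for p in parts):
            yield ' '.join(parts)


swine = list(masked_segmentations("SWINE"))
print(len(swine))
print(swine)
```

With "SWINE" this leaves 13 of the 16 bit patterns, and neither "SWIN E" nor "S WINE" is among them. The length check costs more per pattern than the `'000'` test, but it needs no special handling of leading zeros or short words.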

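【Comments】:

Unfortunately, this algorithm doesn't generalize to longer input words, because it prints substrings of length 4 and above. Try input_word = "SWINE" to see what I mean.
@irrelephant It works for me with SWINE. Not sure what you mean by substring length.
When I try it, I get the line SWIN E; SWIN isn't of length 3 or less. Part of the OP's question restricts the maximum substring length to 3.
The comment on my post makes me think he didn't really mean that the code should limit the substring length. This is my favorite solution, since it abstracts the problem nicely.
@irrelephant OK, I see what you mean. Based on my exchange with the OP in the comments, it is generalized to list all substrings of input_word, regardless of length.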

【Solution 3】:

My implementation for this problem.

#!/usr/bin/env python

# this is a problem of fitting partitions in the word
# we'll use itertools to generate these partitions
import itertools

word = 'WINE'

# this loop generates all possible partitions COUNTS (up to word length)
for partitions_count in range(1, len(word)+1):
    # this loop generates all possible combinations based on count
    for partitions in itertools.combinations(range(1, len(word)), r=partitions_count):

        # because of the way python splits words, we only care about the
        # difference *between* partitions, and not their distance from the
        # word's beginning
        diffs = list(partitions)
        for i in range(len(partitions)-1):
            diffs[i+1] -= partitions[i]

        # first, the whole word is up for taking by partitions
        splits = [word]

        # partition the word's remainder (what was not already "taken")
        # with each partition
        for p in diffs:
            remainder = splits.pop()
            splits.append(remainder[:p])
            splits.append(remainder[p:])

        # print the result
        print(splits)
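
The same cut-point idea can be written more compactly by slicing between consecutive boundaries. This is a sketch, not the answerer's code: `cut_segmentations` is a name I made up, and the length-3 filter at the end is an addition that the answer above does not apply.

```python
from itertools import combinations


def cut_segmentations(text):
    """Yield every split of `text` into two or more consecutive pieces."""
    for count in range(1, len(text)):
        # Choose `count` cut positions among the gaps between letters.
        for cuts in combinations(range(1, len(text)), count):
            bounds = (0,) + cuts + (len(text),)
            # Slice between each pair of neighbouring boundaries.
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]


all_splits = list(cut_segmentations("WINE"))
# Optional: keep only splits whose pieces have at most 3 letters.
short_splits = [s for s in all_splits if max(map(len, s)) <= 3]
print(all_splits)
```

For "WINE" every split already has pieces of 3 letters or fewer, so both lists hold the same 7 entries; the filter only starts to matter for longer words.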

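【Comments】:

I'm not sure this is right. This gives permutations, but I'm looking for all substrings. In other words, every possible way to slice up the word.
This does give all the substrings and returns the same results as my answer (except as lists rather than strings).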

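【Solution 4】:

As an alternative answer, you can use the itertools module: group the list with the groupby function, use combinations to build a list of index pairs for the grouping key, (i<=word.index(x)<=j), and finally use set to get the unique results.

Also note that this approach first gives you the unique combinations of index pairs. When you have pairs like (i1,j1) and (i2,j2) with i1==0, j2==3 and j1==i2, such as (0,2) and (2,3), the slices produce the same result and you need to remove one of them.

As a single list comprehension: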

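from itertools import combinations, groupby

# note: word.index(x) returns the first match, so this assumes no repeated letters
subs=[[''.join(i) for i in j] for j in [[list(g) for k,g in groupby(word,lambda x: i<=word.index(x)<=j)] for i,j in list(combinations(range(len(word)),2))]]
set([' '.join(j) for j in subs]) # set(['WIN E', 'W IN E', 'W INE', 'WI NE', 'WINE'])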

Step-by-step demo:

>>> cl=list(combinations(range(len(word)),2))
>>> cl
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

>>> new_l=[[list(g) for k,g in groupby(word,lambda x: i<=word.index(x)<=j)] for i,j in cl]
>>> new_l
[[['W', 'I'], ['N', 'E']], [['W', 'I', 'N'], ['E']], [['W', 'I', 'N', 'E']], [['W'], ['I', 'N'], ['E']], [['W'], ['I', 'N', 'E']], [['W', 'I'], ['N', 'E']]]
>>> last=[[''.join(i) for i in j] for j in new_l]
>>> last
[['WI', 'NE'], ['WIN', 'E'], ['WINE'], ['W', 'IN', 'E'], ['W', 'INE'], ['WI', 'NE']]
>>> set([' '.join(j) for j in last])
set(['WIN E', 'W IN E', 'W INE', 'WI NE', 'WINE'])
>>> for i in set([' '.join(j) for j in last]):
...  print i
... 
WIN E
W IN E
W INE
WI NE
WINE
>>>

【Comments】:

This doesn't work - there should be 7 possible output combinations.
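
The figure of 7 can be checked without enumerating anything: the last chunk has 1, 2 or 3 letters, so the number of segmentations of n letters is the sum of the counts for n-1, n-2 and n-3 letters. A minimal sketch, where `count_segmentations` is a hypothetical helper that does not come from this thread:

```python
def count_segmentations(n, max_len=3):
    """Number of ways to cut n letters into consecutive chunks of at most max_len letters."""
    # ways[k] = number of segmentations of the first k letters; ways[0] = 1 (the empty split)
    ways = [1] + [0] * n
    for k in range(1, n + 1):
        # The last chunk can have any length from 1 to max_len, as long as it fits.
        ways[k] = sum(ways[k - size] for size in range(1, min(max_len, k) + 1))
    return ways[n]


print(count_segmentations(4))  # 7
```

Note that for words of 3 letters or fewer the count includes the unsplit word itself, since that is then a valid single chunk.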

【Solution 5】:

I think it could be something like this: word = "ABCDE", myList = []

# word and myList are assumed to be defined as above
for i in range(1, len(word) + 1):
    # the prefix word[:i]
    myList.append(word[:i])

    # every other substring that ends at position i
    for j in range(1, i):
        myList.append(word[j:i])

print(myList)
# duplicates removed, original order kept
print(sorted(set(myList), key=myList.index))

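【Comments】:

Formatting suggestion: I guess word and myList should be in the code block. Also, could you explain what this answer offers that the other replies don't?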