Overcoming MemoryError / Slow Runtime in the Ashton String task: answers

【Question title】: Overcoming MemoryError / Slow Runtime in Ashton String task
【Posted】: 2016-04-04 12:33:21
【Question description】:

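In the Ashton String task, the goal is to:

Arrange all the distinct substrings of a given string in lexicographical order and concatenate them. Print the Kth character of the concatenated string. It is assured that the given value of K will be valid, i.e. there will be a Kth character.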

Input Format:

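The first line will contain a number T, i.e. the number of test cases. The first line of each test case will contain a string consisting of characters (a−z) and the second line will contain a number K.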

Output Format:

Print the Kth character (the string is 1-indexed).

Constraints:

1 ≤ T ≤ 5
1 ≤ length ≤ 10^5
K will be an appropriate integer.

For example, given the input:

1
dbac
3

the output will be: c
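
To make the example concrete, here is a minimal brute-force sketch of the task as stated. The function name kth_char_bruteforce is hypothetical, and the approach is only reasonable for short strings because it builds every substring in memory.

```python
def kth_char_bruteforce(s, k):
    """Return the k-th (1-indexed) character of the concatenation of all
    distinct substrings of s taken in lexicographical order."""
    # A set keeps each distinct substring exactly once.
    substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    # Sort, concatenate, then convert the 1-based k to a 0-based index.
    return "".join(sorted(substrings))[k - 1]


# Sorted distinct substrings of 'dbac': a, ac, b, ba, bac, c, d, db, dba, dbac
print(kth_char_bruteforce("dbac", 3))  # c
```

The concatenation starts with "a" + "ac", so the third character is the "c" of "ac".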

I've attempted the task with this code, and it works for relatively short strings:

from itertools import chain

def zipngram(text, n=2):
    words = list(text)
    return zip(*[words[i:] for i in range(n)])

for _ in range(int(input())): # T test cases
    t = input()
    position = int(input())-1 # 0th indexing
    chargrams = chain(*[zipngram(t,i) for i in range(1,len(t)+1)])
    concatstr = ''.join(sorted([''.join(s) for s in chargrams]))
    print (concatstr[position])

But if the input file looks like this: http://pastebin.com/raw/WEi2p09H and the desired output is:

l
s
y
h
s

the interpreter throws a MemoryError:

Traceback (most recent call last):
  File "solution.py", line 11, in <module>
    chargrams = chain(*[zipngram(t,i) for i in range(1,len(t)+1)])
  File "solution.py", line 11, in <listcomp>
    chargrams = chain(*[zipngram(t,i) for i in range(1,len(t)+1)])
  File "solution.py", line 6, in zipngram
    return zip(*[words[i:] for i in range(n)])
  File "solution.py", line 6, in <listcomp>
    return zip(*[words[i:] for i in range(n)])
MemoryError

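How can the MemoryError be resolved? Can the task be solved in another way using native python2 or python3?

I tried to resolve the MemoryError by pruning the stack with heapq, but now, instead of taking up too much memory, it runs extremely slowly while pushing to and popping from the heap.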

from itertools import chain
import heapq

t = int(input())
s = list(input())
k = int(input())

stack = []
for n in range(1,len(s)+1):
    for j in range(n):
        ngram = (''.join(s[j:]))
        ngram_len = len(ngram)
        # Push the ngram into the heapq.
        heapq.heappush(stack, ngram)
        pruned_stack = []
        temp_k = 0
        # Prune stack.
        while stack != [] and temp_k < k:
            x = heapq.heappop(stack)
            pruned_stack.append(x)
            temp_k+=len(x)
        # Replace stack with pruned_stack.
        stack = pruned_stack

print (''.join(chain(*pruned_stack))[k])

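Is there a way to strike a balance between using so much memory that it leads to a MemoryError and a runtime that is too slow because of the heapq pushing and popping?

【Question comments】:

Tags: python string out-of-memory n-gram

【Solution 1】:

Try this code; it works on the large sample.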

def ashton(string, k):
    #We need all the substrings, and they have to be sorted
    sortedSubstrings = sorted_substrings(string)
    count = 0
    currentSubstring = 0
    #Loop through the substrings, until we reach the kth character
    while (count < k):
        substringLen = len(sortedSubstrings[currentSubstring])
        #add the number of characters of the substring to our counter
        count += substringLen
        #advance the current substring by one
        currentSubstring += 1
    #We have the correct substring now, and calculate to get the right char
    diff = count - k
    #currentSubstring was advanced once past the match, so step back by one
    #Return answer, index 1 = substring, index 2 = char in substring
    return(sortedSubstrings[currentSubstring - 1][substringLen-diff-1])

#Determine the substrings in correct order
#Input: 'dbac', returns: a, ac, b, ba, bac, c, d, db, dba, dbac
def sorted_substrings(string):
    a = set()
    length = len(string)
    #loop through the string to get the substrings
    for i in range(length):
        for j in range(i + 1, length + 1):
            #add each substring to our set
            a.add(string[i:j]) 
    #we need the set to be sorted
    a = sorted(a)
    return a

t = int(input())
for i in range(t):
    s = input()
    k = int(input())
    print(ashton(s, k))
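
For strings near the upper length constraint, a string of length n has up to n(n+1)/2 substrings, so storing all of them can itself become the bottleneck. One commonly used alternative, which is not part of the answer above, is to sort only the suffixes (a suffix array) and use the longest-common-prefix (LCP) array to skip prefixes that were already counted. The sketch below assumes this technique; the function names suffix_array, lcp_array and kth_char are hypothetical, and the pure-Python prefix-doubling sort may still need tuning to fit a judge's time limit.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographical order
    (prefix doubling: sort by the first 2, 4, 8, ... characters)."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    k = 1
    while n > 1:
        def key(i):
            # Rank of the first k chars, then rank of the next k chars.
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) != key(b))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all suffixes are now distinguished
            break
        k *= 2
    return sa


def lcp_array(s, sa):
    """lcp[p] = length of the common prefix of suffixes sa[p-1] and sa[p]
    (Kasai's algorithm); lcp[0] is 0."""
    n = len(s)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def kth_char(s, k):
    """k-th (1-indexed) character of the sorted, concatenated distinct substrings."""
    n = len(s)
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    for pos, start in enumerate(sa):
        longest = n - start        # the whole suffix
        shortest = lcp[pos] + 1    # shorter prefixes were already counted
        # Total characters in the new prefixes of lengths shortest..longest.
        total = (shortest + longest) * (longest - shortest + 1) // 2
        if k > total:
            k -= total
            continue
        for length in range(shortest, longest + 1):
            if k <= length:
                return s[start + k - 1]
            k -= length
    raise ValueError("k exceeds the length of the concatenated string")


print(kth_char("dbac", 3))  # c
```

Only integer arrays of length n are kept in memory, and the inner loop over lengths runs for a single suffix, the one that contains the answer.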

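【Comments】:

Could you try it with this input: pastebin.com/raw/WEi2p09H? Does it also give a MemoryError?
@alvas Please try my code; it does not run into a memory error and it returns the correct result.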
Patience, you must have. Come, they will, the votes, as the bounty nears its end.
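Also, a little explanation of sortedSubstrings = sorted(set([string[x:y] for x in range(length) for y in range(length) if string[x:y]])), by unrolling the loop, would easily earn you votes =)
@alvas, I have now rewritten that complicated line as its own function, which makes it easier to read. The sorted_substrings function puts all the substrings in lexicographical order, so for 'dbac' it returns the sorted list: a, ac, b... Once we have the sorted substrings, the while loop checks count against k, incrementing count as we look at each substring. So in the simple test case with k=3, we first look at 'a', which increases count to 1. Then 'ac', which increases count to 3. Now count equals k and we exit the loop.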

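【Solution 2】:

A MemoryError means the program consumed all available memory and crashed as a result.

One possible solution is to use lazy iterables, which compute values only as they are needed rather than all at once. (They also work in Py2, but Py3 has better support for them.)

Adapting your program to generators should need only minor changes. To index a generator without first building a list (which would cancel out the benefit of laziness), see: Get the nth item of a generator in Python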
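
As a rough illustration of that idea, the sketch below picks the nth item of a lazily generated stream with itertools.islice, following the nth recipe from the itertools documentation. The ngrams helper is a hypothetical stand-in for the question's zipngram. Laziness only helps while generating: sorting the substrings into lexicographical order still needs all of them at once, so this does not solve the task by itself.

```python
from itertools import chain, islice


def nth(iterable, n, default=None):
    """Return the item at 0-based position n without building a list."""
    return next(islice(iterable, n, None), default)


def ngrams(text, n):
    """Lazily yield every length-n substring of text."""
    return (text[i:i + n] for i in range(len(text) - n + 1))


text = "dbac"
# Generation order: d, b, a, c, db, ba, ac, dba, bac, dbac
all_ngrams = chain.from_iterable(ngrams(text, n) for n in range(1, len(text) + 1))
print(nth(all_ngrams, 4))  # db
```

Only one substring is alive at a time here, so memory stays flat no matter how long the stream is.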

【Comments】: