If you want a pure iterator solution that handles large strings with constant memory usage:
import itertools
import re

text = "the quick brown fox"
# Two independent lazy token streams over the same string
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", text))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", text))
next(input_iter2, None)  # skip the first token (the default avoids StopIteration on empty input)
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']
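As a variation on the snippet above, the two `re.finditer` passes can be collapsed into a single pass with `itertools.tee`. This is only a sketch, and the helper name `bigrams_iter` is made up for illustration. Because the second stream stays exactly one token ahead, `tee` should only need to buffer a single token at a time.

```python
import itertools
import re
from typing import Iterator


def bigrams_iter(text: str, token_regex: str = r"\S+") -> Iterator[str]:
    """Lazily yield space-joined bigrams (hypothetical helper, not from the answer)."""
    # Single lazy pass over the string
    tokens = (m.group(0) for m in re.finditer(token_regex, text))
    # Duplicate the stream, then shift the second copy by one token
    first, second = itertools.tee(tokens)
    next(second, None)  # the default avoids StopIteration on empty input
    return (f"{a} {b}" for a, b in zip(first, second))


print(list(bigrams_iter("the quick brown fox")))
# ['the quick', 'quick brown', 'brown fox']
```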
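If you can spare roughly 3x the string's size in extra memory to store the split() result and the doubled output as lists, it is probably faster and simpler to skip itertools: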
inputs = "the quick brown fox".split()  # no argument: split on any whitespace run, like the regex above
output = [f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs) - 1)]
# ['the quick', 'quick brown', 'brown fox']
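The list-based version also extends to other window sizes by zipping shifted slices of the token list. The sketch below assumes plain whitespace tokenization, and `ngrams_list` is a hypothetical name. Each slice copies part of the list, so memory use grows with the window size.

```python
from typing import List


def ngrams_list(text: str, n: int) -> List[str]:
    """Return all space-joined n-grams of `text` as a list (hypothetical helper)."""
    tokens = text.split()
    # zip stops at the shortest slice, so only complete windows are produced
    return [" ".join(window) for window in zip(*(tokens[i:] for i in range(n)))]


print(ngrams_list("the quick brown fox", 3))
# ['the quick brown', 'quick brown fox']
```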
Update
A general solution that supports any n-gram size:
import itertools
import re
from typing import Iterable

def ngrams_iter(text: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    # One independent lazy token stream per position in the n-gram
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, text))
        for n in range(ngram_size)
    ]
    # Skip first words: iterator k ends up advanced by k tokens
    for n in range(1, ngram_size):
        list(map(next, input_iters[n:]))
    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
Test:
text = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(text, 5))
Output:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']
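The general solution above runs the regex over the string once per n-gram position. If that matters, one possible alternative is a sliding window backed by `collections.deque`, which scans the string a single time and holds only `ngram_size` tokens. This is a sketch rather than a drop-in replacement, and `ngrams_window` is a hypothetical name.

```python
import itertools
import re
from collections import deque
from typing import Iterator


def ngrams_window(text: str, ngram_size: int, token_regex: str = r"\S+") -> Iterator[str]:
    """Lazily yield n-grams using a fixed-size window (hypothetical helper)."""
    # Generator function: this check runs when the first item is requested
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    # A deque with maxlen drops the oldest token automatically on append
    window = deque(maxlen=ngram_size)
    for match in re.finditer(token_regex, text):
        window.append(match.group(0))
        if len(window) == ngram_size:
            yield " ".join(window)


# Because the result is lazy, islice can take a prefix without tokenizing the rest
first_two = list(itertools.islice(ngrams_window("a b c d e", 3), 2))
print(first_two)
# ['a b c', 'b c d']
```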
You may also find this related question useful: n-grams in python, four, five, six grams?