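Here's some code that does a recursive brute-force search. It puts the word list into a set, so lookups are very fast: the examples below run in under 1 second on my old 2 GHz machine with 2GB of RAM. However, splitting sequences longer than the examples I've used will certainly take longer, mainly because there are so many possible combinations. To weed out the nonsensical results you'll either need to do it by hand or use software that can do natural language processing.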
#!/usr/bin/env python3
''' Separate words
Use dictionary lookups to recursively split a string into separate words
See http://stackoverflow.com/q/41241216/4014959
Written by PM 2Ring 2016.12.21
'''
# Sowpods wordlist from http://www.3zsoftware.com/download/
fname = 'scrabble_wordlist_sowpods.txt'
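# Seed the set with the one-letter words 'A' and 'I'
allwords = set('AI')
with open(fname) as f:
    for w in f:
        allwords.add(w.strip())

def parse(data, result=None):
    ''' Recursively yield every split of data into dictionary words '''
    if result is None:
        result = []
    if data in allwords:
        # What remains is itself a word, so this split is complete.
        # Words were collected right-to-left, so reverse them.
        result.append(data)
        yield result[::-1]
    else:
        # Peel a known word off the right-hand end and recurse on the prefix
        for i in range(1, len(data)):
            first, last = data[:i], data[i:]
            if last in allwords:
                yield from parse(first, result + [last])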
# Test
data = (
'HELLOHOWAREYOU',
'THISEXAMPLEWORKSWELL',
'ISTHEREAFASTWAY',
'ONE',
'TWOWORDS',
)
for s in data:
    print(s)
    for u in parse(s):
        print(u)
    print('')
Output
HELLOHOWAREYOU
['HELL', 'OHO', 'WARE', 'YOU']
['HELLO', 'HO', 'WARE', 'YOU']
['HELLO', 'HOW', 'ARE', 'YOU']
['HELL', 'OH', 'OW', 'ARE', 'YOU']
['HELLO', 'HOW', 'A', 'RE', 'YOU']
['HELL', 'OH', 'OW', 'A', 'RE', 'YOU']
THISEXAMPLEWORKSWELL
['THIS', 'EXAMPLE', 'WORK', 'SWELL']
['THIS', 'EX', 'AMPLE', 'WORK', 'SWELL']
['THIS', 'EXAMPLE', 'WORKS', 'WELL']
['THIS', 'EX', 'AMPLE', 'WORKS', 'WELL']
ISTHEREAFASTWAY
['I', 'ST', 'HER', 'EA', 'FAS', 'TWAY']
['IS', 'THERE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HERE', 'A', 'FAS', 'TWAY']
['IS', 'THE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HER', 'EA', 'FAST', 'WAY']
['IS', 'THERE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FAST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HER', 'EA', 'FA', 'ST', 'WAY']
['IS', 'THERE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FA', 'ST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FA', 'ST', 'WAY']
ONE
['ONE']
TWOWORDS
['TWO', 'WORDS']
This code was written for Python 3, but you can make it run on Python 2 by changing
yield from parse(first, result + [last])
to
for seq in parse(first, result + [last]):
    yield seq
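As a quick check that the two forms behave the same for plain iteration, here is a minimal sketch. The generator names `inner`, `delegating`, and `looping` are made up for illustration and are not part of the script above.

```python
def inner(n):
    # Yield n, n-1, ..., 1
    while n > 0:
        yield n
        n -= 1

def delegating(n):
    # Python 3.3+ form: delegate to the sub-generator
    yield from inner(n)

def looping(n):
    # Equivalent form for simple iteration; also runs on Python 2
    for seq in inner(n):
        yield seq

# Both generators produce the same sequence of values
print(list(delegating(3)) == list(looping(3)))
```

`yield from` also forwards `send()` and `throw()` to the sub-generator, but `parse` uses neither, so the explicit loop is enough here.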
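Incidentally, we can sort the output lists by length, i.e., by the number of words in each list. This tends to put the more sensible results near the top.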
for s in data:
    print(s)
    for u in sorted(parse(s), key=len):
        print(u)
    print('')
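Regarding the slowdown on longer inputs mentioned at the top: part of the cost is that the same prefix gets re-parsed once for every different tail that follows it. One possible way to cut that repeated work, sketched here as an assumption rather than something I've benchmarked, is to cache the splits of each prefix with `functools.lru_cache`. The function name `splits` and the tiny `words` set are hypothetical stand-ins so that the sketch runs without the Sowpods file. It still returns every combination, so the output itself can remain very large.

```python
from functools import lru_cache

# Hypothetical tiny word set so the sketch runs without the Sowpods file
words = frozenset(['A', 'ARE', 'HELLO', 'HOW', 'RE', 'YOU'])

@lru_cache(maxsize=None)
def splits(data):
    ''' Return a tuple of all splits of data into known words '''
    if data in words:
        # Same rule as parse(): a prefix that is a word is not split further
        return ((data,),)
    found = []
    for i in range(1, len(data)):
        first, last = data[:i], data[i:]
        if last in words:
            # Extend each cached split of the prefix with this last word
            found.extend(head + (last,) for head in splits(first))
    return tuple(found)

# Shortest splits first, as in the sorted loop above
for seq in sorted(splits('HELLOHOWAREYOU'), key=len):
    print(list(seq))
```

Results are tuples rather than lists so that they can be cached and shared safely between calls.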