Dictionary containing words from a text file as keys with a list of all the next words as values

【Question title】: Dictionary containing words from a text file as keys with a list of all the next words as values
【Posted】: 2017-07-24 22:40:56
【Question】:

I can't figure out the logic behind this.

1.) First, I read in the following text file, with punctuation and extra whitespace removed:

The sun is bright & the moon glows.
The dog barks while the cat meows.
My dog is dark, dark as crows.

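2.) After reading this text file, I'm supposed to end up with a dictionary that has each word as a key and a list of the NEXT words as its value, like so: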

{'the':['sun','moon','dog','cat'], 'sun':['is'], 'is':['bright','dark'], 'moon':['glows'],'glows':['the'], 
 'dog':['barks','is'], 'barks':['while'],'while':['the'], 'cat':['meows'],'meows':['my'], 'my':['dog'], 
 'dark':['dark','as'], 'as':['crows'], 
 'bright':['the'], 'crows':[]}

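The last two entries are special cases. "bright" is followed by "the" because the "&" between them is dropped, and "crows" has an empty list because it is the last word in the text file.

I'm not sure of the logic behind this, and I can't seem to wrap my head around it.

My first approach was to create one huge list of all the words and then pick and pull from that list to form several smaller lists.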

【Comments】:

Tags: python list file dictionary

【Solution 1】:

You can read the file using open and .read:

with open(filename, 'r') as f:
    astr = f.read()

First you need to normalize your input. That means lowercasing the string and removing the characters you want to ignore:

# lowercase the string
astr = astr.lower()

# remove to-be-ignored characters
for badchar in '&,.':
    astr = astr.replace(badchar, '')
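If the file may contain punctuation other than `&`, `,` and `.`, one option is to delete every ASCII punctuation character in a single pass. This is a minimal sketch, not part of the answer above; the helper name `normalize` is an assumption made for illustration.

```python
import string

def normalize(text):
    """Lowercase text and delete every ASCII punctuation character."""
    # With three arguments, str.maketrans maps each character of the
    # third argument to None, which makes translate() delete it.
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table)

print(normalize("The sun is bright & the moon glows."))
```

Removing the `&` leaves two spaces in a row, which is harmless because `split()` without arguments treats any run of whitespace as one separator. Note that this also strips apostrophes, so "don't" would become "dont".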

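The next step is to split the input on whitespace, then take each word together with the word after it and append that next word to the dictionary entry for the current word.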

result = {}

words = astr.split()
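# Iterate only up to len(words) - 1, because the last word in the
# text has no next word.
for i in range(len(words) - 1):
    result.setdefault(words[i], []).append(words[i+1])
# Give the last word a key too (an empty list if it never appeared
# earlier); the guard avoids an IndexError on an empty file.
if words:
    result.setdefault(words[-1], [])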
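The steps above can be combined into one function. The following is a sketch under stated assumptions: the function name `build_next_words`, its `ignore` parameter, and the inline `sample` string (standing in for the file contents) are all illustrative and not from the original answer.

```python
def build_next_words(text, ignore='&,.'):
    """Map each word in text to the list of words that follow it."""
    # Normalize: lowercase and drop the characters to be ignored.
    text = text.lower()
    for badchar in ignore:
        text = text.replace(badchar, '')
    words = text.split()

    result = {}
    # Stop one short of the end: the last word has no next word.
    for i in range(len(words) - 1):
        result.setdefault(words[i], []).append(words[i + 1])
    if words:
        # Make sure the last word is a key without discarding any
        # followers it collected from earlier occurrences.
        result.setdefault(words[-1], [])
    return result

# Stand-in for the contents of the text file (assumption).
sample = (
    "The sun is bright & the moon glows.\n"
    "The dog barks while the cat meows.\n"
    "My dog is dark, dark as crows.\n"
)
mapping = build_next_words(sample)
print(mapping['the'])    # ['sun', 'moon', 'dog', 'cat']
print(mapping['crows'])  # []
```

Using `setdefault` for the last word, rather than assigning an empty list, matters when the text ends with a word that also occurs earlier.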

This gives:

from pprint import pprint
pprint(result)  # pprint sorts the keys and wraps long output
{'as': ['crows'],
 'barks': ['while'],
 'bright': ['the'],
 'cat': ['meows'],
 'crows': [],
 'dark': ['dark', 'as'],
 'dog': ['barks', 'is'],
 'glows': ['the'],
 'is': ['bright', 'dark'],
 'meows': ['my'],
 'moon': ['glows'],
 'my': ['dog'],
 'sun': ['is'],
 'the': ['sun', 'moon', 'dog', 'cat'],
 'while': ['the']}

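【Comments】:

【Solution 2】:

You can chain a few string transformations to get rid of the punctuation, then split the string after converting it to lowercase (The vs the).

Then, zip the list of words with a shifted copy of the same list and iterate over the resulting pairs.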
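To make the pairing concrete, here is a small illustration on the first four words of the text (the variable names are chosen for this example only):

```python
words = ['the', 'sun', 'is', 'bright']

# zip stops at the shorter input, so the final word never shows up
# as the first element of a pair.
pairs = list(zip(words, words[1:]))
print(pairs)  # [('the', 'sun'), ('sun', 'is'), ('is', 'bright')]
```

Each pair is (current word, next word), and the last word of the list only ever appears in the second position.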

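Append each next word to the dictionary entry for the current word, so that the key is the current word and the value is the list of words that follow it. The catch is that crows never appears as a current word, so it does not become a key. Add the last word manually.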

from collections import defaultdict
import string

s = "The sun is bright & the moon glows. The dog barks while the cat meows. My dog is dark, dark as crows."
s = s.translate({ord(x):None for x in string.punctuation}).lower().split()

c = defaultdict(list)

for cw,nw in zip(s,s[1:]):
    c[cw].append(nw)

# last word of the text, special case; setdefault keeps any
# followers it already has if the word also appears earlier
c.setdefault(s[-1], [])
print(c)

Result:

defaultdict(<class 'list'>, {'is': ['bright', 'dark'], 'moon': ['glows'],
 'cat': ['meows'], 'glows': ['the'], 'meows': ['my'], 'crows': [],
 'bright': ['the'], 'while': ['the'], 'the': ['sun', 'moon', 'dog', 'cat'],
 'as': ['crows'], 'dog': ['barks', 'is'], 'sun': ['is'], 'my': ['dog'],
 'dark': ['dark', 'as'], 'barks': ['while']})
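The printed result carries the `defaultdict(<class 'list'>, ...)` wrapper. If a plain dictionary is preferred for printing or lookups, it can be copied with `dict()`. A small sketch with a made-up one-entry mapping (the word 'owl' is a hypothetical missing key):

```python
from collections import defaultdict

c = defaultdict(list)
c['dark'].append('as')

plain = dict(c)          # plain dict copy, prints without the wrapper
print(plain)             # {'dark': ['as']}
print(plain.get('owl'))  # None, and 'owl' is not added to plain

# Indexing a defaultdict with a missing key inserts that key
# as a side effect.
c['owl']
print('owl' in c)        # True
```

The copy is also safer to hand to other code, since looking up an unknown word in it does not silently create a new empty entry.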

【Comments】:

This looks like homework, but it's a cool problem. Not sure the teacher will believe a student wrote it this way :)