nltk doesn't insert end of sentence symbols while generating trigrams

【Question title】: nltk doesn't insert end of sentence symbols while generating trigrams
【Posted】: 2021-01-27 10:25:52
【Question】:

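I'm generating text from The Hobbit using Kneser-Ney smoothing. My model produces sentences, but I believe there is room for improvement.

Currently I'm not using symbols to mark the beginning and end of sentences. When I try to insert them with the code below, only the start-of-sentence symbols for the first sentence appear; for the remaining sentences the symbols are not inserted. It's as if the end of a sentence is never detected.

I tried not converting the text to lowercase, but that changed nothing.

Could you tell me how to insert the end-of-sentence symbols?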

from nltk import ngrams, word_tokenize
from nltk.corpus import stopwords

with open("hobbit.txt") as f:
    hobbit_text = f.read()

hobbit_text = word_tokenize(hobbit_text.lower())

stop_words = stopwords.words('english')
personal_names = ['legolas', 'gimli', 'boromir', 'frodo', 'thorin', 'thror', 'gandalf', 'smeagol', 'gollum', 'balin', 'elrond','aragorn','bilbo', 'sauron']
signs = ['”','“', '!', '?', '’', '`', "'", '``', ',', ";", "(", ")"]

use_stop_words = True
use_punctuation = False
# get rid of stop words, punctuation (if necessary)
if not use_stop_words:
    hobbit_text = [x for x in hobbit_text if x not in stop_words]
if not use_punctuation:
    hobbit_text = [x for x in hobbit_text if x not in signs]

vocab = set(hobbit_text)

counter = 0
hobbit_trigram = ngrams(hobbit_text, 3, pad_left=True, pad_right=True, left_pad_symbol='BOS', right_pad_symbol='EOS')

for a in hobbit_trigram:
    print(a)
    counter += 1
    if counter == 100:
        break

The output for the first sentence is shown below. I expected an end-of-sentence symbol after the word "gold".

('BOS', 'BOS', 'the')
('BOS', 'the', 'hobbit')
('the', 'hobbit', 'or')
('hobbit', 'or', 'there')
('or', 'there', 'and')
('there', 'and', 'back')
('and', 'back', 'again')
('back', 'again', 'j.r.r')
('again', 'j.r.r', '.')
('j.r.r', '.', 'tolkien')
('.', 'tolkien', 'the')
('tolkien', 'the', 'hobbit')
('the', 'hobbit', 'is')
('hobbit', 'is', 'a')
('is', 'a', 'tale')
('a', 'tale', 'of')
('tale', 'of', 'high')
('of', 'high', 'adventure')
('high', 'adventure', 'undertaken')
('adventure', 'undertaken', 'by')
('undertaken', 'by', 'a')
('by', 'a', 'company')
('a', 'company', 'of')
('company', 'of', 'dwarves')
('of', 'dwarves', 'in')
('dwarves', 'in', 'search')
('in', 'search', 'of')
('search', 'of', 'dragon-guarded')
('of', 'dragon-guarded', 'gold')
('dragon-guarded', 'gold', '.')
('gold', '.', 'a')

【Comments】:

Tags: python n-gram trigram

【Solution 1】:

Try something like this:

from functools import partial
from nltk import ngrams

padded_ngrams = partial(ngrams, pad_left=True, pad_right=True, left_pad_symbol='BOS', right_pad_symbol='EOS')

padded_hobbit_text = list(padded_ngrams(hobbit_text, 3))

# now print your value to see if it's what you want
print(padded_hobbit_text)

# with an input of "TEXT", it gave me the following output
'''
[('BOS', 'BOS', 'T'),
 ('BOS', 'T', 'E'),
 ('T', 'E', 'X'),
 ('E', 'X', 'T'),
 ('X', 'T', 'EOS'),
 ('T', 'EOS', 'EOS')]
'''

I tried this, and it gave me a convenient format like the one you asked for in the question.
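A likely reason the padding shows up only once: `ngrams` pads the sequence it receives, and in the question the whole book is passed as one flat token list, so only the very start and very end get `BOS`/`EOS`. Below is a minimal sketch of padding each sentence separately. It assumes sentences can be delimited by the `.` token that `word_tokenize` emits; the helper names `split_sentences` and `padded_trigrams` and the sample tokens are illustrative and not from the original post.

```python
from nltk import ngrams


def split_sentences(tokens, terminators=(".",)):
    """Group a flat token list into per-sentence lists, dropping the terminators."""
    sentences, current = [], []
    for tok in tokens:
        if tok in terminators:
            if current:  # skip empty sentences caused by repeated terminators
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:  # keep a trailing sentence that has no final terminator
        sentences.append(current)
    return sentences


def padded_trigrams(tokens):
    """Pad every sentence on its own so each one gets BOS and EOS symbols."""
    trigrams = []
    for sentence in split_sentences(tokens):
        trigrams.extend(
            ngrams(sentence, 3, pad_left=True, pad_right=True,
                   left_pad_symbol="BOS", right_pad_symbol="EOS")
        )
    return trigrams


# Hypothetical sample standing in for the tokenized book.
sample = ["in", "search", "of", "gold", ".", "a", "tale", "."]
result = padded_trigrams(sample)
for trigram in result:
    print(trigram)
```

Splitting on a bare `.` token also breaks at abbreviations such as `j.r.r .`, so running a sentence splitter such as `nltk.sent_tokenize` on the raw text before `word_tokenize` may give cleaner boundaries; that route needs the tokenizer data to be downloaded.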

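【Comments】:

This didn't work for me. It still doesn't detect any sentence endings.
I wonder whether each of my sentences needs to be in its own list for this to work. I'll try that.
Let me know whether it works. If your dataset is available online somewhere, I can try it myself too.
That didn't work either. In the end, I used a regular expression to replace the periods with EOS and BOS. Instead of inserting two symbols at each sentence start and end, I inserted one of each. The end result is fine.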
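The regular-expression workaround described in the last comment was not posted, so the following is only a sketch of what it might look like. It assumes a sentence ends at a period followed by whitespace or the end of the text; the function names `mark_sentence_boundaries` and `trigrams` and the sample text are hypothetical.

```python
import re


def mark_sentence_boundaries(text, bos="BOS", eos="EOS"):
    """Return lowercase tokens with one BOS before and one EOS after each sentence."""
    # Lowercase first so the uppercase markers inserted below are preserved.
    # A period followed by whitespace or end of text counts as a sentence end.
    marked = re.sub(r"\.(?=\s|$)", f" {eos} {bos} ", text.lower())
    tokens = [bos] + marked.split()
    if tokens[-1] == bos:
        tokens.pop()        # drop the dangling BOS left after the final period
    else:
        tokens.append(eos)  # close a last sentence that had no final period
    return tokens


def trigrams(tokens):
    """Plain sliding-window trigrams; the markers are already in the token list."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


# Hypothetical sample text standing in for hobbit.txt.
text = "The hobbit is a tale. A company of dwarves."
tokens = mark_sentence_boundaries(text)
print(tokens)
print(trigrams(tokens)[:3])
```

With a single marker on each side, some trigrams span two sentences (for example `('tale', 'EOS', 'BOS')`), which matches the one-symbol choice described in the comment. The pattern also treats the final period of an abbreviation such as `J.R.R.` as a sentence end.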