Reading sentences at random from an NLTK corpus: answers

【Question title】: Read randomly sentences from nltk corpus
【Posted】: 2021-07-01 13:30:04
【Question description】:

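I am working on a university project in which I have to read 50 sentences at random from an NLTK corpus (SemCor).

At the moment I can only read the first 50 sentences, as follows: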

import nltk
from nltk.corpus import semcor as corpus

def get_sentence_from_semcor(sentence_num):
    # Plain-text version of the sentence
    sentence = " ".join(corpus.sents()[sentence_num])
    # The same sentence with semantic (WordNet) tags
    tags = corpus.tagged_sents(tag="sem")[sentence_num]
    for curr_word in range(len(tags)):
        # Keep only chunks whose first child is a plain token and whose label is a WordNet Lemma
        if isinstance(tags[curr_word], nltk.Tree) and isinstance(tags[curr_word][0], str) and isinstance(tags[curr_word].label(), nltk.corpus.reader.wordnet.Lemma):
            word = tags[curr_word][0]
            target = tags[curr_word].label().synset()
            sentence_no_word = sentence.replace(word, "")
    # Returns the last matching word; raises UnboundLocalError if nothing matched
    return word, sentence_no_word, target

corpus_sentences = [get_sentence_from_semcor(i) for i in range(50)]

Any help on how to pick 50 sentences from the corpus at random?

【Question discussion】:

Look into random.sample in the random module. You pass it corpus_sentences and the number of samples you want back.
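A minimal sketch of the random.sample idea from the comment above, applied to sentence indexes rather than to an already-built corpus_sentences list, so that only the chosen sentences need to be processed. The helper name pick_random_indexes, the seed parameter, and the corpus size of 1000 are assumptions made for illustration; with SemCor loaded, the size would come from len(corpus.sents()).

```python
import random

def pick_random_indexes(total, k, seed=None):
    """Return k distinct indexes drawn uniformly from range(total)."""
    rng = random.Random(seed)  # passing a seed makes the draw reproducible
    # random.sample draws without replacement, so no index repeats.
    return rng.sample(range(total), k)

# Stand-in for len(corpus.sents()); hypothetical size, for illustration only.
total_sentences = 1000
indexes = pick_random_indexes(total_sentences, 50, seed=0)

# With SemCor loaded, the indexes would be used like this:
# corpus_sentences = [get_sentence_from_semcor(i) for i in indexes]
```

Note that random.sample raises ValueError if k is larger than the population, which is a useful guard on a small corpus.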

Tags: python nltk

【Solution 1】:

You want randomness, so let's import the random library:

import random

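Next we need to know our bounds. The earliest sentence we can pick is obviously sentence 1, that is, the sentence at index 0. For the upper bound, we count the sentences and subtract 1 to get the index of the last one.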

max_sentence = len(corpus.sents())-1
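The "minus 1" matters because random.randint includes both endpoints. The sketch below illustrates this on a tiny stand-in list (the list contents are hypothetical, not taken from SemCor) and shows random.randrange as an alternative that excludes its stop value and so needs no subtraction.

```python
import random

sentences = ["s0", "s1", "s2", "s3"]  # hypothetical stand-in for corpus.sents()
max_sentence = len(sentences) - 1     # index of the last sentence

# randint includes both endpoints, so every draw is a valid index.
draws_randint = [random.randint(0, max_sentence) for _ in range(1000)]

# randrange excludes the stop value, so len(sentences) can be passed directly.
draws_randrange = [random.randrange(len(sentences)) for _ in range(1000)]
```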

We create an empty list to hold our [pseudo]random numbers:

list_of_random_indexes = []

Then we put some numbers into it (50 in this case):

for i in range(50):
    list_of_random_indexes.append(random.randint(0, max_sentence))

And we finish with a modified version of your last line, which now iterates over our list of random numbers instead of a range:

corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

So, all together:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = []
for i in range(50):
    list_of_random_indexes.append(random.randint(0, max_sentence))
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

Or, a little cleaner:

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = [random.randint(0, max_sentence) for _ in range(50)]
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]

But since you probably don't want duplicate sentences, I would also check that an index is not already in the list before adding it.

import random
max_sentence = len(corpus.sents())-1
list_of_random_indexes = []
while len(list_of_random_indexes)<50:
    test_index = random.randint(0, max_sentence)
    if test_index not in list_of_random_indexes:
        list_of_random_indexes.append(test_index)
corpus_sentences = [get_sentence_from_semcor(i) for i in list_of_random_indexes]
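To gauge how much the duplicate check matters, the chance that 50 independent draws contain at least one repeat can be computed directly (the same arithmetic as the birthday problem). This is a rough sketch: the function name prob_duplicate and the corpus sizes in the loop are assumptions chosen for illustration, not the actual size of SemCor.

```python
def prob_duplicate(total, k):
    """Probability that k independent uniform draws from `total` items contain a repeat."""
    if k > total:
        return 1.0  # pigeonhole: a repeat is unavoidable
    p_all_distinct = 1.0
    for i in range(k):
        # The (i+1)-th draw must avoid the i values already drawn.
        p_all_distinct *= (total - i) / total
    return 1.0 - p_all_distinct

# Hypothetical corpus sizes, for illustration only.
for total in (100, 1_000, 100_000):
    print(total, round(prob_duplicate(total, 50), 4))
```

The smaller the corpus relative to the sample, the more likely a repeat becomes, which is when the explicit check pays off.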

【Discussion】:

【Solution 2】:

You could try something like this:

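import nltk
import numpy as np

n_times = 1  # hypothetical value: how many batches of 50 to draw
length = len(nltk.corpus.semcor.sents()) - 50
for _ in range(n_times):
    # np.random.randint excludes the upper bound, so start is in [0, length - 1]
    start = np.random.randint(0, length)
    # Overwritten on every pass; collect the batches if you need all of them
    corpus_sentences = [get_sentence_from_semcor(i) for i in range(start, start + 50)]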

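The code iterates n_times, each time producing a batch of 50 consecutive sentences. 'start' is a random integer in the range [0, length). (This assumes you know the total length of the corpus.)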
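The window idea can be sketched without the corpus at hand. This version swaps np.random.randint for the standard library's random.randint, which includes its upper bound, so the very last window can also be drawn. The helper name random_window, the corpus size of 1000, and the batch count of 3 are assumptions made for illustration.

```python
import random

def random_window(total, size):
    """Return the indexes of `size` consecutive items starting at a random offset."""
    if size > total:
        raise ValueError("window is larger than the corpus")
    # randint is inclusive, so the last valid start (total - size) can be drawn.
    start = random.randint(0, total - size)
    return range(start, start + size)

total_sentences = 1000  # hypothetical stand-in for len(corpus.sents())
windows = [random_window(total_sentences, 50) for _ in range(3)]  # 3 batches, chosen arbitrarily
```

Each window is a run of neighbouring sentences, so this gives a random starting point rather than 50 independently chosen sentences.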

【Discussion】: