How to do chapter analysis on books imported from nltk.corpus.gutenberg.fileids()

[Question title]: How to do chapter analysis on books imported from nltk.corpus.gutenberg.fileids()
[Posted]: 2022-01-06 08:19:12
[Question description]:

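I am new to Python. I am doing natural language processing on a novel, and I chose to load the book through nltk.corpus.gutenberg.fileids(). I am only using "Sense and Sensibility". I want to analyze each chapter. How can I split the whole book into parts? I noticed that a book loaded this way has a peculiar format, unlike a plain txt file.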

import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

When I print the book, it shows: ['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', ...]

sense = nltk.Text(nltk.corpus.gutenberg.words('austen-sense.txt'))
print(sense)

That gives yet another format, and I don't know what it means.

If I use another .txt source for the book, I also don't know how to split it into chapters. I have uploaded the book to a folder, and then:

text = 'senseText.txt'

[Comments]:

Tags: python nlp format nltk wordpress-gutenberg

[Solution 1]:

"Not like the txt format."

If you want something closer to the full text, try:

# raw() returns the whole book as one string; nltk.Text expects a token list, so it is not wrapped here
raw = nltk.corpus.gutenberg.raw('austen-sense.txt')

If you want individual sentences, you can use:

# sents() returns a list of sentences, each one a list of word tokens
sentences = nltk.corpus.gutenberg.sents('austen-sense.txt')

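The Gutenberg corpus will not break the text into chapters for you. (Many of the original sources are not divided into chapters to begin with.) If your particular text happens to contain section markers in its raw form, you can try searching for those, but the approach is specific to each text.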
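
As a rough illustration of searching for section markers, here is a minimal sketch that splits a raw string at chapter headings. It assumes each chapter starts with a line of the form "CHAPTER <number>", which appears to be how austen-sense.txt is laid out but will not hold for every book. The function name split_chapters and the sample text are made up for this example.

```python
import re

def split_chapters(raw_text):
    """Return {chapter_number: chapter_text} for a raw book string."""
    # Assumption: every chapter begins with a line such as "CHAPTER 12".
    # Other books may use "Chapter XII", "BOOK I", etc. and need another pattern.
    heading = re.compile(r"^CHAPTER[ \t]+(\d+)[ \t]*$", flags=re.MULTILINE)

    # Because the pattern has a capturing group, re.split keeps the numbers:
    # [front_matter, "1", body_1, "2", body_2, ...]
    parts = heading.split(raw_text)

    chapters = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        chapters[int(number)] = body.strip()
    return chapters

# Tiny stand-in for the real book so the sketch runs without any download.
sample = (
    "[Sample Title by Some Author 1811]\n\n"
    "CHAPTER 1\n\n\nFirst chapter text.\n\n"
    "CHAPTER 2\n\n\nSecond chapter text.\n"
)
chapters = split_chapters(sample)
print(sorted(chapters))
```

With NLTK, the same function could be fed the string returned by nltk.corpus.gutenberg.raw('austen-sense.txt'), or the contents of a local file such as senseText.txt read with open(...).read(). Each chapter body is then an ordinary string that can be tokenized and analyzed on its own.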

[Comments]: