Tokenize a paragraph into sentences and then into words in NLTK

【Question title】: Tokenize a paragraph into sentences and then into words in NLTK
【Posted】: 2016-10-03 00:12:09
【Question】:

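I am trying to feed a whole paragraph into my word processor so that it is split first into sentences and then into words.

I tried the following code, but it does not work: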

    #text is the paragraph input
    sent_text = sent_tokenize(text)
    tokenized_text = word_tokenize(sent_text.split)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)

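However, this does not work and gives me errors. So how do I tokenize a paragraph into sentences and then into words?

Sample paragraph:

This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.

**Disclaimer:** This is just random text from the internet; I do not own the above content.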

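【Question comments】:

Can you post a sample of text?
@alvas It is just any random paragraph.
Show the input, because the code will differ depending on the encoding, shape and type of the input.
@alvas Here is the input. So what kinds of encoding, shape and input differences should be covered?
Show an actual sample input... If it is just plain English text (not social media, e.g. Twitter), you can simply do [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)], and using Python 3 should resolve most utf-8 issues. But if your input has a different encoding or format, you will run into more problems later.

Tags: python nltk

【Solution 1】: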

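import nltk

# Requires NLTK's Punkt tokenizer data to be downloaded beforehand
textsample = "This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child."

sentences = nltk.sent_tokenize(textsample)  # list of sentence strings
words = nltk.word_tokenize(textsample)  # flat list of tokens for the whole paragraph
print(sentences)
# Keep only purely alphabetic tokens; this drops punctuation and also
# tokens such as 'dark-brown' that contain a non-letter character
print([w for w in words if w.isalpha()])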

The last line above ensures that the output contains only words and no special characters. The sentences output is as follows:

['This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart.',
 "He sank down in despair at the child's feet.",
 'When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.',
 'At the same time with his ears and his eyes he offered a small prayer to the child.']

After removing the special characters, the words output is as follows:

['This',
 'thing',
 'seemed',
 'to',
 'overpower',
 'and',
 'astonish',
 'the',
 'little',
 'dog',
 'and',
 'wounded',
 'him',
 'to',
 'the',
 'heart',
 'He',
 'sank',
 'down',
 'in',
 'despair',
 'at',
 'the',
 'child',
 'feet',
 'When',
 'the',
 'blow',
 'was',
 'repeated',
 'together',
 'with',
 'an',
 'admonition',
 'in',
 'childish',
 'sentences',
 'he',
 'turned',
 'over',
 'upon',
 'his',
 'back',
 'and',
 'held',
 'his',
 'paws',
 'in',
 'a',
 'peculiar',
 'manner',
 'At',
 'the',
 'same',
 'time',
 'with',
 'his',
 'ears',
 'and',
 'his',
 'eyes',
 'he',
 'offered',
 'a',
 'small',
 'prayer',
 'to',
 'the',
 'child']
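Note that the `isalpha()` filter above also drops 'dark-brown', since the hyphen is not a letter. If that is not what you want, one possible alternative is to keep every token that contains at least one letter. This is only a sketch: `keep_wordlike` is a hypothetical helper name, and the token list is copied by hand from the first sentence rather than produced by a tokenizer here.

```python
def keep_wordlike(tokens):
    """Keep tokens that contain at least one letter (hypothetical helper)."""
    return [tok for tok in tokens if any(ch.isalpha() for ch in tok)]


# Hand-copied tokens for the first sentence; no tokenizer is called here
tokens = ['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish',
          'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded',
          'him', 'to', 'the', 'heart', '.']

# 'dark-brown' survives, while ',' and '.' are removed
print(keep_wordlike(tokens))
```

The trade-off is that fragments such as a separate "'s" token would also be kept, so which filter fits better depends on what you do with the words afterwards.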

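【Comments】:

【Solution 2】:

Here is a shorter version. It gives you a data structure containing each individual sentence and each token within that sentence. I prefer TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but take care not to lowercase the text until after this step, because lowercasing earlier may hurt the accuracy of sentence-boundary detection in messy text.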

from nltk.tokenize import TweetTokenizer, sent_tokenize

# input_text is the paragraph string to tokenize
tokenizer_words = TweetTokenizer()
# Build one list of tokens per sentence
tokens_sentences = [tokenizer_words.tokenize(t) for t in sent_tokenize(input_text)]
print(tokens_sentences)

This is what the output looks like; I cleaned it up so that the structure stands out:

[
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'], 
['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'], 
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'], 
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
]
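If you do need lowercase tokens, the point about casing can be sketched as follows: lowercase only after the sentences and tokens have been split, so the sentence tokenizer still sees the original capitalization. `lowercase_tokens` is a hypothetical helper name, and the nested list is copied by hand from the first two sentences of the output above.

```python
def lowercase_tokens(tokens_sentences):
    """Lowercase every token while keeping the per-sentence nesting."""
    return [[tok.lower() for tok in sent] for sent in tokens_sentences]


# First two sentences of the output shown above, copied by hand
tokens_sentences = [
    ['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the',
     'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to',
     'the', 'heart', '.'],
    ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
]

print(lowercase_tokens(tokens_sentences))
```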

【Comments】:

Thanks for the info about TweetTokenizer!

【Solution 3】:

You probably intended to loop over sent_text:

import nltk

sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)

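【Comments】:

reload(sys); sys.setdefaultencoding('utf8') is toxic code. If this is Python 3, it is simply redundant. Printing itself depends on the locale set on the user's machine.
@Nikhil, do not use the setdefaultencoding hack. Ask a new question describing the steps that lead to your encoding problem, and you will learn how to specify the file encoding when handling unicode.
This explains why it is a very bad idea.
Thanks for the warning :-)
Does anyone know how to keep the positions of the tokens?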
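On the token positions question, one possible approach (a sketch, not an NLTK feature) is to locate each token in the original string from left to right. `token_spans` is a hypothetical helper name, and it assumes the tokenizer returns tokens exactly as they appear in the text; a tokenizer that rewrites characters (quotes, for example) would need extra handling. Some NLTK tokenizers also offer a span_tokenize method, which is worth checking first.

```python
def token_spans(text, tokens):
    """Return (token, start, end) triples by searching text left to right."""
    spans = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


# Sentence and tokens copied by hand from the thread above
sentence = "He sank down in despair at the child's feet."
tokens = ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.']

for token, start, end in token_spans(sentence, tokens):
    print(token, start, end)
```

Each (start, end) pair is a slice into the original string, so `sentence[start:end]` gives back the token.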