How to split a sentence from dataframes with nltk library? Answers

【Question title】: How to split a sentence from dataframes with nltk library?
【Posted】: 2020-07-06 21:11:25
【Question】:

I want to build a bag-of-words model, but with relative frequencies computed using the nltk package. My data is stored in a pandas DataFrame.

Here is my data:

text    title   authors label
0   On Saturday, September 17 at 8:30 pm EST, an e...   Another Terrorist Attack in NYC…Why Are we STI...   ['View All Posts', 'Leonora Cravotta']  Real
1   Story highlights "This, though, is certain: to...   Hillary Clinton on police shootings: 'too many...   ['Mj Lee', 'Cnn National Politics Reporter']    Real
2   Critical Counties is a CNN series exploring 11...   Critical counties: Wake County, NC, could put ...   ['Joyce Tseng', 'Eli Watkins']  Real
3   McCain Criticized Trump for Arpaio’s Pardon… S...   NFL Superstar Unleashes 4 Word Bombshell on Re...   []  Real
4   Story highlights Obams reaffirms US commitment...   Obama in NYC: 'We all have a role to play' in ...   ['Kevin Liptak', 'Cnn White House Producer']    Real
5   Obama weighs in on the debate\n\nPresident Bar...   Obama weighs in on the debate   ['Brianna Ehley', 'Jack Shafer']    Real

I have already tried converting it to strings:

import nltk 
import numpy as np
import random
import bs4 as bs
import re

data = df.astype(str)
data

However, when I try to tokenize the text, I get this error:

corpus = nltk.sent_tokenize(data['text'])

TypeError: expected string or bytes-like object

So it doesn't seem to work :( Does anyone know how to tokenize the sentences in each row of the column ['text']?

【Comments】:

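data['text'] is a pandas Series, not a string. You should probably try something like data['token_text'] = data['text'].apply(sent_tokenize) to add the nltk tokenization results as a new column. See stackoverflow.com/questions/44173624/… for a possible duplicate.
I tried that, but I got the error NameError: name 'sent_tokenize' is not defined, even though I have imported the nltk library @Beinje
According to the nltk documentation, the sent_tokenize function is part of the nltk.tokenize module. So you need to replace nltk.sent_tokenize() with nltk.tokenize.sent_tokenize()
Do you know how to tokenize the words in a pandas DataFrame without creating a new column? I'm confused.. (sorry, I'm still new to Python) @Beinje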

Tags: python pandas nlp

【Solution 1】:

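The nltk tokenizer functions (such as sent_tokenize and word_tokenize) expect a single string as input. You are getting the error because you are passing a pandas.Series object directly, so the tokenizer has to be applied to each row instead.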

Try this to tokenize into words:

data['Corpus'] = df.text.apply(lambda x: nltk.word_tokenize(x))

For sent_tokenize, change it to:

data['Sent'] = df.text.apply(lambda x: nltk.sent_tokenize(x))
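As a self-contained illustration of the per-row approach, here is a minimal sketch on a tiny made-up DataFrame (the two rows are hypothetical, not the asker's data). It imports sent_tokenize explicitly, which also avoids the NameError mentioned in the comments. The helper name split_sentences and the fallback to an untrained PunktSentenceTokenizer are assumptions of this sketch, added only so it still runs when the pretrained Punkt data has not been downloaded; the fallback is less accurate around abbreviations.

```python
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer


def split_sentences(text):
    """Split one string into a list of sentence strings."""
    try:
        # Uses the pretrained Punkt model (needs the nltk 'punkt' data).
        return sent_tokenize(text)
    except LookupError:
        # Assumption: if the pretrained data is missing, fall back to an
        # untrained Punkt tokenizer, which needs no download.
        return PunktSentenceTokenizer().tokenize(text)


# Hypothetical sample data standing in for the real 'text' column.
df = pd.DataFrame({"text": ["The cat sat. The dog barked.", "One sentence only."]})

# apply() calls the function once per row, so each call receives a string.
df["Sent"] = df["text"].apply(split_sentences)
print(df["Sent"].tolist())
```

Each cell of the new column then holds a Python list of sentences for that row.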

If you also want to remove punctuation:

data['no_punc'] = df.text.apply(lambda x: nltk.RegexpTokenizer(r'\w+').tokenize(x))
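Since the question asks for relative frequencies, one possible next step is to feed the punctuation-free tokens into nltk.FreqDist, whose freq() method returns a token's count divided by the total number of tokens. The sketch below uses hypothetical sample rows, and lowercasing the text before tokenizing is an assumption of this example, not something the answer prescribes.

```python
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

# Build the tokenizer once instead of once per row.
tokenizer = RegexpTokenizer(r"\w+")

# Hypothetical sample data standing in for the real 'text' column.
df = pd.DataFrame({"text": ["The cat sat. The cat ran!", "A dog barked."]})

# Assumption: lowercase first so that "The" and "the" count as one word.
df["no_punc"] = df["text"].apply(lambda x: tokenizer.tokenize(x.lower()))

# Count tokens across all rows, then convert counts to relative frequencies.
freq = FreqDist(token for tokens in df["no_punc"] for token in tokens)
relative = {word: freq.freq(word) for word in freq}

print(freq.N())         # total number of tokens
print(relative["cat"])  # share of all tokens that are "cat"
```

RegexpTokenizer needs no downloaded nltk data, which keeps this sketch runnable on a fresh install.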

【Discussion】:

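Do you know how to tokenize the words in a pandas DataFrame without creating a new column? I'm confused.. (sorry, I'm still new to Python)
Just apply it to the existing column like this - data['text'] = df.text.apply(lambda x: nltk.word_tokenize(x))
As I said in the comments, nltk.tokenize is a package, not a function; you should call nltk.tokenize.word_tokenize()/nltk.tokenize.sent_tokenize() rather than nltk.word_tokenize()/nltk.sent_tokenize()
@Beinje - Yes, that is one way to call it, but it is not the only solution. You can apply it directly to the pandas DataFrame as shown in my answer; you can read more about it here
@manojk I didn't know that, useful but confusing (I would have expected it to throw an error)!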
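For the follow-up about tokenizing without creating a new column, the two options discussed above can be sketched as follows. The sample rows are hypothetical, and RegexpTokenizer is swapped in for nltk.word_tokenize here only so that the sketch runs without downloading the Punkt data; the same pattern applies to word_tokenize.

```python
import pandas as pd
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")

# Hypothetical sample data standing in for the real 'text' column.
df = pd.DataFrame({"text": ["Hello there, world!", "Second row."]})

# Option 1: leave the DataFrame untouched and keep the tokens in a plain list.
token_lists = [tokenizer.tokenize(text) for text in df["text"]]

# Option 2: overwrite the existing column in place (the raw text is lost).
df["text"] = df["text"].apply(tokenizer.tokenize)

print(token_lists)
print(list(df.columns))  # still only the 'text' column
```

Option 1 is the safer choice if the original text is still needed later, since Option 2 replaces each string with its token list.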