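【Posted】: 2020-07-06 21:11:25
【Question】:
I want to build a bag-of-words model, but with relative frequencies computed using the nltk package. My data is stored in a pandas DataFrame.
Here is my data: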
text title authors label
0 On Saturday, September 17 at 8:30 pm EST, an e... Another Terrorist Attack in NYC…Why Are we STI... ['View All Posts', 'Leonora Cravotta'] Real
1 Story highlights "This, though, is certain: to... Hillary Clinton on police shootings: 'too many... ['Mj Lee', 'Cnn National Politics Reporter'] Real
2 Critical Counties is a CNN series exploring 11... Critical counties: Wake County, NC, could put ... ['Joyce Tseng', 'Eli Watkins'] Real
3 McCain Criticized Trump for Arpaio’s Pardon… S... NFL Superstar Unleashes 4 Word Bombshell on Re... [] Real
4 Story highlights Obams reaffirms US commitment... Obama in NYC: 'We all have a role to play' in ... ['Kevin Liptak', 'Cnn White House Producer'] Real
5 Obama weighs in on the debate\n\nPresident Bar... Obama weighs in on the debate ['Brianna Ehley', 'Jack Shafer'] Real
I have already tried converting it to strings:
import nltk
import numpy as np
import random
import bs4 as bs
import re
data = df.astype(str)
data
However, when I try to tokenize the text, I get this error:
corpus = nltk.sent_tokenize(data['text'])
TypeError: expected string or bytes-like object
So it doesn't seem to work :( Does anyone know how to tokenize the sentences in each row of the ['text'] column?
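The TypeError above is consistent with a whole pandas Series being passed to a function that expects a single string. A minimal sketch of tokenizing row by row follows. The two-row frame is made-up sample data, and an untrained PunktSentenceTokenizer is an assumption made here only so that the snippet runs without the pretrained Punkt model that nltk.sent_tokenize relies on.

```python
import pandas as pd
from nltk.tokenize import PunktSentenceTokenizer

# Made-up stand-in for the asker's DataFrame (only the 'text' column matters here).
df = pd.DataFrame({
    "text": [
        "Obama spoke in New York. The crowd listened closely.",
        "The county could decide the race.",
    ],
    "label": ["Real", "Real"],
})
data = df.astype(str)

# Assumption: an untrained Punkt tokenizer, which needs no downloaded model.
# With the pretrained model installed, nltk.sent_tokenize could be used instead.
tokenizer = PunktSentenceTokenizer()

# Series.apply calls the tokenizer once per row, so each call receives a str.
sentences_per_row = data["text"].apply(tokenizer.tokenize)

print(sentences_per_row.tolist())
```

The result is a Series of lists, one list of sentences per row, which can be kept in a variable or assigned to a column.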
【Comments】:
-
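data['text'] is a pandas Series, not a string. You should probably try something like data['token_text'] = data['text'].apply(sent_tokenize) to add the nltk tokenization result as a new column. See stackoverflow.com/questions/44173624/… for a possible duplicate. -
I tried that, but I got this error: NameError: name 'sent_tokenize' is not defined, even though I have imported the nltk library @Beinje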
-
According to the nltk documentation, the sent_tokenize function is part of the nltk.tokenize module. So you need to replace nltk.sent_tokenize() with nltk.tokenize.sent_tokenize() -
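A NameError like the one reported is what one would expect when the bare name sent_tokenize is used after only `import nltk`, because that import binds just the name nltk. The earlier TypeError suggests that nltk.sent_tokenize itself did resolve. A short sketch of the ways the function can be referenced; nothing is tokenized here, so no Punkt model is needed to run it.

```python
import nltk
from nltk.tokenize import sent_tokenize  # binds the bare name used in .apply(sent_tokenize)

# After the two imports above, each of these names refers to the sentence tokenizer.
references = {
    "sent_tokenize": sent_tokenize,
    "nltk.tokenize.sent_tokenize": nltk.tokenize.sent_tokenize,
    "nltk.sent_tokenize": nltk.sent_tokenize,
}

for name, func in references.items():
    print(name, callable(func))
```

Whichever spelling is used, actually calling the function on text still requires the Punkt sentence model to be installed (for example through nltk.download).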
Do you know how to tokenize the words in a pandas DataFrame without creating a new column? I'm quite confused.. (sorry, I'm still new to Python) @Beinje
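One way to read this follow-up, together with the original goal of a bag-of-words with relative frequencies, is to keep the tokens in an ordinary variable instead of assigning them to the DataFrame, and then count them. A rough sketch under several assumptions: the sample rows are made up, the text is lowercased, relative_frequencies is a hypothetical helper, and wordpunct_tokenize is chosen because it is regex-based and needs no downloaded model (nltk.word_tokenize would be a possible substitute). FreqDist.freq returns a sample's count divided by the total number of counted tokens.

```python
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Made-up sample rows standing in for data['text'].
data = pd.DataFrame({"text": ["the cat sat", "the dog sat down"]})

# Tokenize each row into a plain Python list; `data` itself is left unchanged.
token_lists = [wordpunct_tokenize(text.lower()) for text in data["text"]]


def relative_frequencies(tokens):
    """Hypothetical helper: map each word to count / total tokens."""
    fd = FreqDist(tokens)
    return {word: fd.freq(word) for word in fd}


# Corpus-level relative frequencies over all rows together.
relative = relative_frequencies(
    token for tokens in token_lists for token in tokens
)

# Per-document relative frequencies as a bag-of-words matrix (rows = documents).
# Words absent from a document come out as NaN and are filled with 0.0.
bow = pd.DataFrame(
    [relative_frequencies(tokens) for tokens in token_lists]
).fillna(0.0)

print(relative)
print(bow)
```

Here the corpus-level dictionary answers "how common is this word overall", while each row of bow sums to 1 and describes a single document.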