Converting 'CategorizedPlaintextCorpusReader' into a dataframe

【Question title】: Converting 'CategorizedPlaintextCorpusReader' into dataframe
【Posted】: 2018-02-16 22:45:54
【Question】:

I want to convert the movie_reviews dataset from nltk.corpus into a dataframe so that I can use the data for sentiment analysis. Passing the corpus straight to pandas raises an error:

    from nltk.corpus import movie_reviews
    import pandas as pd

    mr=movie_reviews
    movie=pd.DataFrame(mr)

ValueError: DataFrame constructor not properly called!

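【Question discussion】:

@alvas, now that you have shown how to do it, maybe you should delete your "not possible" statement...
Ah, that should have been "I don't think it's possible to simply initialize it that way" =)
I don't think it's possible to simply initialize it that way. NLTK's CategorizedPlaintextCorpusReader object is not a pandas dtype.

Tags: python python-3.x pandas nltk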

【Solution 1】:

NLTK's CategorizedPlaintextCorpusReader object is not a data type that the pandas DataFrame constructor accepts.
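As a rough illustration of where the error comes from, the sketch below passes an arbitrary object to the constructor. `NotTabular` is a hypothetical stand-in, not an NLTK class; it assumes the corpus reader is treated like any other object that is not a list, dict, or array.

```python
import pandas as pd


class NotTabular:
    """Hypothetical stand-in for a corpus reader: not a list, dict, or array."""


try:
    pd.DataFrame(NotTabular())
    message = None
except ValueError as err:
    # pandas cannot infer rows or columns from an arbitrary object
    message = str(err)

print(message)
```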

That said, you can convert the movie reviews into a list of tuples and then populate the dataframe like this:

import pandas as pd

from nltk.corpus import movie_reviews as mr

reviews = []
for fileid in mr.fileids():
    tag, filename = fileid.split('/')
    reviews.append((filename, tag, mr.raw(fileid)))

df = pd.DataFrame(reviews, columns=['filename', 'tag', 'text'])

[out]:

>>> df.head()
          filename  tag                                               text
0  cv000_29416.txt  neg  plot : two teen couples go to a church party ,...
1  cv001_19502.txt  neg  the happy bastard's quick movie review \ndamn ...
2  cv002_17424.txt  neg  it is movies like these that make a jaded movi...
3  cv003_12683.txt  neg   " quest for camelot " is warner bros . ' firs...
4  cv004_12641.txt  neg  synopsis : a mentally unstable man undergoing ...

To process the text column, see "How to NLTK word_tokenize to a Pandas dataframe for Twitter data?"
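A minimal sketch of one way to tokenize the text column. The two-row frame and its text values are made up for illustration, and a plain whitespace split is assumed to be enough because the review text shown above already has punctuation separated by spaces. For untokenized text, `nltk.word_tokenize` could be applied instead, provided its tokenizer data is installed.

```python
import pandas as pd

# Hypothetical frame shaped like the df above; the rows are invented.
df = pd.DataFrame(
    [('a.txt', 'neg', 'the plot is thin , and the jokes fall flat .'),
     ('b.txt', 'pos', 'a warm , funny film .')],
    columns=['filename', 'tag', 'text'],
)

# Split each review on whitespace and store the token list in a new column.
df['tokens'] = df['text'].str.split()

# Number of tokens per review, handy for a quick sanity check.
df['n_tokens'] = df['tokens'].str.len()
```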

【Discussion】:

【Solution 2】:

Try this simplified answer:

import pandas as pd
from nltk.corpus import reuters  # Reuters corpus

reuters_cat = reuters.categories()  # List of category names

docs = []
for cat in reuters_cat:
    # reuters.sents() yields tokenized sentences, so each row below holds one sentence, not a whole document
    t1 = reuters.sents(categories=cat)
    for doc in t1:
        docs.append((' '.join(doc), cat))  # Append a (sentence text, category) tuple

# Build the data frame from the list of tuples
reuters_df = pd.DataFrame(docs, columns=['document', 'category'])

reuters_df.head()

Sorry for not including a sample of the dataframe head; I am still new to Stack Overflow.
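If one row per document is wanted rather than one row per sentence, a variation along these lines might work. It is only a sketch: `corpus_to_frame` and `TinyReader` are hypothetical names introduced here, and `TinyReader` is an in-memory stand-in so the example runs without downloading a corpus. The function relies only on the `categories()`, `fileids(categories=...)`, and `raw(fileid)` methods that NLTK's categorized corpus readers provide.

```python
import pandas as pd


def corpus_to_frame(reader):
    """Build one row per (document, category) pair from a categorized reader."""
    rows = []
    for cat in reader.categories():
        for fileid in reader.fileids(categories=cat):
            rows.append((fileid, cat, reader.raw(fileid)))
    return pd.DataFrame(rows, columns=['fileid', 'category', 'document'])


class TinyReader:
    """Hypothetical stand-in that mimics the three reader methods used above."""
    _docs = {
        'test/1': (['grain', 'wheat'], 'wheat prices rose'),
        'test/2': (['trade'], 'exports fell'),
    }

    def categories(self):
        return sorted({c for cats, _ in self._docs.values() for c in cats})

    def fileids(self, categories=None):
        return [f for f, (cats, _) in self._docs.items() if categories in cats]

    def raw(self, fileid):
        return self._docs[fileid][1]


demo_df = corpus_to_frame(TinyReader())
```

With the Reuters data installed, `corpus_to_frame(reuters)` should produce the same three columns. A document that belongs to several categories appears once per category.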

【Discussion】: