【Question title】: Converting 'CategorizedPlaintextCorpusReader' into dataframe
【Posted】: 2018-02-16 22:45:54
【Question】:

I want to convert the movie_reviews dataset from nltk.corpus into a data frame so that I can use it for sentiment analysis. When I pass the corpus to pandas, I get an error:

    from nltk.corpus import movie_reviews
    import pandas as pd

    mr=movie_reviews
    movie=pd.DataFrame(mr)

ValueError: DataFrame constructor not properly called!

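【Comments】:

  • @alvas, now that you have shown how to do it, perhaps you should remove your "not possible" claim...
  • Ah, that should have been "I don't think it's possible to simply initialize it as such" =)
  • "I don't think it's possible to simply initialize it this way. NLTK's CategorizedPlaintextCorpusReader object isn't a pandas dtype."

Tags: python python-3.x pandas nltk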


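【Solution 1】:

NLTK's CategorizedPlaintextCorpusReader object is not a pandas dtype, so it cannot be passed to the DataFrame constructor directly.

That said, you can convert the movie reviews into a list of tuples and then populate the data frame like this: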

import pandas as pd

from nltk.corpus import movie_reviews as mr

reviews = []
for fileid in mr.fileids():
    tag, filename = fileid.split('/')
    reviews.append((filename, tag, mr.raw(fileid)))

df = pd.DataFrame(reviews, columns=['filename', 'tag', 'text'])

[out]:

>>> df.head()
          filename  tag                                               text
0  cv000_29416.txt  neg  plot : two teen couples go to a church party ,...
1  cv001_19502.txt  neg  the happy bastard's quick movie review \ndamn ...
2  cv002_17424.txt  neg  it is movies like these that make a jaded movi...
3  cv003_12683.txt  neg   " quest for camelot " is warner bros . ' firs...
4  cv004_12641.txt  neg  synopsis : a mentally unstable man undergoing ...
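
The loop above can also be wrapped in a small helper, which may be convenient if you want to reuse it for other categorized corpora. This is only a sketch: `corpus_to_dataframe` is a hypothetical name, not part of NLTK or pandas, and it assumes the reader exposes `categories()`, `fileids(category)` and `raw(fileid)`, as NLTK's categorized corpus readers do. Unlike the code above, it asks the reader for the category instead of splitting the file id, and it keeps the full file id in the first column.

```python
import pandas as pd


def corpus_to_dataframe(reader):
    """Build one row per file from a categorized corpus reader.

    `reader` is assumed to provide categories(), fileids(category)
    and raw(fileid), like nltk.corpus.movie_reviews.
    """
    rows = []
    for category in reader.categories():
        for fileid in reader.fileids(category):
            # One tuple per file: (file id, category label, full raw text)
            rows.append((fileid, category, reader.raw(fileid)))
    return pd.DataFrame(rows, columns=["fileid", "tag", "text"])


# Usage (needs the corpus data, e.g. after nltk.download("movie_reviews")):
# from nltk.corpus import movie_reviews
# df = corpus_to_dataframe(movie_reviews)
```

Passing the reader in as an argument keeps the helper independent of any particular corpus, so it can be tried on a small stand-in object before running it on the full dataset.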

To process the text column, see "How to NLTK word_tokenize to a Pandas dataframe for Twitter data?"
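
As a rough starting point for the sentiment-analysis use case, the sketch below adds a token column and a numeric label. It assumes a data frame with the same `tag` and `text` columns as `df` above; the two rows here are made-up stand-ins, and the column names `tokens`, `label` and `n_tokens` are my own choice. Because the raw review text shown in the output above is already separated by spaces, a plain whitespace split is used here instead of `word_tokenize`; that is a simplification, not a general-purpose tokenizer.

```python
import pandas as pd

# Made-up two-row stand-in with the same columns as the df built above.
df = pd.DataFrame(
    [
        ("example_neg.txt", "neg", "plot : two teen couples go to a church party ,"),
        ("example_pos.txt", "pos", "a great film ."),
    ],
    columns=["filename", "tag", "text"],
)

# Split each review on whitespace to get a list of tokens per row.
df["tokens"] = df["text"].str.split()

# Encode the tag as 1 for positive and 0 for negative reviews.
df["label"] = (df["tag"] == "pos").astype(int)

# Token count per review, handy for a quick sanity check.
df["n_tokens"] = df["tokens"].str.len()
```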

【Comments】:

    【Solution 2】:

    Try this simplified answer:

    import pandas as pd
    from nltk.corpus import reuters  # Reuters corpus reader

    reuters_cat = reuters.categories()  # list of all category names

    docs = []
    for cat in reuters_cat:
        # reuters.sents() returns tokenized sentences, so each row below is one sentence
        sents = reuters.sents(categories=cat)
        for sent in sents:
            docs.append((' '.join(sent), cat))  # (sentence text, category) tuple

    # Build the data frame from the list of tuples
    reuters_df = pd.DataFrame(docs, columns=['document', 'category'])

    reuters_df.head()
    

    Sorry for not including a sample of the data frame head; I'm still new to Stack Overflow.
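
Note that the loop above iterates over `reuters.sents(...)`, so each row holds one sentence rather than one whole document. If one row per document is wanted, a variant along the following lines might work. It is a sketch under assumptions: `documents_to_dataframe` is a hypothetical helper name, and the reader is assumed to provide `fileids()`, `categories(fileid)` and `raw(fileid)`, as NLTK's categorized readers do. Since a Reuters document can belong to several categories, the categories are stored as a list.

```python
import pandas as pd


def documents_to_dataframe(reader):
    """Build one row per document, with its categories stored as a list."""
    rows = []
    for fileid in reader.fileids():
        # categories(fileid) returns every category assigned to that file
        rows.append((fileid, reader.categories(fileid), reader.raw(fileid)))
    return pd.DataFrame(rows, columns=["fileid", "categories", "document"])


# Usage (needs the corpus data, e.g. after nltk.download("reuters")):
# from nltk.corpus import reuters
# reuters_df = documents_to_dataframe(reuters)
```

If one row per (document, category) pair is preferred, `reuters_df.explode("categories")` should produce that layout in recent pandas versions.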

    【Comments】:
