【Question title】: How to read and label line by line a text file using nltk.corpus in Python
【Posted】: 2014-06-13 06:56:08
【Question description】:

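My problem is to classify documents given two training data files, good_reviews.txt and bad_reviews.txt. To start, I need to load and label my training data, where each line is a document in itself and corresponds to one review. So my main task is to classify the reviews (lines) from given test data.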

I found a way to load and label the names data, as shown below:

from nltk.corpus import names
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

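So what I want is something similar that labels lines instead of words. I would expect the code to look something like this, which of course does not work, since .lines is invalid syntax: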

reviews = ([(review, 'good_review') for review in reviews.lines('good_reviews.txt')] +
           [(review, 'bad_review') for review in reviews.lines('bad_reviews.txt')])

I would like a result like this:

>>> reviews[0]
('This shampoo is very good blablabla...', 'good_review')

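【Comments】:

  • So, have you tried it? What was the result? Where is your code, and what exactly is wrong with it?
  • No, it does not work, because .lines is invalid syntax and is not defined in nltk.corpus.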

Tags: python nltk corpus


【Answer 1】:

If you are reading your own text files, this has little to do with NLTK; you can simply use file.readlines():

good_reviews = """This is great!
Wow, it amazes me...
An hour of show, a lifetime of enlightment
"""
bad_reviews = """Comme si, Comme sa.
I just wasted my foo bar on this.
An hour of s**t, ****.
"""
with open('/tmp/good_reviews.txt', 'w') as fout:
    fout.write(good_reviews)
with open('/tmp/bad_reviews.txt', 'w') as fout:
    fout.write(bad_reviews)

# Label every line (one review per line) with the class of its source file.
with open('/tmp/good_reviews.txt', 'r') as fingood, open('/tmp/bad_reviews.txt', 'r') as finbad:
    reviews = ([(review, 'good_review') for review in fingood.readlines()] +
               [(review, 'bad_review') for review in finbad.readlines()])

print(reviews)

[out]:

[('This is great!\n', 'good_review'), ('Wow, it amazes me...\n', 'good_review'), ('An hour of show, a lifetime of enlightment\n', 'good_review'), ('Comme si, Comme sa.\n', 'bad_review'), ('I just wasted my foo bar on this.\n', 'bad_review'), ('An hour of s**t, ****.\n', 'bad_review')]
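Note that readlines() keeps the trailing '\n' on every review, whereas the result asked for in the question has none. A minimal sketch of one way to get there, assuming it is acceptable to strip surrounding whitespace and skip blank lines; the helper name load_labeled_lines and the temporary sample files are hypothetical and not part of the original answer:

```python
import os
import tempfile


def load_labeled_lines(path, label):
    """Return (line, label) pairs, one per non-empty line of the file."""
    with open(path, encoding="utf-8") as fin:
        # Iterating over the file yields one line at a time;
        # strip() drops the trailing newline and surrounding whitespace.
        return [(line.strip(), label) for line in fin if line.strip()]


# Hypothetical sample data written to a temporary directory for the demo.
tmpdir = tempfile.mkdtemp()
good_path = os.path.join(tmpdir, "good_reviews.txt")
bad_path = os.path.join(tmpdir, "bad_reviews.txt")
with open(good_path, "w", encoding="utf-8") as fout:
    fout.write("This is great!\nWow, it amazes me...\n\n")
with open(bad_path, "w", encoding="utf-8") as fout:
    fout.write("I just wasted my foo bar on this.\n")

reviews = (load_labeled_lines(good_path, "good_review")
           + load_labeled_lines(bad_path, "bad_review"))
print(reviews[0])  # ('This is great!', 'good_review')
```

The list of (text, label) pairs has the same shape as the names example in the question, so it can be passed to whatever feature extraction step comes next.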

If you are going to use the NLTK movie reviews corpus, see Classification using movie review corpus in NLTK/Python

【Comments】:

  • This is exactly what I wanted. The link was also very helpful. Thanks!
  • What do you mean by not editing the text file??