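【Posted】: 2018-05-06 09:43:24
【Question】:
I have a dataset in which only two columns are useful for training my model: the first is the news title and the second is the news category.
I ran the following training code successfully in Python: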
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
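So my question is: how do I supply a new set of data (for example, news titles only) and have the program predict the news category using Python sklearn commands?
P.S. My training data looks like this: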
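【Comments】:
- Have you tried the predict method of the MultinomialNB class? scikit-learn.org/stable/modules/generated/…. You trained it on the titles, and the output is the category. To use Naive Bayes on test data, apply the same feature transformation you used for training, then pass the result to the Naive Bayes classifier.
- Why not just use: y_predicted = nb.predict(x_test)?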
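The approach suggested in the comments could look roughly like the sketch below. It reuses the fitted vectorizer (calling transform, not fit_transform) so that new titles map onto the same vocabulary as the training data, then decodes the predicted integers back to category names with the fitted LabelEncoder. The toy titles, the category codes 'b' and 't', and the helper name predict_categories are assumptions made for illustration; they stand in for the real TITLE and CATEGORY columns and are not taken from the original dataset.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder


def normalize_text(s):
    """Same cleaning as in the question: lowercase, drop punctuation
    that is not word-internal, collapse repeated whitespace."""
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    return re.sub(r'\s+', ' ', s)


# Hypothetical toy data standing in for news['TITLE'] and news['CATEGORY'].
titles = [
    "Stocks fall as bank profits drop",
    "Central bank raises interest rates",
    "New smartphone launches with faster chip",
    "Software update fixes smartphone bug",
]
categories = ['b', 'b', 't', 't']

# Training: fit the vectorizer and the encoder once, on the training data.
vectorizer = CountVectorizer()
x = vectorizer.fit_transform([normalize_text(t) for t in titles])
encoder = LabelEncoder()
y = encoder.fit_transform(categories)
nb = MultinomialNB()
nb.fit(x, y)


def predict_categories(new_titles):
    """Hypothetical helper: predict category labels for unseen titles."""
    # transform (not fit_transform) keeps the vocabulary learned in training;
    # words never seen during training are simply ignored.
    x_new = vectorizer.transform([normalize_text(t) for t in new_titles])
    # nb.predict returns encoded integers; map them back to the labels.
    return encoder.inverse_transform(nb.predict(x_new))


new_titles = ["Bank profits rise", "Smartphone chip update"]
predicted = predict_categories(new_titles)
for title, category in zip(new_titles, predicted):
    print(title, '->', category)
```

The key point is that vectorizer, encoder, and nb must be the same fitted objects used during training; refitting any of them on the new titles would change the feature columns or label codes and make the predictions meaningless.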
Tags: python scikit-learn naivebayes