[Posted]: 2020-12-29 21:23:39
[Question]:
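I am new to quantitative text analysis and am trying to extract the keywords associated with a particular classification category from the output of a Naive Bayes classifier. I am running the example below, which classifies movie reviews as positive or negative. I would like two vectors, one containing the keywords associated with the positive category and one containing those associated with the negative category. Am I right that I should be looking at the "estimated feature scores" in the summary() output, and if so, how do I interpret them?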
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# get training set
dfmat_training <- corpus_subset(corp_movies, id_numeric %in% id_train) %>%
    dfm(remove = stopwords("english"), stem = TRUE)
# get test set (documents not in id_train)
dfmat_test <- corpus_subset(corp_movies, !id_numeric %in% id_train) %>%
    dfm(remove = stopwords("english"), stem = TRUE)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
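To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes the feature scores are per-class word likelihoods P(word | class), which I have not confirmed. The matrix `lik` and its values are hypothetical toy data, not output from the model above; with the fitted model, a class-by-feature matrix of this shape might be available as `tmod_nb$param`, but that depends on the quanteda.textmodels version and is an assumption on my part.

```r
# Toy class-by-feature likelihood matrix (hypothetical values, not model output).
# Rows are classes, columns are features; each row sums to 1.
lik <- rbind(
  neg = c(bad = 0.40, great = 0.10, film = 0.35, bore = 0.15),
  pos = c(bad = 0.10, great = 0.45, film = 0.40, bore = 0.05)
)

# Log ratio of the two likelihoods: large positive values favour "pos",
# large negative values favour "neg", values near zero are uninformative.
log_ratio <- log(lik["pos", ] / lik["neg", ])

# Take the two most extreme features in each direction as the class keywords.
n_keywords <- 2
keywords_pos <- names(sort(log_ratio, decreasing = TRUE))[seq_len(n_keywords)]
keywords_neg <- names(sort(log_ratio))[seq_len(n_keywords)]

print(keywords_pos)
print(keywords_neg)
```

Ranking by the ratio rather than by the raw likelihood is a deliberate choice in this sketch: a frequent word such as "film" has a high likelihood under both classes, so the raw score alone would not separate the categories.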
[Comments]:
Tags: r machine-learning text naivebayes quanteda