
【Question title】: Trying to get random forest for text classification running
【Posted】: 2018-07-03 17:34:14
【Question】:

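I'm trying to get randomForest running for a school project. I want to build a text classifier that predicts a category (the label column) from some text.

Right now I'm stuck because something seems to be wrong with my document-term matrix. This is the error: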

> rfmodel <- randomForest(df$label, data = events_dtm)
Error in if (n == 0) stop("data (x) has 0 rows") : 
  argument is of length zero

This is what the code currently looks like. The data is representative.

library(tidyverse)
library(tidytext)
library(stringr)
library(caret)
library(tm)
library(dplyr)
library(randomForest)

text = c("this is a random text",
         "another rnd text",
         "hi there",
         "not so rnd",
         "what's that?",
         "kinda boring",
         "this is a random text",
         "another rnd text",
         "hi there",
         "not so rnd",
         "what's that?",
         "kinda boring")

label = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)

df <- data.frame(text= text, label=label)
df$label <- as.factor(df$label)
df$text <- as.character(df$text)

df$ID <- seq.int(nrow(df))

df <- df[1:5,]

as_tibble(df) %>%
  mutate(text = as.character(text)) -> type

data("stop_words")
type %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  mutate(word = SnowballC::wordStem(word)) -> type_tokens


type_tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = weightTfIdf) -> type_dtm


print(type_dtm)

rfmodel <- randomForest(df$label, data = type_dtm)

print(rfmodel)

dfT <- data.frame(text= text)
dfT$ID <- seq.int(nrow(dfT))

as_tibble(dfT) %>%
  mutate(text = as.character(text)) -> typeT

typeT %>%
  unnest_tokens(output = word, input = text) -> typeT

typeT %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = weightTfIdf) -> typeT

pred_test <- predict(rfmodel, newdata = dfT, type = "class")

print(pred_test)

Since I'm new to both random forests and R, there may well be conceptual mistakes. Any idea how to solve this?

Tags: r text-mining random-forest

【Solution 1】:

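There are several problems with your code:

First, your randomForest call: rfmodel <- randomForest(df$label, data = type_dtm)

You can't pass df$label and at the same time specify data = type_dtm, because label does not exist in type_dtm. You can solve this by merging the label information with type_dtm; search for how to do that. Second, randomForest does not accept sparse matrices, so you need to convert type_dtm into something it can work with, such as a dense matrix or a data frame. Third, you are trying to tell randomForest that y is label, but for that you either need the formula interface, like label ~ . together with data = ..., or you need to pass x and y explicitly as y = label and x = .... See ?randomForest for details.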
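The three points above can be sketched as follows. This is only a minimal illustration under some assumptions, not the asker's final code: the document-term matrix is built with tm's DocumentTermMatrix instead of tidytext's cast_dtm to keep the example short, tf-idf weighting and stemming are left out, and the names x, y, train, rf_xy and rf_formula are made up for this example. Note that in the original call df$label is passed positionally as x, which is probably why the "data (x) has 0 rows" check fails: a factor has no rows.

```r
library(tm)
library(randomForest)

# Toy data mirroring the question: 12 short documents, two classes.
text <- rep(c("this is a random text", "another rnd text", "hi there",
              "not so rnd", "what's that?", "kinda boring"), times = 2)
df <- data.frame(ID = seq_along(text),
                 text = text,
                 label = factor(rep(c(1, 2), times = 6)),
                 stringsAsFactors = FALSE)

# Sparse document-term matrix (assumption: built with tm directly).
type_dtm <- DocumentTermMatrix(VCorpus(VectorSource(df$text)))

# Point 2: randomForest needs a dense matrix or data frame, not a sparse one.
x <- as.matrix(type_dtm)

# Point 1: bring the labels together with the matrix. Matching on the
# document IDs (the row names) keeps rows and labels aligned even if
# some documents were dropped while building the matrix.
y <- df$label[match(rownames(x), df$ID)]

# Point 3, option A: explicit x / y interface.
set.seed(42)
rf_xy <- randomForest(x = x, y = y, ntree = 50)

# Point 3, option B: formula interface on a merged data frame.
# Terms such as "what's" are not valid variable names, so clean them first.
train <- as.data.frame(x)
names(train) <- make.names(names(train), unique = TRUE)
train$label <- y
rf_formula <- randomForest(label ~ ., data = train, ntree = 50)

print(rf_xy)
```

With a large vocabulary the dense matrix can become very big, so in practice it may be worth removing rare terms before the conversion. Any new data passed to predict has to be turned into a matrix with the same term columns as the training matrix.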

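Together, all of these issues lead to the error you are getting. Start solving them one by one, and post a new question if you get stuck again. Your code is a good start on a reproducible example, so +1 for that effort.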
