Calculate sentiment of each row in a big dataset using R

【Question title】: Calculate sentiment of each row in a big dataset using R
【Posted】: 2020-08-22 07:54:39
【Question】:

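I am having trouble calculating the average sentiment of each row in a relatively large dataset (N=36140). My dataset contains review data for an app from the Google Play Store (each row represents one review), and I would like to calculate the sentiment of each review using the sentiment_by() function. The problem is that this function takes a very long time to run.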

Here is the link to my dataset in .csv format:

https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing

I have tried the following code:

library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment=sentiment_by(e_data$review)

I then get the following warning message (after cancelling the process once it had run for more than 10 minutes):

Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory.  It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.

I have also tried using the get_sentences() function with the code below, but sentiment_by() still takes a long time to finish:

library(magrittr)  # provides the %>% pipe used below
e_sentences = e_data$review %>%
  get_sentences()
e_sentiment = sentiment_by(e_sentences)

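I have other datasets of Google Play Store review data, and over the past month I have been using the sentiment_by() function on them; it computed the sentiment very quickly. It is only since yesterday that the calculation has been taking this long.

Is there a way to quickly calculate the sentiment of each row in a big dataset?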

【Comments】:

Tags: r sentiment-analysis sentimentr

【Solution 1】:

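The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it suddenly takes much longer when you significantly increase the size of the dataset. Presumably it is comparing every pair of reviews in some way?

I looked through the help file (?sentiment), and it does not seem to do anything that depends on pairs of reviews, so this is a bit odd.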
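One rough way to check a scaling claim like this is to time the function on increasingly large prefixes of the input and see how the elapsed time grows. The sketch below uses only base R; `time_by_size` is a hypothetical helper name, not part of sentimentr, and the sizes in the usage comment are arbitrary.

```r
# Hypothetical helper: time `fun` on the first n elements of `x`
# for each n in `sizes`, and return the timings as a data frame.
time_by_size <- function(fun, x, sizes) {
  sizes <- sizes[sizes <= length(x)]   # ignore sizes larger than the data
  elapsed <- vapply(sizes, function(n) {
    # system.time() reports user, system, and elapsed (wall-clock) seconds
    system.time(fun(x[seq_len(n)]))[["elapsed"]]
  }, numeric(1))
  data.frame(n = sizes, elapsed = elapsed)
}

# Usage sketch (assumes sentimentr is installed and `reviews` is a
# character vector of review texts):
# timings <- time_by_size(sentimentr::sentiment_by, reviews,
#                         c(100, 200, 400, 800))
# timings
```

If doubling `n` roughly doubles `elapsed`, growth is about linear; if it roughly quadruples, growth is about quadratic. Single timings are noisy, so this is only an indication.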

library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10],sentiment_by))
x1[,element_id:=.I]
x2 <- sentiment_by(reviews[1:10])

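Calling sentiment_by on each review separately (x1) and on the whole vector at once (x2) produces effectively the same output, which suggests there is a bug in the sentimentr package that makes it unnecessarily slow.

One solution is to process the reviews in batches. This will break the "by" functionality in sentiment_by, but I think you should be able to group the reviews yourself before (or after, since it does not seem to matter) passing them in.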
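To make the batching idea concrete, here is a small base-R illustration of how a vector can be cut into consecutive groups of a fixed size. The vector `reviews_demo` and the batch size of 3 are toy values chosen for the example, not taken from the dataset.

```r
# Toy stand-in for the review vector (illustrative data only)
reviews_demo <- c("r1", "r2", "r3", "r4", "r5", "r6", "r7")
batch_size <- 3

# seq_along() gives 1..7; dividing by batch_size and rounding up
# yields the batch index of each element: 1 1 1 2 2 2 3
batch_id <- ceiling(seq_along(reviews_demo) / batch_size)

# split() returns a named list with one character vector per batch
batches <- split(reviews_demo, batch_id)
lengths(batches)   # batch sizes: 3, 3, 1
```

The last batch may be shorter than the others, and the original order is preserved when the batches are concatenated again, so row positions in the combined result still line up with the input.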

batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  review_batches <- split(reviews, ceiling(seq_along(reviews)/batch_size))
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]
  x[]
}

batch_sentiment_by(reviews)

This takes about 45 seconds on my machine, and should scale as O(N) for larger datasets.
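If grouped scores are still needed after batching, one option is to aggregate the per-review results afterwards. The sketch below assumes a table with one row per review, an `ave_sentiment` column (the per-element score column that sentiment_by returns), and a hypothetical grouping column `app`; the numbers are made up for illustration.

```r
# Illustrative per-review scores; in practice these would come from the
# batched sentiment_by() result, with a grouping column attached.
scores <- data.frame(
  app           = c("A", "A", "B", "B", "B"),   # hypothetical grouping variable
  ave_sentiment = c(0.5, -0.1, 0.2, 0.4, 0.0)
)

# Unweighted mean of the per-review scores within each group
by_app <- aggregate(ave_sentiment ~ app, data = scores, FUN = mean)
by_app
```

This is a plain mean of review-level scores. sentiment_by with a `by` argument averages at the sentence level using its own averaging function, so the grouped numbers are not guaranteed to match exactly.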

【Comments】:

Works like a charm. Thank you very much. Your function calculates the sentiment in under 60 seconds for each app with 30k+ rows.