R: write a function to get unigrams in a data frame (answers)

【Question title】: R: Write a function to get unigrams in a data frame
【Posted】: 2023-04-10 14:04:01
【Question】:

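I want to write a function that returns the number of unigrams (single words) in a text. However, my current function does not work the way I want.
Here are my function and a sample dataset: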

library(ngram)
library(tidyverse)

#dataframe
df<-tribble(~text,
            "This sentence",
            "I am going to luch",
            "This is a really nice and sunny day")

# function: intended to count the unigrams in each text
get_unigrams <- function(text) {
  unigram <- ngram(text, n = 1) %>% get.ngrams() %>% length()

  return(unigram)
}

However, calling it inside mutate() gives me a very strange result:

df %>% mutate(n = get_unigrams(text))

# A tibble: 3 x 2
  text                                    n
  <chr>                               <int>
1 This sentence                          14
2 I am going to luch                     14
3 This is a really nice and sunny day    14

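Every sentence gets the same count. I think this is because all three rows of text are pasted together and treated as a single text.
However, I would like to get this result: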

# A tibble: 3 x 2
  text                                    n
  <chr>                               <int>
1 This sentence                           2
2 I am going to luch                      5
3 This is a really nice and sunny day     8
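
The suspicion that the rows are pooled can be checked without the package. The following is a minimal base R sketch, assuming words are separated by single spaces; it does not call ngram() itself, it only shows that the 14 reported above matches the number of distinct words across all three rows taken together.

```r
# Sample texts from the question, as a plain character vector
text <- c("This sentence",
          "I am going to luch",
          "This is a really nice and sunny day")

# Paste all rows into one string, as if they were a single document
pooled <- paste(text, collapse = " ")

# Split on spaces and count the words
words <- strsplit(pooled, " ", fixed = TRUE)[[1]]
length(words)          # 15 words in total
length(unique(words))  # 14 distinct words ("This" occurs twice)
```

Because mutate() passes the whole column to the function in one call, a function that returns a single number gets that number recycled to every row.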

Can someone help me?
I cannot see the error in my function.
Thank you very much!

Update:

I found a (temporary) solution:

get_unigrams <- function(text) {
  # apply the counter to each element of the vector separately
  sapply(text, function(text) {
    unigram <- ngram(text, n = 1) %>% get.ngrams() %>% length()

    return(unigram)
  })
}

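However, the sapply() solution is very slow because it processes each row separately, and my data frame has more than 100,000 rows.
Can someone help me speed this up, for example with a vectorized function?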

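【Comments on the question】:

As a matter of style, I suggest renaming your function. get_unigrams sounds like it returns a vector or list of all the unigrams, not the number of unigrams. For clarity and readability, consider renaming it to count_unigrams or something similar.

Tags: r function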

【Solution 1】:

Another solution, based on stringr::str_count:

library(tidyverse)

df<-tribble(~text,
            "This sentence",
            "I am going to luch",
            "This is a really nice and sunny day")

df %>% 
  mutate(n = str_count(text, "\\w+"))

#> # A tibble: 3 × 2
#>   text                                    n
#>   <chr>                               <int>
#> 1 This sentence                           2
#> 2 I am going to luch                      5
#> 3 This is a really nice and sunny day     8
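
Note that str_count(text, "\\w+") counts every word, whereas the pooled result in the question (14 for 15 words) suggests that length(get.ngrams(...)) counts distinct words. The two agree on the sample data but differ for any row that repeats a word. A minimal base R sketch of both counts, assuming words are separated by whitespace (punctuation handling is not considered here):

```r
# Sample texts from the question, as a plain character vector
text <- c("This sentence",
          "I am going to luch",
          "This is a really nice and sunny day")

# Split every string on whitespace in a single vectorised call
tokens <- strsplit(trimws(text), "\\s+")

# Total words per row
n_words <- lengths(tokens)

# Distinct words per row
n_distinct <- vapply(tokens, function(w) length(unique(w)), integer(1))

n_words     # 2 5 8
n_distinct  # 2 5 8 (no row in the sample repeats a word)
```

Both vectors can be assigned as columns directly, for example mutate(n = lengths(strsplit(trimws(text), "\\s+"))). An NA in the input comes back from strsplit() as a single NA element and would be counted as one word, so missing values need separate handling.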

【Comments】:

【Solution 2】:

Use rowwise(). See ?rowwise for more information.

df %>% rowwise() %>% 
  mutate(n=get_unigrams(text))

  text                                    n
  <chr>                               <int>
1 This sentence                           2
2 I am going to luch                      5
3 This is a really nice and sunny day     8

Another solution (using base R) is:

df$n <- apply(df, 1, get_unigrams)

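【Comments】:

Yes, rowwise() works. However, I have a very large data frame, and rowwise() increases the computation time enormously.
Right, let me know whether the apply solution runs faster.
I used the sapply function; see my update.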
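
To compare a per-row loop with a vectorised call on data of roughly this size, something like the following sketch could be used. The sentences are synthetic, count_one is a hypothetical stand-in for a per-element counter (it is not the ngram-based function from the question), and the timings depend on the machine, so none are quoted here.

```r
set.seed(1)
vocab <- c("this", "is", "a", "really", "nice", "and", "sunny", "day")

# 100,000 made-up sentences of 1 to 8 words each
text <- vapply(
  sample(1:8, 1e5, replace = TRUE),
  function(k) paste(sample(vocab, k, replace = TRUE), collapse = " "),
  character(1)
)

# Hypothetical per-element counter, called once per row
count_one <- function(x) length(strsplit(x, " ", fixed = TRUE)[[1]])

# Row-by-row loop versus one vectorised call
t_loop <- system.time(
  n_loop <- vapply(text, count_one, integer(1), USE.NAMES = FALSE)
)
t_vec <- system.time(
  n_vec <- lengths(strsplit(text, " ", fixed = TRUE))
)

identical(n_loop, n_vec)  # both approaches give the same counts
c(loop = t_loop[["elapsed"]], vectorised = t_vec[["elapsed"]])
```

The same pattern works for checking any candidate replacement: confirm the results are identical first, then compare the elapsed times.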