【Question title】: Using the Nested List Column Approach and Purrr Together with Tidytext::Unnest_Tokens
【Posted】: 2017-06-30 23:58:13
【Question】:

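I have a data frame of survey responses, with each row representing a different person. One column, "Text", holds the answers to an open-ended text question. I would like to use tidytext::unnest_tokens so that I can do text analysis row by row, including sentiment scores, word counts, and so on.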

Here is a simple data frame for this example:

Satisfaction<-c("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)

Then I converted the Text column to character...

df$Text<-as.character(df$Text)

Next I added an id column, grouped by it, tokenized the text, and nested the data frame.

df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)

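So far this seems fine, but how do I now use the purrr::map functions to work on the nested list column "word"? For example, what if I want to use dplyr::mutate to create a new column holding the word count for each row?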

Also, is there a better way to nest the data frame so that only the "Text" column becomes a nested list?

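【Comments】:

  • It is not very clear what you want. You can do text analysis without using tidyr::nest at all; just stop after unnest_tokens. If you only want to nest the word column, you can do nest(word), but for that to work you have to ungroup the data frame first (or not group by id in the first place).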

Tags: r dplyr tidyr purrr tidytext


【Answer 1】:

I like using purrr::map to do modeling for different groups, but for what you are describing, I think you can stick with plain dplyr.

You can set up your data frame like this:

library(dplyr)
library(tidytext)

Satisfaction <- c("Satisfied",
                  "Satisfied",
                  "Dissatisfied",
                  "Satisfied",
                  "Dissatisfied")

Text <- c("I'm very satisfied with the services",
          "Your service providers are always late which causes me a lot of frustration", 
          "You should improve your staff training, service providers have bad customer service",
          "Everything is great!",
          "Service is bad")

Gender <- c("M","M","F","M","F")

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(Satisfaction, Text, Gender)

tidy_df <- df %>% 
    mutate(id = row_number()) %>% 
    unnest_tokens(word, Text)

Then, to find, for example, the number of words in each response, you can use group_by and mutate:

tidy_df %>%
    group_by(id) %>%
    mutate(num_words = n()) %>%
    ungroup()
#> # A tibble: 37 × 5
#>    Satisfaction Gender    id      word num_words
#>           <chr>  <chr> <int>     <chr>     <int>
#> 1     Satisfied      M     1       i'm         6
#> 2     Satisfied      M     1      very         6
#> 3     Satisfied      M     1 satisfied         6
#> 4     Satisfied      M     1      with         6
#> 5     Satisfied      M     1       the         6
#> 6     Satisfied      M     1  services         6
#> 7     Satisfied      M     2      your        13
#> 8     Satisfied      M     2   service        13
#> 9     Satisfied      M     2 providers        13
#> 10    Satisfied      M     2       are        13
#> # ... with 27 more rows
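
If you do want the nested list column from the question, the following is a minimal sketch rather than a definitive recipe. It assumes tidyr 1.0.0 or later (for the `nest(words = word)` syntax) and uses a shortened two-row version of the example data; the names `nested_df`, `words`, and `num_words` are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Shortened version of the example data (rows 1 and 5 of the original)
df <- tibble(
  Satisfaction = c("Satisfied", "Dissatisfied"),
  Text = c("I'm very satisfied with the services", "Service is bad"),
  Gender = c("M", "F")
)

nested_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text) %>%
  # Only `word` is packed into the list column; the other columns stay as they are
  nest(words = word) %>%
  # Each element of `words` is a one-column tibble, so nrow() is its word count
  mutate(num_words = map_int(words, nrow))

nested_df
```

No group_by is needed here: nest() keeps one row per combination of the remaining columns, which is one row per respondent because id is unique.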

You can do sentiment analysis by implementing an inner join against a sentiment lexicon; check out some examples here.
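
As a rough sketch of what such a join can look like, the example below scores two of the responses against a tiny hand-made lexicon. The `toy_lexicon` table is hypothetical and exists only so the example is self-contained; in practice you would likely swap in a real lexicon, for instance the one returned by tidytext's `get_sentiments("bing")`.

```r
library(dplyr)
library(tidytext)

df <- tibble(
  id = 1:2,
  Text = c("Everything is great!", "Service is bad")
)

# Hypothetical toy lexicon, for illustration only
toy_lexicon <- tibble(
  word = c("great", "satisfied", "bad", "late"),
  sentiment = c("positive", "positive", "negative", "negative")
)

sentiment_counts <- df %>%
  unnest_tokens(word, Text) %>%
  # Keep only the tokens that appear in the lexicon
  inner_join(toy_lexicon, by = "word") %>%
  # Tally positive and negative words per respondent
  count(id, sentiment)

sentiment_counts
```

Because inner_join drops every word that is missing from the lexicon, a response with no matching words disappears from the result entirely; a left join back onto the ids would be needed to keep those rows.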

【Comments】:

  • Thanks for the help and the example!