How to Combine Multiple Rows Into One Using TidyText

【Question title】: How to Combine Multiple Rows Into One Using TidyText
【Posted】: 2019-06-14 22:27:43
【Question】:

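I am analyzing a novel and want to find where the characters' names occur throughout the book. Some characters go by more than one name. For example, the character "Sissy Jupe" appears as both "Sissy" and "Jupe". I want to combine the two rows of word counts into one so that I can see a single count for "Sissy Jupe".

I have tried sum, rbind, merge, and other approaches found on message boards, but nothing seems to work. There are many good examples out there, but none of them work for my case.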

library(tidyverse) 
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only letters and apostrophes; drops characters such as _

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup %>%
  complete(chapter, word,
           fill = list(n = 0)) 

gradgrind <- filter(ht_count, word == "gradgrind")
bounderby <- filter(ht_count, word == "bounderby")
sissy <- filter(ht_count, word == "sissy")

## TEST
sissy_jupe <- ht_count %>% 
  filter(word %in% c("sissy", "jupe"))

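I want a single "word" entry named "sissy_jupe" that holds the count n per chapter. The result below is close, but it still has two rows per chapter.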

# A tibble: 76 x 3
   chapter word      n
     <int> <chr> <dbl>
 1       0 jupe      0
 2       0 sissy     1
 3       1 jupe      0
 4       1 sissy     0
 5       2 jupe      5
 6       2 sissy     9
 7       3 jupe      3
 8       3 sissy     1
 9       4 jupe      1
10       4 sissy     0
# … with 66 more rows

【Comments】:

This question is an "MCVE" (minimal, complete, and verifiable example). Well done!

Tags: r dplyr tidytext

【Solution 1】:

The code below should give you the desired output.

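library(tidyverse)

# df: the filtered tibble from the question (assumed to be sissy_jupe)
df %>%
  group_by(chapter) %>%
  mutate(n = sum(n),                              # add both names' counts per chapter
         word = paste(word, collapse = "_")) %>%  # joins names in row order, e.g. "jupe_sissy"
  distinct(chapter, .keep_all = TRUE)             # keep one row per chapter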
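
As an illustration only (this is not part of the original answer), the same per-chapter total can also be computed with summarise(), which returns one row per chapter directly and lets the combined label be set explicitly. The toy tibble `counts` is hypothetical; its values are copied from the chapter 0 and chapter 2 rows of the question's output, and it merely mimics the shape of the sissy_jupe tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with the same columns as the question's sissy_jupe tibble
counts <- tribble(
  ~chapter, ~word,   ~n,
  0,        "jupe",  0,
  0,        "sissy", 1,
  2,        "jupe",  5,
  2,        "sissy", 9
)

combined <- counts %>%
  group_by(chapter) %>%
  summarise(n = sum(n), .groups = "drop") %>%  # one row per chapter, counts added
  mutate(word = "sissy_jupe") %>%              # explicit combined label
  select(chapter, word, n)

print(combined)
```

With the real data, `counts` would be replaced by the filtered tibble from the question. Setting the label by hand avoids depending on the row order that paste(word, collapse = "_") relies on.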

【Comments】:

Thank you so much, Theo! It works. I really appreciate you helping a newbie. It means a lot.

【Solution 2】:

Welcome to Stack Overflow, Tom. Here is an idea:

Basically, (1) find "sissy" or "jupe" in the tidy tibble and replace it with "sissy_jupe", (2) create ht_count as you did, and (3) print the result:

library(tidyverse) 
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only letters and apostrophes; drops characters such as _

# NEW CODE START
tidy_ht <- tidy_ht %>%
  mutate(word = str_replace_all(word, "sissy|jupe", replacement = "sissy_jupe"))
# END NEW CODE

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup %>%
  complete(chapter, word,
           fill = list(n = 0))

# NEW CODE
sissy_jupe <- ht_count %>% 
  filter(str_detect(word, "sissy_jupe"))
# END

...produces...

# A tibble: 38 x 3
   chapter word           n
     <int> <chr>      <dbl>
 1       0 sissy_jupe     1
 2       1 sissy_jupe     0
 3       2 sissy_jupe    14
 4       3 sissy_jupe     4
 5       4 sissy_jupe     1
 6       5 sissy_jupe     5
 7       6 sissy_jupe    20
 8       7 sissy_jupe     7
 9       8 sissy_jupe     2
10       9 sissy_jupe    38
# ... with 28 more rows
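
One caveat, offered as a hedged sketch rather than the answerer's code: str_replace_all() with the pattern "sissy|jupe" rewrites any token that merely contains one of the names, so a token such as "sissy's" would become "sissy_jupe's" and be counted as a separate word. On other texts an exact-match recode may be safer. The `aliases` lookup vector and the toy `tokens` tibble below are hypothetical stand-ins for tidy_ht.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tokens standing in for tidy_ht
tokens <- tibble(
  chapter = c(1, 1, 1, 2, 2),
  word    = c("sissy", "jupe", "sissy's", "gradgrind", "jupe")
)

# Hypothetical lookup table: alias -> canonical character name
aliases <- c(sissy = "sissy_jupe", jupe = "sissy_jupe")

recoded <- tokens %>%
  mutate(word = if_else(word %in% names(aliases),
                        unname(aliases[word]),  # recode exact matches only
                        word))                  # leave every other token as-is

name_counts <- recoded %>%
  count(chapter, word) %>%
  filter(word == "sissy_jupe")

print(name_counts)
```

The lookup vector can be extended with further aliases for other characters, and the rest of the pipeline (count, complete, filter) stays the same.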

If any of our solutions helped you, please don't forget to upvote or click the check mark (feedback = better coders).
