word_tokenizer in a dplyr pipe function - output to list

【Question title】: word_tokenizer in dplyr pipe function - output to list
【Posted】: 2019-09-10 12:01:57
【Question】:

I am trying to use a dplyr pipe and apply word_tokenizer from the text2vec package.

Some data:

text <- c("Because I could not stop for Death I add additional text-",
          "He kindly stopped for me some additional text to act as a filler -",
          "The Carriage held but just Ourselves more additional text to add to the body of the text-",
          "and Immortality plus some more words to fill the text a little")

ID <- c(1,2,3,4)
output <- c(1,0,0,1)
# Build the data frame directly; cbind() would coerce every column to character
df <- data.frame(ID, text, output, stringsAsFactors = FALSE)



library(dplyr)     # provides %>% and mutate()
library(text2vec)  # provides word_tokenizer()

df %>%
  word_tokenizer(text)

This gives a warning, whereas

df %>%
  mutate(word_tokenizer(text))

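gives some output, but not in the list format I expect.

The correct way would be to call word_tokenizer(df$text). I am just wondering how to do this with the pipe, since I have some other processing steps before this part.

I would also like to finish the pipeline with itoken() and create_vocabulary().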

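【Comments】:

df %>% mutate(new_col = word_tokenizer(text)) does give you the same output as word_tokenizer(df$text)

Tags: r

【Solution 1】:

You can do this with with. The key is to understand how the pipe works and how word_tokenizer works.

The pipe takes the output of whatever is on its left-hand side (LHS) and passes it as the first argument (by default, though it can be any argument) to the function on its right-hand side (RHS). word_tokenizer expects a character vector of strings as its argument.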
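As a minimal sketch of that rule, the snippet below uses base R's strsplit as a stand-in for a tokenizer (an assumption made so the example runs without text2vec) and loads the pipe from magrittr, the package dplyr re-exports it from. The variable names are illustrative only.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

x <- c("a b", "c d e")

# x %>% f(y) is evaluated as f(x, y): the LHS becomes the first argument.
piped  <- x %>% strsplit(" ")
direct <- strsplit(x, " ")
identical(piped, direct)

# A dot placeholder sends the LHS to a different argument instead.
# Here the LHS " " is used as `split`, and x stays the first argument.
sep_piped <- " " %>% strsplit(x, split = .)
identical(sep_piped, direct)
```

By the same rule, df %>% word_tokenizer(text) is evaluated as word_tokenizer(df, text), so the tokenizer receives the whole data frame rather than the text column.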

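You have a data frame on the LHS of the pipe, so on the RHS you need a function that accepts a data frame as its argument and can pass a column of that data frame on to another function. In this case, that means passing the strings in the text column to word_tokenizer. with can do this.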

text <- c("Because I could not stop for Death I add additional text-",
          "He kindly stopped for me some additional text to act as a filler -",
          "The Carriage held but just Ourselves more additional text to add to the body of the text-",
          "and Immortality plus some more words to fill the text a little")

ID <- c(1,2,3,4)
output <- c(1,0,0,1)
# Build the data frame directly; cbind() would coerce every column to character
df <- data.frame(ID, text, output, stringsAsFactors = FALSE)

library(dplyr)     # provides %>%
library(text2vec)  # provides word_tokenizer()

df %>%
  with(word_tokenizer(text))

# [[1]]
# [1] "Because"    "I"          "could"      "not"        "stop"      
# [6] "for"        "Death"      "I"          "add"        "additional"
# [11] "text"      
# 
# [[2]]
# [1] "He"         "kindly"     "stopped"    "for"        "me"        
# [6] "some"       "additional" "text"       "to"         "act"       
# [11] "as"         "a"          "filler"    
# 
# [[3]]
# [1] "The"        "Carriage"   "held"       "but"        "just"      
# [6] "Ourselves"  "more"       "additional" "text"       "to"        
# [11] "add"        "to"         "the"        "body"       "of"        
# [16] "the"        "text"      
# 
# [[4]]
# [1] "and"         "Immortality" "plus"        "some"       
# [5] "more"        "words"       "to"          "fill"       
# [9] "the"         "text"        "a"           "little"
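The mechanism behind with can be checked on a small example. This is a sketch only: the data frame toy and its columns are made-up names, and base R's strsplit again stands in for word_tokenizer so that nothing beyond magrittr is needed.

```r
library(magrittr)  # provides %>%

# Hypothetical toy data, not the data from the question
toy <- data.frame(id = 1:2,
                  text = c("a b", "c d e"),
                  stringsAsFactors = FALSE)

# with(data, expr) evaluates expr with the columns of data visible as
# variables, so `text` here refers to toy$text.
via_with   <- toy %>% with(strsplit(text, " "))
via_dollar <- strsplit(toy$text, " ")
identical(via_with, via_dollar)

# An alternative with the same effect: braces plus the dot placeholder.
via_dot <- toy %>% { strsplit(.$text, " ") }
identical(via_dot, via_dollar)
```

Both forms hand only the text column to the inner function, which is what the call to word_tokenizer above relies on.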

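You also asked how to pipe the output of word_tokenizer into itoken, and the output of that into create_vocabulary. Again, the key is to understand what the function on the LHS returns and what the function on the RHS expects. word_tokenizer returns a list, and itoken expects an iterable; a list is iterable, so just pipe the output of word_tokenizer straight into itoken. In your comment you tried to use with again, as if the output of word_tokenizer were a data frame. I worked this out by looking at the help pages of the functions you are using, which show the argument types they expect. If you don't know what type a function returns, you can consult its help page or pipe its output into class.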

library(text2vec)

df %>%
  with(word_tokenizer(text)) %>%
  itoken() %>%
  create_vocabulary()

# |===============================================================| 100%
# Number of docs: 4 
# 0 stopwords:  ... 
# ngram_min = 1; ngram_max = 1 
# Vocabulary: 
#   term term_count doc_count
# 1:     Because          1         1
# 2:        stop          1         1
# 3:        just          1         1
# 4:         not          1         1
# 5: Immortality          1         1
# 6:      little          1         1
# 7:      filler          1         1
# 8:      kindly          1         1
# 9:          of          1         1
# 10:         and          1         1
# 11:        plus          1         1
# 12:        fill          1         1
# 13:       could          1         1
# 14:          me          1         1
# 15:    Carriage          1         1
# 16:         but          1         1
# 17:        body          1         1
# 18:     stopped          1         1
# 19:          as          1         1
# 20:          He          1         1
# 21:         act          1         1
# 22:         The          1         1
# 23:       Death          1         1
# 24:       words          1         1
# 25:        held          1         1
# 26:   Ourselves          1         1
# 27:        some          2         2
# 28:        more          2         2
# 29:           I          2         1
# 30:           a          2         2
# 31:         add          2         2
# 32:         for          2         2
# 33:         the          3         2
# 34:  additional          3         3
# 35:          to          4         3
# 36:        text          5         4
# term term_count doc_count
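To see what a step returns before piping it onward, its output can be sent to class, as suggested above. The sketch below assumes a made-up data frame toy and uses strsplit as a stand-in tokenizer; with the real pipeline the same check would be applied to the output of word_tokenizer.

```r
library(magrittr)  # provides %>%

# Hypothetical toy data, not the data from the question
toy <- data.frame(id = 1:2,
                  text = c("a b", "c d e"),
                  stringsAsFactors = FALSE)

# The input to the pipeline is a data frame...
toy %>% class()      # "data.frame"

# ...but the tokenizing step returns a plain list, so a further with()
# has no columns to look up; the list is piped on directly instead.
tokens <- toy %>% with(strsplit(text, " "))
tokens %>% class()   # "list"
```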

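【Comments】:

Thanks, this works. I am trying to go one step further with with(word_tokenizer(text)) %>% with(itoken(.)) %>% with(create_vocabulary()), but I get "could not find function create_vocabulary()".
@user8959427 I have updated my answer to complete the pipeline through create_vocabulary, as you requested. Would you edit your question to ask for that there rather than in the comments?
Thanks! Will do now.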