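You can do this using with(). The key is to understand how the pipe works and how word_tokenizer() works.
The pipe takes the output of whatever is on its left-hand side (LHS) and passes it as the first argument (by default, though it can be any argument) to the function on its right-hand side (RHS). word_tokenizer() expects a character vector of strings as its argument.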
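The pipe behaviour described here can be illustrated with a minimal sketch. It assumes the magrittr package is installed (that is where %>% comes from); the variable names are made up for illustration only.

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)

# Default: the LHS becomes the first argument of the RHS call,
# so these two lines compute the same thing.
piped  <- x %>% sqrt()
direct <- sqrt(x)

# With the "." placeholder the LHS can be sent to any argument instead.
# Here 2 is used as the `by` step rather than as the first argument.
stepped <- 2 %>% seq(from = 1, to = 9, by = .)
```

Comparing piped with direct using identical() is a quick way to convince yourself that the pipe is only a different way of writing an ordinary function call.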
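You have a data frame on the LHS of the pipe, so on the RHS you need a function that accepts a data frame as its argument and can pass a column of that data frame on to another function; in this case, passing the strings in the text column to word_tokenizer(). with() does exactly that.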
text <- c("Because I could not stop for Death I add additional text-",
"He kindly stopped for me some additional text to act as a filler -",
"The Carriage held but just Ourselves more additional text to add to the body of the text-",
"and Immortality plus some more words to fill the text a little")
ID <- c(1,2,3,4)
output <- c(1,0,0,1)
# Build the data frame directly; cbind() would coerce every column to character
df <- data.frame(ID, text, output, stringsAsFactors = FALSE)
library(text2vec)
library(magrittr)  # provides the %>% pipe
df %>%
with(word_tokenizer(text))
# [[1]]
# [1] "Because" "I" "could" "not" "stop"
# [6] "for" "Death" "I" "add" "additional"
# [11] "text"
#
# [[2]]
# [1] "He" "kindly" "stopped" "for" "me"
# [6] "some" "additional" "text" "to" "act"
# [11] "as" "a" "filler"
#
# [[3]]
# [1] "The" "Carriage" "held" "but" "just"
# [6] "Ourselves" "more" "additional" "text" "to"
# [11] "add" "to" "the" "body" "of"
# [16] "the" "text"
#
# [[4]]
# [1] "and" "Immortality" "plus" "some"
# [5] "more" "words" "to" "fill"
# [9] "the" "text" "a" "little"
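To see what with() is doing in that pipeline, here is a small sketch that needs only magrittr. The toy data frame is hypothetical, and base R's strsplit() stands in for a tokenizer so the example does not depend on text2vec.

```r
library(magrittr)

# Hypothetical toy data frame, used only for illustration.
toy <- data.frame(id = 1:2,
                  text = c("a b c", "d e"),
                  stringsAsFactors = FALSE)

# with() evaluates the expression using the data frame's columns as
# variables, so all three forms hand the same character vector to strsplit().
via_with   <- toy %>% with(strsplit(text, " "))
via_dollar <- strsplit(toy$text, " ")
via_expose <- toy %$% strsplit(text, " ")  # magrittr's exposition pipe
```

Which form to use is mostly a matter of taste: with() keeps the data frame on the LHS of the pipe, while toy$text %>% strsplit(" ") pipes the column itself.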
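You also asked how to pipe the output of word_tokenizer() into itoken(), and that output into create_vocabulary(). Again, the key is to understand what the function on the LHS returns and what the function on the RHS expects. word_tokenizer() returns a list and itoken() expects an iterable; a list is iterable, so just pipe the output of word_tokenizer() straight into itoken(). In your comment you tried to use with() again, as if the output of word_tokenizer() were a data frame. The way I worked this out was by looking at the help pages of the functions you are using; they showed me what types of arguments the functions expect. If you don't know what type a function returns, you can consult its help page or pipe its output to class().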
library(text2vec)
df %>%
with(word_tokenizer(text)) %>%
itoken() %>%
create_vocabulary()
# |===============================================================| 100%
# Number of docs: 4
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 1
# Vocabulary:
# term term_count doc_count
# 1: Because 1 1
# 2: stop 1 1
# 3: just 1 1
# 4: not 1 1
# 5: Immortality 1 1
# 6: little 1 1
# 7: filler 1 1
# 8: kindly 1 1
# 9: of 1 1
# 10: and 1 1
# 11: plus 1 1
# 12: fill 1 1
# 13: could 1 1
# 14: me 1 1
# 15: Carriage 1 1
# 16: but 1 1
# 17: body 1 1
# 18: stopped 1 1
# 19: as 1 1
# 20: He 1 1
# 21: act 1 1
# 22: The 1 1
# 23: Death 1 1
# 24: words 1 1
# 25: held 1 1
# 26: Ourselves 1 1
# 27: some 2 2
# 28: more 2 2
# 29: I 2 1
# 30: a 2 2
# 31: add 2 2
# 32: for 2 2
# 33: the 3 2
# 34: additional 3 3
# 35: to 4 3
# 36: text 5 4
# term term_count doc_count
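As a rough sketch of the "pipe it to class()" check, again with a hypothetical toy data frame and strsplit() standing in for a tokenizer (an assumption made so the example runs without text2vec):

```r
library(magrittr)

toy <- data.frame(text = c("a b c", "d e"), stringsAsFactors = FALSE)

# strsplit() returns one character vector per input string, wrapped in a
# list, which is the same shape as the tokenizer output printed earlier.
tokens_class <- toy %>% with(strsplit(text, " ")) %>% class()

# The data frame itself reports a different class, which is why a second
# with() after the tokenizing step has no columns to look up.
df_class <- toy %>% class()
```

Running the same check on each step of your own pipeline shows where a data frame stops being a data frame.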