Why can't I tokenize the text data: answers

[Question title]: Why can't I tokenize the text data
[Posted]: 2020-02-11 23:51:38
[Question description]:

I need to tokenize the text data with the code below, but it produces an error. How can I fix it? Thanks!

library(readr)
europeecondata <- read_csv("C:/Users/lin/Documents/europeecondata.csv")

european_text <- data_frame(line=1:273, text=europeecondata$text)


# One pattern is enough: "http[^[:space:]]*" also matches https URLs
european_text$text <- gsub("http[^[:space:]]*", "", european_text$text)


data(stop_words)
euro_tokens <- european_text$text %>%
   unnest_tokens(word, text) %>%
   anti_join(stop_words)%>%
   count(word, sort=T)

Output: Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "character"

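[Comments]:

See here for how to ask an R question that people can help with. That includes a sample of your data; right now we can't run any of your code without data, and we can't see what you are working with.

Tags: r text

[Solution 1]:

unnest_tokens expects tbl to be a data.frame. In the OP's code, the column is extracted with $ and passed as a character vector. Instead, it should be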

library(tidytext)
library(dplyr)
european_text %>%
    unnest_tokens(word, text)

According to ?unnest_tokens, the usage is

unnest_tokens(tbl, output, input, token = "words", format = c("text", "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL, ...)

where

tbl - A data frame
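
To see why the `$` extraction conflicts with the `tbl` argument, it can help to inspect the classes involved. The following is a minimal sketch; `d_demo` and its contents are hypothetical and not part of the original question.

```r
library(tibble)

# Hypothetical two-row tibble, used only for illustration
d_demo <- tibble(txt = c("first line", "second line"))

class(d_demo$txt)             # "character": $ returns a bare vector, not a tbl
is.data.frame(d_demo)         # TRUE: the tibble itself is a data frame
is.data.frame(d_demo["txt"])  # TRUE: single-bracket subsetting keeps the data frame
```

Because `unnest_tokens` dispatches on the class of its first argument, a character vector has no matching method, while the tibble does.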

Using a reproducible example

library(janeaustenr)
d <- tibble(txt = prideprejudice)
d$txt %>%
   unnest_tokens(word, txt)

Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "character"

Instead, if we do

d %>%
   unnest_tokens(word, txt)
# A tibble: 122,204 x 1
#   word     
#   <chr>    
# 1 pride    
# 2 and      
# 3 prejudice
# 4 by       
# 5 jane     
# 6 austen   
# 7 chapter  
# 8 1        
# 9 it       
#10 is       
# … with 122,194 more rows
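
Applied to the pipeline from the question, the same fix might look like the sketch below. The original CSV is not available, so a small hypothetical tibble stands in for `europeecondata$text`; the sample sentences and URLs are assumptions made only so the example runs.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the data read from europeecondata.csv
european_text <- tibble(
  line = 1:2,
  text = c("The economy grew, see http://example.com/report",
           "Growth in the economy slowed https://example.org")
)

# Strip URLs; this single pattern covers both http and https
european_text$text <- gsub("http[^[:space:]]*", "", european_text$text)

data(stop_words)
euro_tokens <- european_text %>%        # pass the whole tibble, not european_text$text
  unnest_tokens(word, text) %>%         # one row per lower-cased word
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  count(word, sort = TRUE)              # word frequencies, most frequent first

euro_tokens
```

Passing `by = "word"` to `anti_join` is optional; it only silences the message dplyr prints when it has to guess the join column.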

[Comments]: