
【Question title】: Error reading text files (containing HTML tags) and appending them as new rows of a data frame
【Posted】: 2018-03-04 21:48:43
【Question】:

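I am trying to read all the text files in a folder. For each file, I want to:

read the contents of the HTML tag "TEXT" from the text file,
store it in a data frame under a column named "MyText",
append the text from the next file as the next row (same steps as above).

My code is: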

library(dplyr); library(readr); library(rvest); library(data.table); 

# List all the text files in the folder
files = list.files(pattern="*.txt")

# read from file and append to rows
tbl = lapply(files, read_html %>% html_nodes("text") %>%  html_text() ) %>% bind_rows()

This gives me an error:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "function"

Can someone help me see where I went wrong?

【Comments】:

Try tbl = lapply(files, function(x) read_html(x) %>% html_nodes("text") %>% html_text() ) %>% bind_rows()
@AndrewGustar Thanks for helping, but now I get this error: Error in bind_rows_(x, .id) : Argument 1 must have names

Tags: r dplyr data.table rvest readr

【Solution 1】:

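The core problem is that read_html %>% html_nodes("text") %>% html_text() is not a function. The pipeline is evaluated immediately, so the function object read_html itself is passed to html_nodes(), which is why the error mentions an object of class "function". You can turn the pipeline into a function by starting it with a dot, which creates a magrittr lambda: . %>% read_html %>% html_nodes("text") %>% html_text().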
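
To illustrate the dot-first pipeline, here is a minimal sketch. The sample file and its contents are hypothetical and are written to a temporary directory only so that the snippet is self-contained; extract_text is an illustrative name, not something from the question.

```r
library(rvest)  # also makes the %>% pipe available

# Hypothetical sample file, created only for this illustration
path <- file.path(tempdir(), "sample1.txt")
writeLines("<DOC><TEXT>first document</TEXT></DOC>", path)

# A pipeline that starts with a dot is not evaluated; it becomes a
# function of one argument (a magrittr functional sequence)
extract_text <- . %>% read_html() %>% html_nodes("text") %>% html_text()

is.function(extract_text)

# Because extract_text is a function, lapply() can call it on each path
texts <- lapply(path, extract_text)
```

The result is still a list of character vectors, so it is suitable for lapply() but not yet for bind_rows().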

Even then, html_text() returns a character vector, not a data frame, so the result cannot be passed to bind_rows().
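
One way to stay with lapply()/bind_rows() is to wrap each character vector in a one-column data frame before binding. The sketch below assumes two hypothetical sample files written to a temporary folder; the column name MyText comes from the question.

```r
library(rvest)
library(dplyr)

# Hypothetical sample files, created only for this illustration
sample_dir <- tempfile("texts")
dir.create(sample_dir)
writeLines("<DOC><TEXT>first document</TEXT></DOC>", file.path(sample_dir, "a.txt"))
writeLines("<DOC><TEXT>second document</TEXT></DOC>", file.path(sample_dir, "b.txt"))
files <- list.files(sample_dir, pattern = "\\.txt$", full.names = TRUE)

# Return a data frame (not a bare character vector) for every file
rows <- lapply(files, function(x) {
  txt <- read_html(x) %>% html_nodes("text") %>% html_text()
  data.frame(MyText = txt, stringsAsFactors = FALSE)
})

# bind_rows() accepts a list of data frames and stacks them
tbl <- bind_rows(rows)
```

Each list element now has a named column, which is what bind_rows() was complaining about in the comment thread.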

Instead of lapply()/bind_rows(), you can use purrr::map_df():

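library(purrr)
library(rvest)
library(tibble)  # tibble() comes from the tibble package (dplyr re-exports it too)

# Build one tibble per file; map_df() row-binds them into a single table
map_df(files, ~ {
  file   <- .x
  MyText <- read_html(file) %>%
    html_nodes("text") %>%
    html_text()
  tibble(file, MyText)
})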
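
In purrr 1.0.0 and later, map_df() is marked as superseded in favour of map() followed by list_rbind(). A rough equivalent, assuming that purrr version is installed, might look like the sketch below. The sample files and the helper name read_one are hypothetical.

```r
library(purrr)   # list_rbind() needs purrr >= 1.0.0
library(rvest)
library(tibble)

# Hypothetical sample files, created only for this illustration
sample_dir <- tempfile("texts")
dir.create(sample_dir)
writeLines("<DOC><TEXT>first document</TEXT></DOC>", file.path(sample_dir, "a.txt"))
writeLines("<DOC><TEXT>second document</TEXT></DOC>", file.path(sample_dir, "b.txt"))
files <- list.files(sample_dir, pattern = "\\.txt$", full.names = TRUE)

# Read one file and return a tibble with the file name and its text
read_one <- function(file) {
  MyText <- read_html(file) %>%
    html_nodes("text") %>%
    html_text()
  tibble(file = basename(file), MyText = MyText)
}

# map() returns a list of tibbles; list_rbind() stacks them by row
tbl <- files %>% map(read_one) %>% list_rbind()
```

A file containing several TEXT nodes contributes several rows, because tibble() recycles the single file name across all extracted strings.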

【Comments】:

【Solution 2】:

Here is my solution. I checked it on my laptop and it works:

# ________________ BELOW STEPS READS THE DATA SETS AND CREATES A DATAFRAME _______________________ 

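# load rvest for read_html(), html_nodes(), html_text() and the %>% pipe
library(rvest)

# set the working directory to the folder that holds the text files
setwd("drive/you/folder/location")    

# list the files in that folder
files <- list.files()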

# create an empty dataframe
data <- data.frame()

# read files one by one and create dataframe
for (f in files) {

  # read as HTML
  dat <- read_html(f)

  # extract everything between the <TEXT> and </TEXT> tags
  dat2 <- data.frame(Text = dat %>% html_nodes("text") %>% html_text(), stringsAsFactors = FALSE)

  # split the extracted text on " | " into separate fields (one per row)
  dat3 <- data.frame(Text = strsplit(dat2$Text, " \\| ")[[1]], stringsAsFactors = FALSE)

  # split the third field on ":" into the byline and the article body
  dat4 <- data.frame(Text = strsplit(dat3$Text[[3]], ":")[[1]], stringsAsFactors = FALSE)

  # merge all the columns and rows after some basic text cleaning/processing
  NewsData <- data.frame(News_Paper = trimws(dat3$Text[1], which = "both"),
                         News_Class = trimws(dat3$Text[2], which = "both"),
                         Author_Location_Date = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[1], which = "both")),
                         Text = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[2], which = "both"))
  ) 

  # append this file's row to the rows collected from the previous files
  data <- rbind.data.frame(data, NewsData, make.row.names = FALSE, stringsAsFactors = FALSE)

} 

# remove the intermediate data frames
rm(list = c("dat2", "dat3", "dat4"))


# ________________ END OF THE ABOVE STEPS ___________________________________________________
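
The splitting logic in the loop can be tried in isolation on a single string. The sketch below uses base R only. The sample string is hypothetical: the thread never shows the real file contents, so the layout "paper | class | byline: body" is inferred from the code above, and clean() is an illustrative helper.

```r
# Hypothetical content of one <TEXT> node; the real layout is an assumption
raw <- "The Daily Example | Sports | Jane Doe, Springfield, 2018-03-01: Local team\nwins the final."

# Split on " | " into newspaper, class, and the remainder
parts <- strsplit(raw, " \\| ")[[1]]

# Split the remainder on ":" into the byline and the article body
fields <- strsplit(parts[3], ":")[[1]]

# Trim the ends and collapse runs of whitespace (including line breaks)
clean <- function(x) gsub("\\s+", " ", trimws(x))

row <- data.frame(
  News_Paper           = clean(parts[1]),
  News_Class           = clean(parts[2]),
  Author_Location_Date = clean(fields[1]),
  Text                 = clean(fields[2]),
  stringsAsFactors     = FALSE
)
```

Note that splitting on ":" keeps only the piece between the first and second colon as the body, so an article whose text itself contains a colon would be cut short by this approach.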

Hope this helps.

【Comments】:

Perfect. This is exactly what I wanted.