【Question title】: Error reading text files (containing HTML tags) and appending them as new rows of a data frame
【Posted】: 2018-03-04 21:48:43
【Question】:

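I am trying to read all the text files in a folder. What I want to do is:

  1. Read the content of a specific HTML tag, "TEXT", from each text file
  2. Store it in a data frame under the column name "MyText"
  3. Append one new row for each subsequent text file, read the same way

My code is: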

library(dplyr); library(readr); library(rvest); library(data.table); 

# List all the text files in the folder
files = list.files(pattern="*.txt")

# read from file and append to rows
tbl = lapply(files, read_html %>% html_nodes("text") %>%  html_text() ) %>% bind_rows()

This gives me an error:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "function"

Can someone help me see where I went wrong?

【Comments】:

  • Try tbl = lapply(files, function(x) read_html(x) %>% html_nodes("text") %>% html_text() ) %>% bind_rows()
  • @AndrewGustar Thanks for helping me, but now I get this error: Error in bind_rows_(x, .id) : Argument 1 must have names

Tags: r dplyr data.table rvest readr


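【Solution 1】:

The core of the problem is that read_html %>% html_nodes("text") %>% html_text() does not evaluate to a function. The pipe is evaluated immediately and passes the function object read_html itself to html_nodes() as data, which is why the error mentions an object of class "function". You can build a magrittr lambda by starting the pipeline with a dot, e.g. . %>% read_html %>% html_nodes("text") %>% html_text()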
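
To make the dot-first form concrete, here is a minimal sketch. The name extract_text and the two in-memory HTML strings are made up for illustration; the strings stand in for the files so the snippet can run without a folder of data.

```r
library(rvest)  # re-exports the %>% pipe from magrittr

# Starting the pipeline with a dot builds a one-argument function
# instead of evaluating the pipeline immediately.
extract_text <- . %>% read_html() %>% html_nodes("text") %>% html_text()

# Hypothetical stand-ins for file contents; read_html() also accepts literal markup
docs <- c("<doc><text>first body</text></doc>",
          "<doc><text>second body</text></doc>")

# extract_text is an ordinary function, so lapply() can call it once per element
result <- lapply(docs, extract_text)
```

With real data, the same extract_text could be passed to lapply(files, extract_text), since read_html() also accepts a file path.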

Even with that fixed, html_text() returns a character vector, not a data frame that you could pass to bind_rows.

Instead of lapply/bind_rows, you can use purrr::map_df():
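
As a rough sketch of what that means in practice: if you want to stay with lapply and bind_rows, each character vector can be wrapped in a one-column data frame first. The texts list below is made-up sample data standing in for the output of html_text(), and the column name MyText comes from the question.

```r
library(dplyr)

# Made-up stand-in for what lapply(...) with html_text() returns:
# a list of unnamed character vectors
texts <- list("first body", "second body")

# bind_rows() needs named columns, so wrap each vector in a data frame
rows <- lapply(texts, function(x) {
  data.frame(MyText = x, stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single data frame
tbl <- bind_rows(rows)
```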

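library(purrr)
library(rvest)
library(tibble)  # tibble() is not attached by purrr or rvest
map_df( files, ~ {
  file   <- .x
  # character vector with the text of every <text> node in this file
  MyText <- read_html(file) %>%
    html_nodes("text") %>%
    html_text() 
  tibble( file, MyText )
} )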
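
For anyone on a newer purrr, where map_df() is marked as superseded, the same idea can be written with map() followed by list_rbind(). This is a sketch that assumes purrr 1.0.0 or later; the temporary folder, the two sample files, and the helper name read_one are invented here so the snippet can run on its own.

```r
library(purrr)   # assumes purrr >= 1.0.0 for list_rbind()
library(rvest)
library(tibble)

# Two throwaway files standing in for the real folder (made-up content)
folder <- tempfile("texts")
dir.create(folder)
writeLines("<DOC><TEXT>first body</TEXT></DOC>", file.path(folder, "a.txt"))
writeLines("<DOC><TEXT>second body</TEXT></DOC>", file.path(folder, "b.txt"))

# list.files() takes a regular expression, so the extension is escaped and anchored
files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: one tibble per file, one row per <text> node
read_one <- function(file) {
  MyText <- read_html(file) %>%
    html_nodes("text") %>%
    html_text()
  tibble(file = basename(file), MyText = MyText)
}

# Build the list of tibbles, then bind them row-wise in one step
tbl <- map(files, read_one) %>% list_rbind()
```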

【Comments】:

    【Solution 2】:

    Here is my solution. I tested it on my laptop and it works:

    # ________________ THE STEPS BELOW READ THE DATA SETS AND CREATE A DATAFRAME _______________________ 
    
    # rvest provides read_html(), html_nodes(), html_text() and the %>% pipe
    library(rvest)
    
    # set default folder first
    setwd("drive/you/folder/location")    
    
    # read text files from the folders 
    files <-list.files()
    
    # create an empty dataframe
    data <- data.frame()
    
    # read files one by one and create dataframe
    for (f in files) {
    
      # read as HTML
      dat <- read_html(f)
    
      # from data extract everything within <TEXT> and </TEXT> tags
      dat2 <- data.frame(Text = dat %>% html_nodes("text") %>%  html_text() , stringsAsFactors = FALSE)
    
      # split the first extracted text on " | " into separate fields
      dat3 <- data.frame(Text = strsplit(dat2$Text, " \\| ")[[1]], stringsAsFactors = FALSE)
    
      # split the third field on ":" into the header part and the body text
      dat4 <- data.frame(Text = strsplit(dat3$Text[[3]], ":")[[1]], stringsAsFactors = FALSE)
    
      # merge all the columns and rows after some basic text cleaning/processing
      NewsData <- data.frame(News_Paper = trimws(dat3$Text[1], which = "both"),
                             News_Class = trimws(dat3$Text[2], which = "both"),
                             Author_Location_Date = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[1], which = "both")),
                             Text = gsub("\r?\n|\r|\t|\\s+", " ", trimws(dat4$Text[2], which = "both"))
      ) 
    
      # merge all the rows from remaining text files in the folder, one by one
      data <- rbind.data.frame(data, NewsData, make.row.names = F, stringsAsFactors = F)
    
    } 
    
     # remove the unwanted dataframes
     rm(list=c("dat2", "dat3", "dat4"))
    
    
    # ________________ END OF THE ABOVE STEPS ___________________________________________________ 
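
A possible variation on the loop above, offered as a sketch and not as part of the original answer: calling rbind inside the loop copies the growing data frame on every iteration, so with many files it can be cheaper to collect the per-file rows in a list and bind them once at the end. The function parse_one below is a made-up placeholder for the per-file parsing step, not the real parsing code.

```r
# Made-up placeholder for the per-file parsing done inside the loop
parse_one <- function(i) {
  data.frame(News_Paper = paste("Paper", i),
             Text = paste("body", i),
             stringsAsFactors = FALSE)
}

# Collect one single-row data frame per input in a list
rows <- lapply(1:3, parse_one)

# Bind all rows in a single call instead of growing the data frame step by step
data <- do.call(rbind, rows)
```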
    

    Hope this helps.

    【Comments】:

    • Perfect. This is exactly what I wanted.