Scraping Amazon customer reviews with missing data: answers

【Question title】: Scraping Amazon customer reviews with missing data
【Posted】: 2018-07-21 23:32:17
【Question description】:

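I want to scrape Amazon customer reviews. My code works fine when no information is "missing", but when part of the data is missing, converting the scraped data into a data frame fails (arguments imply differing number of rows).

Here is some sample code: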

library(rvest) 

url <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent")

get_reviews <- function(url) {

  title <- url %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()

  author <- url %>%
    html_nodes(".author") %>%
    html_text()

  df <- data.frame(title, author, stringsAsFactors = F)

  return(df)
} 

results <- get_reviews(url)

In this case, "missing" means that no author information is provided for several customer reviews ("Ein Kunde" simply means "A customer" in German).
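The failure can be reproduced without touching the network. The sketch below uses made-up values (hypothetical, not scraped from the page) to show what happens when the page-wide `.author` selector returns fewer matches than the title selector: the two vectors have different lengths, so `data.frame()` refuses to combine them.

```r
# Hypothetical stand-ins for the scraped text; no network access needed.
title  <- c("Review A", "Review B", "Review C")
author <- c("Alice", "Bob")  # one review has no .author element

# data.frame() cannot line up vectors of length 3 and 2, so it errors.
result <- tryCatch(
  data.frame(title, author, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)

print(result)
# In an English locale: "arguments imply differing number of rows: 3, 2"
```

Note that even when the lengths happen to be compatible, page-wide selectors give no guarantee that the i-th title and the i-th author belong to the same review.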

Does anyone know how to solve this? Any help is appreciated. Thanks in advance!

【Question comments】:

Tags: r web-scraping amazon rvest

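【Solution 1】:

I would say this is the answer to your question (link).

Loop over each 'div[id*=customer_review]' node, then check within that node whether the author has a value.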
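As a rough illustration of that idea, here is a minimal offline sketch. The inline HTML and the `.review-title` class are hypothetical stand-ins for the real page; only the `div[id*=customer_review]` and `.author` selectors come from the thread. Each review node is queried on its own, and an empty match is turned into `NA`.

```r
library(rvest)

# Hypothetical markup imitating a review list; the second review has no author.
page <- read_html('
<div id="cm_cr-review_list">
  <div id="customer_review-1"><a class="review-title">Great</a><a class="author">Anna</a></div>
  <div id="customer_review-2"><a class="review-title">Okay</a></div>
</div>')

# One node per review.
reviews <- html_nodes(page, "div[id*=customer_review]")

# Look for the author inside each review separately; no match becomes NA.
authors <- vapply(reviews, function(node) {
  found <- html_text(html_nodes(node, ".author"))
  if (length(found) == 0) NA_character_ else found[1]
}, character(1))

print(authors)
```

Because the result has exactly one entry per review node, it can be combined safely with other per-review fields.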

【Comments】:

Thanks for the link, the code finally works as expected. :-)

【Solution 2】:

Adapting an approach from the link provided by Nardack, I was able to scrape the data with the following code:

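library(dplyr)
library(rvest)

# Extract title and author from a single review node.
# A field with no matching element becomes NA, so every review yields one row.
get_reviews <- function(node) {

  r.title <- html_nodes(node, ".a-color-base") %>%
    html_text()

  r.author <- html_nodes(node, ".author") %>%
    html_text()

  # Keep the first match per field, or NA when nothing matched.
  df <- data.frame(
    title = if (length(r.title) == 0) NA_character_ else r.title[1],
    author = if (length(r.author) == 0) NA_character_ else r.author[1],
    stringsAsFactors = FALSE)

  return(df)
}

# Select one node per review, build one row per node, then stack the rows.
review_nodes <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent") %>%
  html_nodes("div[id*=customer_review]")
out <- lapply(review_nodes, get_reviews) %>% bind_rows()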
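
A possible alternative, not taken from the answer above: rvest's singular `html_node()` returns exactly one result per input node and yields a missing node when nothing matches, which `html_text()` turns into `NA`. Under that assumption the per-review loop can be dropped. The inline HTML and the `.review-title` class below are hypothetical; this is a sketch rather than a tested replacement for the live page.

```r
library(rvest)

# Hypothetical markup; the second review has no author element.
page <- read_html('
<div id="cm_cr-review_list">
  <div id="customer_review-1"><a class="review-title">Great</a><a class="author">Anna</a></div>
  <div id="customer_review-2"><a class="review-title">Okay</a></div>
</div>')

reviews <- html_nodes(page, "div[id*=customer_review]")

# html_node() (singular) keeps one slot per review, so both columns
# always have the same length and stay aligned with their review.
out <- data.frame(
  title  = html_text(html_node(reviews, ".review-title")),
  author = html_text(html_node(reviews, ".author")),
  stringsAsFactors = FALSE
)

print(out)
```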

【Comments】: