Could not read a webpage with read_html using the rvest package in R

【Question title】: Could not read webpage with read_html using rvest package from R
【Posted】: 2023-03-15 15:30:01
【Question】:

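I am trying to get the location of product reviewers from Amazon. Take this page, for example:

https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8

I need to get HAINESVILLE, ILLINOIS, United States.

I am using the rvest package for web scraping.

Here is what I did: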

library(rvest)       
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
page = read_html(url)

I get the following error:

Error in open.connection(x, "rb") : HTTP error 403.

However, the following works:

con <- url(url, "rb")
page = read_html(con)

However, I cannot extract any text from the page I read this way. For example, I want to extract the reviewer's location.

page %>%
    html_nodes("#customer-profile-name-header .a-size-base a-color-base")%>%
    html_text()

I get nothing:

character(0)

Can anyone help me figure out what I am doing wrong? Thanks a lot in advance.

【Comments】:

How did you choose "#customer-profile-name-header .a-size-base a-color-base"? Did you use SelectorGadget? This might help: queryxchange.com/q/27_51801321/…
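A side note on the selector questioned above: in CSS, a space means "descendant", so `.a-size-base a-color-base` looks for an element named `<a-color-base>` inside something with class `a-size-base`. Two classes on the same element are chained with dots. Here is a minimal sketch on made-up markup (the HTML below is hypothetical and is not Amazon's real page, which may not contain the location as a plain element at all):

```r
library(rvest)

# Hypothetical markup, invented for illustration only
doc <- read_html('<div id="customer-profile-name-header"><span class="a-size-base a-color-base">HAINESVILLE, ILLINOIS</span></div>')

# Space-separated: searches for an <a-color-base> element, so nothing matches
wrong <- doc %>%
  html_nodes("#customer-profile-name-header .a-size-base a-color-base") %>%
  html_text()

# Dot-chained: one element carrying both classes
right <- doc %>%
  html_nodes("#customer-profile-name-header .a-size-base.a-color-base") %>%
  html_text()

wrong  # character(0)
right  # "HAINESVILLE, ILLINOIS"
```

Fixing the selector alone may not be enough on the real page, since the text has to be present in the downloaded HTML for any selector to find it.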

Tags: r web-scraping rvest

【Solution 1】:

This should work:

library(dplyr)
library(rvest)
library(stringr)

# get url
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'

# open page
con <- url(url, "rb")
page = read_html(con)

# get the desired information, using View Page Source
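# the location is embedded as JSON-like text inside a <script> tag
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'occupation')]") %>%
  html_text() %>%
  str_match("location\":\"(.*?)\",\"personalDescription")

# column 2 holds the first capture group, i.e. the location
res[, 2]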
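To see what the regular expression in the last step does without contacting Amazon, here is a small sketch run on a made-up string. The JSON-like text in `script_text` is an assumption about the rough shape of the script content, not a copy of the real page:

```r
library(stringr)

# Hypothetical script content, invented for illustration only
script_text <- '{"occupation":"","location":"HAINESVILLE, ILLINOIS, United States","personalDescription":""}'

# Lazy capture of everything between location":" and ","personalDescription
demo <- str_match(script_text, "location\":\"(.*?)\",\"personalDescription")

# Column 1 is the full match, column 2 is the captured group
demo[, 2]  # "HAINESVILLE, ILLINOIS, United States"
```

The pattern relies on `personalDescription` following `location` directly in the text, so it would return NA if the page changed the order of those fields.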

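【Comments】:

Have you checked this code? For example, why are the package names wrapped in double quotes in the library calls? And is a library call missing for the function as_character? When I run your code I get this error: Error in as_character(.) : could not find function "as_character"
@captcoma Your code works after replacing as_character with as.character(.). Thank you very much for your answer! I have two questions: 1. After using View Source, how do I quickly find the element I need? There is a lot of text. 2. Why doesn't read_html work directly on the URL?
Thanks for the hint, I corrected my answer. 1: I use the browser's search function (Ctrl+F) and look for the desired information. 2: Probably because of the NULL in the user agent, see here: stackoverflow.com/questions/35690914/…
@captcoma Would you mind taking a look at the post where I asked a similar question? That would be very helpful. Thanks a lot.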
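
On the second question in the comments above (why read_html fails directly on the URL): if the missing user agent really is the cause, one possible workaround is to send an explicit one with the httr package and parse the response body. This is only a sketch under that assumption. The helper name `fetch_page` and the user agent string are made up for illustration, and the site may still refuse the request for other reasons:

```r
library(httr)
library(rvest)

# Hypothetical helper: download a page with an explicit User-Agent header
fetch_page <- function(url, ua = "Mozilla/5.0 (compatible; example-scraper)") {
  resp <- GET(url, user_agent(ua))
  stop_for_status(resp)  # turn HTTP errors such as 403 into an R error
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (not run here, needs network access):
# page <- fetch_page("https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8")
```

The returned object can be passed to html_nodes() in the same way as the page read through url(url, "rb").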