Scraping page for product's price returns same price for all items

【Question title】: Scraping page for product's price returns same price for all items
【Posted】: 2019-06-27 15:46:31
【Question】:

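I am scraping a web page for educational purposes.

I am extracting these values: producto (product), precio_antes (price_before), precio_actual (price_now), and marca (brand).

I get the right products, but:

precio_antes returns S/1,399.00 for all items, even though the prices differ.
precio_actual returns NA for all items.
marca returns "lg" for all items.

Expected output: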

| ecommerce | marca | producto                                 | precio_antes | precio_actual |
|-----------|-------|------------------------------------------|--------------|---------------|
| wong      | lg    | LG Smart TV 49" Full HD 49LK5400         | S/1,399.00   | S/1,299.00    |
| wong      | lg    | LG Smart TV 60" 4K UHD 60UK6200 ThinQ AI | S/2,599.00   | S/2,299.00    |

Current output:

| ecommerce | marca | producto                                 | precio_antes | precio_actual |
|-----------|-------|------------------------------------------|--------------|---------------|
| wong      | lg    | LG Smart TV 49" Full HD 49LK5400         | S/1,399.00   | NA            |
| wong      | lg    | LG Smart TV 60" 4K UHD 60UK6200 ThinQ AI | S/1,399.00   | NA            |

I am using RSelenium, and I think my CSS selector skills need to improve.

library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)



#start RSelenium


rD  <- rsDriver(port = 4570L, browser = "chrome", version = "latest", chromever = "75.0.3770.90",
                geckover = "latest", iedrver = NULL, phantomver = "2.1.1",
                verbose = TRUE, check = TRUE)



remDr <- rD[["client"]]

#navigate to your page
remDr$navigate("https://www.wong.pe/tecnologia/televisores/tv")

#scroll down 10 times, waiting for the page to load at each time
for(i in 1:10){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

#get the page html
page_source<-remDr$getPageSource()


product_info <- function(node){
  precio_antes <- html_nodes(node, 'span.product-prices__value') %>% html_text
  precio_actual <- html_nodes(node, 'span.product-prices__value product-prices__value--best-price') %>% html_text 
  marca <- html_nodes(node,"p.brand") %>% html_text
  producto <- html_nodes(node,"a.product-item__name") %>% html_text


  precio_antes <-   gsub("\\S\\/\\. ", "", precio_antes)
  precio_actual <-   gsub("\\S\\/\\. ", "", precio_actual)


  data.frame(
    ecommerce = "wong",
    marca = ifelse(length(marca)==0, NA, marca),
    producto = producto,
    precio_antes = ifelse(length(precio_antes)==0, NA, precio_antes),
    precio_actual = ifelse(length(precio_actual)==0, NA, precio_actual), 
    stringsAsFactors=F
  )


}



doc <- read_html(iconv(page_source[[1]]), to="UTF-8") %>% 
  html_nodes("div.category-shelf-wrapper")



wong_tvs <- lapply(doc, product_info) %>%
  bind_rows()

Bonus:

The Spanish characters come out garbled:

LG Control Remoto MÃ¡gico AN-MR18BA #Should be Mágico

even though I am using:

doc <- read_html(iconv(page_source[[1]]), to="UTF-8") %>% 
  html_nodes("div.category-shelf-wrapper")

Why?

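【Comments】:

I would strongly recommend against combining multiple questions. The "bonus" question seems much simpler than the rest; I bet that if you posted it as a separate question, you would get an answer quickly. By leaving it at the bottom of a big question, you have hidden it and kept anyone from answering it unless they can also answer the big question.
Could you elaborate on the expected output? I only partly understood this negative example. A positive one would be great, thanks!
@BigDataScientist, please see the changes in the original question.

Tags: r rselenium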

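【Solution 1】:

Edit: a nice specification has been added, thanks!

I assume you again want to keep track of missing elements with NA in the output. Following that assumption, and as in the other question, I would again select the parent element.

The parent element can be targeted, e.g. via XPath: /html/body/div/div/div/div/div/div/div/div/ul/li/div/div[@class = 'product-item__bottom'].

After that, you only have to split the result into the desired format.

Reproducible example: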

library(RSelenium)

rD <- rsDriver() 
remDr <- rD$client

url = "https://www.wong.pe/tecnologia/televisores"
remDr$navigate(url)

productElems = remDr$findElements(
  using = "xpath", 
  value = "/html/body/div/div/div/div/div/div/div/div/ul/li/div/div[@class = 'product-item__bottom']"
)

productInfoRaw = sapply(
  X = productElems, 
  FUN = function(elem) elem$getElementText()
)

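# Split each element's text into its individual lines
splittedRaw = sapply(productInfoRaw, strsplit, split = "\n")
# Entries with 5 lines that include "Online" (presumably items with a single price):
# move the price to position 7 (precio_actual) and mark position 4 (precio_antes) as NA
splitted = lapply(splittedRaw, function(split){
  if(length(split) == 5 && "Online" %in% split){
    split[7] = split[4]
    split[4] = NA
  }
  return(split)
})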

infos = data.frame(
  ecommerce = "wong",
  marca = sapply(splitted, "[", 2),
  producto = sapply(splitted, "[", 1),
  precio_antes = sapply(splitted, "[", 4),
  precio_actual = sapply(splitted, "[", 7)
)
head(infos)

Output:

> head(infos)
  ecommerce   marca                                 producto precio_antes precio_actual
1      wong      LG         LG Smart TV 49" Full HD 49LK5400   S/1,399.00    S/1,299.00
2      wong      LG LG Smart TV 60" 4K UHD 60UK6200 ThinQ AI   S/2,599.00    S/2,299.00
3      wong      LG       LG Control Remoto Mágico AN-MR18BA         <NA>      S/199.00
4      wong     AOC    AOC Smart TV 32'' HD LE32S5970S Linux     S/799.00      S/599.00
5      wong      LG             LG Smart TV 43" FHD 43LK5400   S/1,199.00      S/999.00
6      wong HISENSE  Hisense Televisor LED 32'' HD H3218H4IP   S/1,299.00      S/499.00
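
For readers who would rather stay with rvest on the page source, the same parent-element idea can be sketched roughly as below. Two details of the question's code seem to explain the symptoms: `ifelse(length(x) == 0, NA, x)` returns a single value (the first match in the whole wrapper) because its test has length one, and a selector with a space between two class names treats the second name as a descendant element type rather than a second class on the same element. The HTML is a made-up fragment: the class names come from the question and the XPath above, but the nesting and the helper `first_text` are assumptions for illustration, not the site's actual markup.

```r
library(rvest)

# Hypothetical markup: class names are from the question, the nesting is assumed.
html <- '<ul>
  <li><div class="product-item__bottom">
    <a class="product-item__name">TV A</a>
    <p class="brand">LG</p>
    <span class="product-prices__value">S/1,399.00</span>
    <span class="product-prices__value product-prices__value--best-price">S/1,299.00</span>
  </div></li>
  <li><div class="product-item__bottom">
    <a class="product-item__name">Remote B</a>
    <p class="brand">LG</p>
    <span class="product-prices__value product-prices__value--best-price">S/199.00</span>
  </div></li>
</ul>'

# Text of the first match inside one product node, or NA if nothing matches.
first_text <- function(node, css) {
  txt <- html_text(html_nodes(node, css), trim = TRUE)
  if (length(txt) == 0) NA_character_ else txt[[1]]
}

# One node per product, so every lookup below is scoped to a single item.
items <- html_nodes(read_html(html), "div.product-item__bottom")

rows <- lapply(items, function(item) {
  data.frame(
    producto      = first_text(item, "a.product-item__name"),
    marca         = first_text(item, "p.brand"),
    # Old price: has the base class but not the best-price modifier.
    precio_antes  = first_text(item, "span.product-prices__value:not(.product-prices__value--best-price)"),
    # Current price: both classes on the same element, joined without a space.
    precio_actual = first_text(item, "span.product-prices__value.product-prices__value--best-price"),
    stringsAsFactors = FALSE
  )
})
sketch_df <- do.call(rbind, rows)
print(sketch_df)
```

Because each lookup runs inside one product node, an item without an old price ends up with NA in precio_antes instead of borrowing a value from a neighbouring item.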

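【Comments】:

Why do you need to set remDr$setWindowSize(2900, 3200)?
It may actually be redundant. Sometimes you have to resize the screen to see all elements. At first I could not find some elements, so I included it, but it is probably redundant here.
OK. It may be a good idea to include the call to rsDriver in your answer, so that people can reproduce it in the future.
I see an error in the third row, the one with the NA value: S/199.00 should be in the precio_actual column and NA in precio_antes (because we do not know what the previous price was). How should these cases be handled?
The bounty's grace period is about to end: does the code work for you?

【Solution 2】: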

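Selenium is slow and should only be used as a last resort. In this case it is unnecessary, because the catalog API is exposed. The API also provides richer, well-structured data. Up to 50 items can be requested at a time, so you can step through offsets 0, 50, and so on, until a request returns fewer than 50 items.

The numbers 1000144 and 1000098 in the URL refer to the department and category, and can be extracted from a script node in the HTML of https://www.wong.pe/tecnologia/televisores/tv. I have not done that here to keep things simple, but it is possible if you want a more adaptable scraper.

You can also use paste0 instead of glue. You can use lapply instead of map_df and then bind the rows with do.call and rbind. You can use cbind with as.data.frame instead of bind_cols. I like these functions because they simplify things, avoid type coercion problems, and generally make my code more readable, but nothing stops you from using base R functions.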
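
As a rough illustration of the base R route just described, the sketch below builds the request URL with `paste0` and assembles the rows with `lapply`, `cbind`, `as.data.frame`, `do.call`, and `rbind`. It runs on mock data rather than the live API: the field names mirror the ones used in this answer's code, while the values and the names `mock_cont_list` and `tvs_base` are invented for the example.

```r
# Mock of the parsed API response: one list entry per product.
# Field names follow this answer's code; the values are made up.
make_product <- function(brand, name, price, list_price, qty) {
  list(brand = brand, productName = name,
       items = list(list(sellers = list(list(
         commertialOffer = list(Price = price, ListPrice = list_price,
                                AvailableQuantity = qty))))))
}
mock_cont_list <- list(
  make_product("LG",  "TV A", 1299, 1399, 276),
  make_product("AOC", "TV B",  599,  799,  90)
)

# paste0 in place of glue for the request URL.
i <- 0
url <- paste0("https://www.wong.pe/api/catalog_system/pub/products/search/",
              "?&fq=C:/1000144/1000098/&_from=", i, "&_to=", i + 49)

# lapply + cbind/as.data.frame in place of map_df + bind_cols.
rows <- lapply(mock_cont_list, function(p) {
  offer <- p$items[[1]]$sellers[[1]]$commertialOffer
  cbind(
    data.frame(source = "wong.pe", stringsAsFactors = FALSE),
    as.data.frame(p[c("brand", "productName")], stringsAsFactors = FALSE),
    as.data.frame(offer[c("Price", "ListPrice", "AvailableQuantity")])
  )
})

# do.call + rbind in place of the row binding done by map_df.
tvs_base <- do.call(rbind, rows)
print(tvs_base)
```

Passing `stringsAsFactors = FALSE` explicitly keeps the text columns as character vectors on R versions before 4.0, which is one of the coercion issues the tidyverse helpers avoid by default.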

For simplicity I have kept the original variable names. You can change them with names(tvs_df) <- … or with set_names(…) after calling map_df, i.e. map_df(…) %>% set_names(…):

library(httr)   # for `GET`
library(glue)   # for `glue`, which allows cleaner syntax than `paste0`
library(purrr)  # for `map_df` to map over list and return as dataframe
library(dplyr)  # for `bind_cols`

i <- 0
cont_list <- list()

# Send requests and append the data to `cont_list` until fewer than 50 items are returned.
repeat {
    url <- glue("https://www.wong.pe/api/catalog_system/pub/products/search/",
                "?&fq=C:/1000144/1000098/&_from={i}&_to={i + 49}")
    cont <- content(GET(url))
    cont_list <- c(cont_list, cont)
    if (length(cont) < 50) break
    i <- i + 50
}

# Names of desired data.
datl <- list(l1 = c("brand", "productName"),
             l2 = c("Price", "ListPrice", "AvailableQuantity"))

# Extract data 
tvs_df <- map_df(cont_list,
                 ~ bind_cols(source = "wong.pe", .[datl$l1],
                             .$items[[1]]$sellers[[1]]$commertialOffer[datl$l2]))

# A tibble: 54 x 6
   source  brand     productName                                 Price ListPrice AvailableQuantity
   <chr>   <chr>     <chr>                                       <dbl>     <dbl>             <int>
 1 wong.pe LG        "LG Smart TV 49\" Full HD 49LK5400"          1299      1399               276
 2 wong.pe LG        "LG Smart TV 60\" 4K UHD 60UK6200 ThinQ AI"  2299      2599                18
 3 wong.pe LG        LG Control Remoto Mágico AN-MR18BA            199       199                37
 4 wong.pe AOC       AOC Smart TV 32'' HD LE32S5970S Linux         599       799                90
 5 wong.pe LG        "LG Smart TV 43\" FHD 43LK5400"               999      1199               303
 6 wong.pe Hisense   Hisense Televisor LED 32'' HD H3218H4IP       499      1299                22
 7 wong.pe LG        "LG Smart TV 55\" 4K UHD 55UK6200 ThinQ AI"  1799      2199                31
 8 wong.pe Panasonic Panasonic Smart TV Viera 32'' HD 32FS500      799       999                 4
 9 wong.pe AOC       AOC Smart TV 55'' 4K UHD 55U7970 Linux       1299      2499                 3
10 wong.pe AOC       AOC Televisor LED 32'' HD 32M1370             499       699                 4
# … with 44 more rows
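
The stopping rule of the paging loop can be checked without touching the network. In the sketch below, `fetch_page` and `fake_catalog` are hypothetical stand-ins for `content(GET(url))` and the remote catalog; only the loop structure is taken from the code above.

```r
# Hypothetical stand-in for the remote catalog: 120 fake items.
fake_catalog <- lapply(seq_len(120), function(k) list(productName = paste("item", k)))

# Hypothetical stand-in for content(GET(url)): returns the items whose
# zero-based positions fall between `from` and `to`, or an empty list.
fetch_page <- function(from, to) {
  if (from + 1 > length(fake_catalog)) return(list())
  fake_catalog[(from + 1):min(to + 1, length(fake_catalog))]
}

page_size <- 50
i <- 0
all_items <- list()
requests <- 0

repeat {
  page <- fetch_page(i, i + page_size - 1)
  requests <- requests + 1
  all_items <- c(all_items, page)
  # A short page means the catalog is exhausted.
  if (length(page) < page_size) break
  i <- i + page_size
}

cat("requests:", requests, "items:", length(all_items), "\n")
```

With 120 mock items the loop issues three requests (50, 50, and 20 items). If the item count were an exact multiple of 50, this mock would serve one extra, empty page before the loop stops.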

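【Comments】:

How do you detect when a page exposes an API? For example, would you mind taking a look at https://www.linio.com.pe/ and https://www.lacuracao.pe/?
Ty. It works, but I prefer BigDataScientist's approach because I understand RSelenium better.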