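【Question title】: Scraping multiple pages of Zillow
【Posted】: 2021-12-27 05:23:03
【Question】:

I am trying to scrape roughly 54 "Agent listings" and 11 "Other listings" spread across two pages of Zillow, using the URL included below, but my code only returns the first 20 "Agent listings" results from the first search results page. How can I modify my code to get all results on all pages for both "Agent listings" and "Other listings"?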

library(rvest)
library(dplyr)

res_all <- NULL

for (page_result in 1:40) {
  zillow_url = paste0("https://www.zillow.com/providence-ri/duplex/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22Providence%2C%20RI%22%2C%22mapBounds%22%3A%7B%22west%22%3A-71.48892251635742%2C%22east%22%3A-71.36017648364258%2C%22south%22%3A41.77131876826507%2C%22north%22%3A41.862664689400106%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A26637%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sf%22%3A%7B%22value%22%3Afalse%7D%2C%22tow%22%3A%7B%22value%22%3Afalse%7D%2C%22con%22%3A%7B%22value%22%3Afalse%7D%2C%22apco%22%3A%7B%22value%22%3Afalse%7D%2C%22land%22%3A%7B%22value%22%3Afalse%7D%2C%22apa%22%3A%7B%22value%22%3Afalse%7D%2C%22manu%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A13%7D")

  # page_result is never used in the URL, so every iteration requests the same first page
  zpg <- read_html(zillow_url)

  zillow_pg <- tibble(
    addr    = zpg %>% html_nodes(".list-card-addr") %>% html_text(),
    price   = zpg %>% html_nodes(".list-card-price") %>% html_text(),
    details = zpg %>% html_nodes(".list-card-details") %>% html_text(),
    heading = zpg %>% html_nodes(".list-card-info a") %>% html_text(),
    type    = zpg %>% html_nodes(".list-card-statusText") %>% html_text()
  )

  # append this page's rows and drop exact duplicates
  res_all <- distinct(bind_rows(res_all, zillow_pg))
}

【Comments】:

    Tags: r web-scraping rvest


    【Solution 1】:

    You need RSelenium because the page is loaded dynamically.

    Here is a partial answer that extracts the prices.

    Start the browser

    library(rvest)
    library(dplyr)
    library(RSelenium)

    # `url` is assumed to hold the Zillow search URL from the question
    driver <- rsDriver(browser = "firefox")
    remDr <- driver[["client"]]
    remDr$navigate(url)
    

    Now load all the listings

    remDr$findElement(using = 'xpath', value = '//*[@id="grid-search-results"]/div[1]/h1')$clickElement()
    webElem <- remDr$findElement("css", "body")
    # Scroll to the end of the page, then back to the top.
    webElem$sendKeysToElement(list(key = "end"))
    webElem$sendKeysToElement(list(key = "home"))
    

    If you do not get the prices for all of the listings, repeat the last two steps.
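The "repeat until everything has loaded" step can be automated. The following is only a sketch: `scroll_until_stable` is a hypothetical helper (not part of RSelenium) that takes two caller-supplied functions, one that scrolls and one that counts the loaded items, and stops once the count no longer grows. The `max_tries` and `pause` defaults are arbitrary assumptions.

```r
# Hypothetical helper: repeat a scroll action until the item count stops growing.
# scroll_once and count_items are functions supplied by the caller.
scroll_until_stable <- function(scroll_once, count_items, max_tries = 10, pause = 1) {
  current <- count_items()
  for (i in seq_len(max_tries)) {
    previous <- current
    scroll_once()
    Sys.sleep(pause)            # give the page time to load more cards
    current <- count_items()
    if (current == previous) {  # nothing new appeared, assume we are done
      break
    }
  }
  current
}

# With the session from this answer, the two functions might look like this
# (left commented out because it needs a live browser):
# scroll_once <- function() {
#   webElem$sendKeysToElement(list(key = "end"))
#   webElem$sendKeysToElement(list(key = "home"))
# }
# count_items <- function() {
#   length(remDr$findElements(using = "css selector", value = ".list-card-price"))
# }
# scroll_until_stable(scroll_once, count_items)
```

If the page loads slowly, a single scroll may add nothing even though more cards are pending, so a longer `pause` may be needed in practice.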

    remDr$getPageSource()[[1]] %>% 
      read_html()   %>% 
      html_nodes(".list-card-price") %>% html_text()
     [1] "$399,999"   "$449,900"   "$399,000"   "$469,900"   "$310,000"   "$319,900"   "$404,900"   "$320,000"   "$529,000"   "$750,000"   "$335,000"   "$299,000"  
    [13] "$349,900"   "$314,900"   "$369,999"   "$359,000"   "$149,900"   "$309,900"   "$377,000"   "$360,000"   "$699,900"   "$410,000"   "$634,900"   "$310,000"  
    [25] "$695,000"   "$395,000"   "$339,900"   "$399,900"   "$350,000"   "$369,900"   "$639,000"   "$3,995,000" "$799,000"   "$699,000"   "$349,000"   "$448,000" 
    

    Now go to page 2 and get the remaining listings

    remDr$findElement(using = 'xpath', value = '//*[@id="grid-search-results"]/div[3]/nav/ul/li[5]/a')$clickElement()
    remDr$getPageSource()[[1]] %>% 
      read_html()   %>% 
      html_nodes(".list-card-price") %>% html_text()
     [1] "$575,000"   "$299,000"   "$369,900"   "$345,500"   "$799,000"   "$380,000"   "$300,000"   "$1,295,000" "$575,000"   "$575,000"   "$599,900"   "$799,000"  
    [13] "$474,900"   "$399,900" 
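Once each page has produced its own character vector of prices, the vectors can be stacked into a single data frame that records where each price came from. This is a minimal sketch in base R; `stack_results` and the labels `page_1` and `page_2` are hypothetical names chosen for illustration.

```r
# Hypothetical helper: stack a named list of price vectors into one data frame,
# keeping the list name as a `source` column.
stack_results <- function(results) {
  rows <- lapply(names(results), function(label) {
    prices <- results[[label]]
    data.frame(
      source = rep(label, length(prices)),  # rep() keeps empty pages valid
      price = prices,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Example with a few of the values shown above
all_prices <- stack_results(list(
  page_1 = c("$399,999", "$449,900", "$399,000"),
  page_2 = c("$575,000", "$299,000")
))
print(all_prices)
```

Keeping the source label makes it easy to check that each page contributed the expected number of rows.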
    

    Now get the prices from the "Other listings" section.

    remDr$findElement(using = 'xpath', value = '//*[@id="grid-search-results"]/div[1]/div/div[1]/div/button[2]')$clickElement()
    remDr$getPageSource()[[1]] %>% 
      read_html()   %>% 
      html_nodes(".list-card-price") %>% html_text()
     [1] "$315,000"      "$350,000"      "$439,000"      "$350,000"      "$395,000"      "$315,000"      "Est. $396,600" "Est. $681,300" "$234,000"     
    [10] "$449,900"      "$249,900"      "Est. $310,300"
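The scraped prices are strings, and some entries in the "Other listings" output carry an "Est." prefix. For further analysis they usually need to be numeric. The sketch below uses only base R; `parse_price` is a hypothetical helper, and it assumes the only formats present are the ones shown in the output above.

```r
# Hypothetical helper: convert strings such as "$449,900" or "Est. $396,600"
# into numbers, and flag which values were estimates.
parse_price <- function(x) {
  estimated <- grepl("^Est\\.", x)
  # Drop the "Est." prefix first so its dot is not mistaken for a decimal point
  cleaned <- sub("^Est\\.\\s*", "", x)
  cleaned <- gsub("[^0-9.]", "", cleaned)  # remove "$", "," and any spaces
  data.frame(
    raw = x,
    estimated = estimated,
    price = as.numeric(cleaned),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_price(c("$315,000", "Est. $396,600", "$3,995,000"))
print(parsed)
```

Keeping the `estimated` flag lets you filter out Zillow's estimated values if only listed prices are of interest.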
    

    【Comments】:
