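【Question title】:Unable to grab certain links from dynamic content
【Posted】:2018-12-22 12:24:33
【Question】:

I've written a script in Python with Selenium to scrape the links of the different properties listed in the area to the right of the map on the landing page.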

Link to the landing page

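When I manually click each block in Chrome, a new tab opens with a link containing /for_sale/, whereas the links my script fetches contain /homedetails/.

How can I get the number of results (e.g. 153 homes for sale) along with the correct links to the properties?

My attempt so far: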

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.zillow.com/homes/33155_rb/"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)

itemcount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"#map-result-count-message h2")))
print(itemcount.text)

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".zsg-photo-card-overlay-link"))):
    print(item.get_attribute("href"))
driver.quit()

One of the current outputs:

https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/

One of the expected outputs:

https://www.zillow.com/homes/for_sale/Miami-FL-33155/house_type/44184455_zpid/72458_rid/globalrelevanceex_sort/25.776783,-80.256072,25.695446,-80.364905_rect/12_zm/0_mmm/

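【Comments】:

  • As for itemcount, I believe it is populated after the page loads, so you will need some kind of delay/sleep. As for the incorrect links, your CSS selector is picking up the links with homedetails, so just change it to whatever you need.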
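One hedged way to implement the "delay" suggested in the comment above without a fixed sleep is a custom wait condition that only succeeds once the element's text is non-blank. This is a minimal sketch: the class name `text_is_populated` is hypothetical (not part of Selenium), and it assumes the count lives in the `#map-result-count-message h2` element used in the question.

```python
class text_is_populated:
    """Custom wait condition: truthy once the located element has non-blank text."""

    def __init__(self, locator):
        # locator is a (strategy, value) pair, e.g. (By.CSS_SELECTOR, "h2")
        self.locator = locator

    def __call__(self, driver):
        text = driver.find_element(*self.locator).text.strip()
        # WebDriverWait keeps polling while the returned value is falsy.
        return text or False


# Intended use with a real driver (assumption, not executed here):
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import WebDriverWait
#   count_text = WebDriverWait(driver, 10).until(
#       text_is_populated((By.CSS_SELECTOR, "#map-result-count-message h2")))
#   print(count_text)
```

By default `WebDriverWait` ignores `NoSuchElementException` while polling, so the condition does not need to handle a missing element itself.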

Tags: python python-3.x selenium selenium-webdriver web-scraping


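【Solution 1】:

While analysing the /homedetails/ and /for_sale/ links, I found that /homedetails/ links usually contain a code like this:

44206318_zpid

This code acts as a unique identifier for the listing. I extracted it and appended it to:

https://www.zillow.com/homes/for_sale/

So the final link to the listing looks like this:

https://www.zillow.com/homes/for_sale/44206318_zpid

This is a valid link and it points to the listing.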
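The transformation described above can be isolated as a small pure function, which makes it easy to check without a browser. This is only a sketch: the helper name `to_for_sale_url` and the constant `FOR_SALE_BASE` are hypothetical, and it uses a regular expression to find the zpid rather than relying on the position of the path segment.

```python
import re

FOR_SALE_BASE = "https://www.zillow.com/homes/for_sale/"


def to_for_sale_url(href):
    """Return the /for_sale/ URL for a listing href, or None if it has no zpid."""
    match = re.search(r"(\d+_zpid)", href)
    if match is None:
        return None
    return FOR_SALE_BASE + match.group(1)


print(to_for_sale_url(
    "https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/"
))
# prints https://www.zillow.com/homes/for_sale/44206318_zpid
```

Searching for the `<digits>_zpid` pattern means the helper does not depend on whether the href ends with a trailing slash.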

Here is the final script:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.zillow.com/homes/33155_rb/"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)

itemcount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"#map-result-count-message h2")))
print(itemcount.text)

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".zsg-photo-card-overlay-link"))):
    href = item.get_attribute("href")
    # Skip cards without a listing id; the zpid is the last path segment before the trailing slash.
    if href and "zpid" in href:
        print("https://www.zillow.com/homes/for_sale/{}".format(href.split('/')[-2]))
driver.quit()

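I hope this helps.

【Comments】:

    【Solution 2】:

    You can iterate over the pagination divs and keep a running counter of the number of homes shown on each page. This answer uses BeautifulSoup to parse the HTML.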

    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup as soup
    import re, time

    def home_num(_d: soup) -> int:
      # Count the listing anchors (/homedetails/...) in the parsed page.
      return len(_d.find_all('a', {'href': re.compile('^/homedetails/')}))

    def pending_links(d, _seen_links):
      # Anchors whose href contains /homes/for_sale/ and has not been visited yet.
      _found = []
      for i in d.find_elements(By.TAG_NAME, 'a'):
        try:
          _href = i.get_attribute("href")
        except StaleElementReferenceException:
          continue  # the anchor was re-rendered while we were iterating
        if isinstance(_href, str) and '/homes/for_sale/' in _href and _href not in _seen_links:
          _found.append((i, _href))
      return _found

    d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
    d.get('https://www.zillow.com/homes/33155_rb/')
    homecount = home_num(soup(d.page_source, 'html.parser'))
    _seen_links, _result_links = [], []
    _start = pending_links(d, _seen_links)
    while _start:
      # The href is read up front because the element may go stale after the click.
      _new_start, _href = _start[0]
      _seen_links.append(_href)
      try:
        _new_start.send_keys('\n')
        time.sleep(5)
      except Exception:
        pass  # the link could not be activated; skip it
      else:
        _result_links.append(_href)
        homecount += home_num(soup(d.page_source, 'html.parser'))
      _start = pending_links(d, _seen_links)
    

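    【Comments】:

    • Your script throws a stale element reference error when it reaches the line just before the while loop, @Ajax1234.
    【Solution 3】:

    If you inspect the images on the right side of the page, you will see "homedetails" rather than "for_sale". Just try opening a link in a new tab and you will see that the actual link is a "homedetails" one.

    【Comments】:

    • The links I'm talking about are not in the page source. They are generated dynamically in a new tab after each container is clicked, @rishav prasher.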
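If the target URL really only exists once a click has opened a new tab, as the comment above describes, one possible approach is to click the element, switch to the newly opened window handle, and read its URL. The sketch below is an illustration under that assumption: the function name `url_opened_by_click` is hypothetical, it has not been verified against the Zillow page, and a real script would likely need an explicit wait for the new tab to appear and finish navigating.

```python
def url_opened_by_click(driver, element):
    """Click element and return the URL of the tab it opens.

    Falls back to the current tab's URL if no new tab appears.
    """
    before = set(driver.window_handles)
    original = driver.current_window_handle
    element.click()
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        return driver.current_url
    driver.switch_to.window(new_handles[0])
    try:
        return driver.current_url
    finally:
        # Close the new tab and return focus to the results page.
        driver.close()
        driver.switch_to.window(original)
```

Calling this for each result card (for example, the elements matched by `.zsg-photo-card-overlay-link` in the question) would collect one URL per card while leaving the results page as the active tab.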