Scraping flights data from www.kayak.it: answers

【Question title】: Scraping flights data from www.kayak.it
【Posted】: 2021-10-06 14:52:10
【Question description】:

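I am trying to scrape data from Kayak with Selenium, but my code does not work and I cannot figure out why.

I tried to dismiss the privacy button as follows, but that does not seem to solve the problem.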

cookie_banner = wd.find_elements_by_css_selector(".onetrust-accept-btn-handler")

cookie_banner[0].click()

Can you help me? Thank you very much!

!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
import logging
from selenium.webdriver.remote.remote_connection import LOGGER
LOGGER.setLevel(logging.WARNING)
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
from tqdm import tqdm_notebook as tqdm
import pandas
import json
import pprint

chrome_options = webdriver.ChromeOptions() 
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36")

wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)

wd.get("https://www.kayak.it/explore/MIL-anywhere/20210801,20210801") 

import pprint 
detail_travels = [] 
for travel in list_travels: 
   url = travel.find_elements_by_css_selector("a")[0].get_attribute("href")
   destination = "" 
   country = ""
   travel_id = ""
   if(len(travel.find_elements_by_css_selector(".City__Name")) > 0): 
     destination = travel.find_elements_by_css_selector(".City__Name")[0].text 
   if(len(travel.find_elements_by_css_selector(".Country__Name")) > 0):
     country = travel.find_elements_by_css_selector(".Country__Name")[0].text
   travel_id = url

   detail_travels.append({'url': url,
                        'destination': destination,
                        'country': country,
                        'travel_id': travel_id})

len(detail_travels)
pprint.pprint(detail_travels[0:2])

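【Question discussion】:

Tags: python selenium web-scraping css-selectors selenium-chromedriver

【Solution 1】:

Sometimes Selenium can be a bit too much overhead. Just by inspecting what the page does in the browser's developer tools, we find that it calls this API: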

https://www.kayak.it/s/horizon/exploreapi/destinations?airport=MIL&budget=&depart=20210801&return=20210801&duration=&exactDates=true&flightMaxStops=&stopsFilterActive=false&topRightLat=59.902761633461935&topRightLon=25.09658365167229&bottomLeftLat=26.101275008286677&bottomLeftLon=-6.719822598327707&zoomLevel=4&selectedMarker=&themeCode=&selectedDestination=

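It may be much better to just use the Python requests module and do a GET on this URL. The URL may need some tweaking to arrive at a form the API actually accepts, but I would at least try that before dealing with HTML that is rendered from structured data...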
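A minimal sketch of that suggestion, assuming the endpoint answers a plain GET with JSON. The endpoint and parameter names are copied from the URL above; the function names `build_explore_url` and `fetch_destinations` and the placeholder User-Agent are hypothetical, and the site may well require cookies or extra headers that are not shown here.

```python
from urllib.parse import urlencode

# Endpoint and parameter names copied from the URL seen in the browser tools.
EXPLORE_API = "https://www.kayak.it/s/horizon/exploreapi/destinations"


def build_explore_url(airport="MIL", depart="20210801", return_date="20210801"):
    """Rebuild the query string observed in the browser's network tab."""
    params = {
        "airport": airport,
        "budget": "",
        "depart": depart,
        "return": return_date,
        "duration": "",
        "exactDates": "true",
        "flightMaxStops": "",
        "stopsFilterActive": "false",
        # Map bounding box, kept exactly as it appeared in the observed URL.
        "topRightLat": "59.902761633461935",
        "topRightLon": "25.09658365167229",
        "bottomLeftLat": "26.101275008286677",
        "bottomLeftLon": "-6.719822598327707",
        "zoomLevel": "4",
        "selectedMarker": "",
        "themeCode": "",
        "selectedDestination": "",
    }
    return f"{EXPLORE_API}?{urlencode(params)}"


def fetch_destinations(url, timeout=30):
    """GET the URL and return the decoded JSON body.

    Assumption: the endpoint returns JSON to a plain GET. If it needs
    cookies or other headers, raise_for_status() or json() will fail.
    """
    import requests  # imported lazily so the URL builder works without it

    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value, adjust as needed
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Calling `fetch_destinations(build_explore_url())` performs the request. The layout of the response is not described in this thread, so it is worth printing the top-level keys before writing any parsing code.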

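【Discussion】:

Sorry, I am not able to use the API; I have only just started learning scraping. Could my code be failing because of the privacy button that sometimes appears on screen? If so, how can I close it? Thank you very much!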

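【Solution 2】:

Have you tried waiting for the button to load before having Selenium click it? When I loaded the link in my browser, there was a short delay before the privacy pop-up appeared.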

【Discussion】:

I added the following code, but it does not work: 'time.sleep(3) cookie_banner = wd.find_elements_by_css_selector("button") if len(cookie_banner) > 0: print('Privacy found') cookie_banner[0].click()'