【Question Title】: Python scraping 'things to do' from tripadvisor
【Posted】: 2018-11-23 20:58:45
【Question】:

On this page, I want to scrape the list "Types of Things to Do in Miami" (you can find it near the end of the page). Here is what I have so far:

import requests
from bs4 import BeautifulSoup

# Define header to prevent errors
user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"

headers = {'User-Agent': user_agent}

new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
# Get response from url
response = requests.get(new_url, headers = headers)
# Encode response for parsing
html = response.text.encode('utf-8')
# Soupify response
soup = BeautifulSoup(html, "lxml")

tag_elements = soup.findAll("a", {"class":"attractions-attraction-overview-main-Pill__pill--23S2Q"})

# Iterate over tag_elements and extract strings
tags_list = []
for i in tag_elements:
    tags_list.append(i.string)

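The problem is that I get values like 'Good for Couples (201)', 'Good for Big Groups (130)', 'Good for Kids (100)', which come from the "Commonly Searched For in Miami" area that sits below the "Types of Things..." section of the page. I also don't get some of the values I need, such as "Traveler Resources (7)", "Day Trips (7)", etc. The two lists, "Things to do..." and "Commonly searched for...", use the same class name, and I am selecting by class in soup.findAll(), which I guess is the cause of the problem. What is the correct way to do this? Should I take a different approach?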

【Comments】:

    Tags: python web-scraping beautifulsoup tripadvisor


    【Solution 1】:

    This is very simple to do in the browser:

    filters = driver.execute_script("return [...document.querySelectorAll('.filterName a')].map(a => a.innerText)")
    

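    【Comments】:

      【Solution 2】:

      It looks like you need to use Selenium. The problem is that the dropdown does not show the remaining options until you click it.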

      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options
      from bs4 import BeautifulSoup
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      
      options = Options()
      driver = webdriver.Chrome(options=options)
      driver.get('https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html')
      
      toggle_xpath = '//*[@id="component_3"]/div/div/div[12]/div[1]/div/div/div/div[1]/span'
      # Wait until the dropdown toggle is present in the DOM
      toggle = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, toggle_xpath)))
      
      # Scroll the toggle into view, then click it via JavaScript to expand the list
      driver.execute_script("arguments[0].scrollIntoView();", toggle)
      driver.execute_script("arguments[0].click();", toggle)
      
      
      html = driver.page_source
      soup = BeautifulSoup(html, 'lxml')
      
      items = soup.findAll('a', {'class':'attractions-attraction-overview-main-Pill__pill--23S2Q'})
      # Each item is an <a> tag, so item['href'] gives you the link in addition to the text.
      
      for item in items:
          print(item.get_text())
      
      
      driver.quit()
      

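      【Comments】:

        【Solution 3】:

        I think you need to be able to click "show more" to see all the available items, so use something like Selenium. The code below includes waits to make sure that all the elements are present and that the dropdown is clickable.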

        from selenium import webdriver
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.common.by import By
        
        d = webdriver.Chrome()
        d.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")
        WebDriverWait(d,5).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".filter_list_0 div a")))
        WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#taplc_attraction_filters_clarity_0 span.ui_icon.caret-down"))).click()
        tag_elements = WebDriverWait(d,5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".filter_list_0 div a")))
        tags_list = [i.text for i in tag_elements]
        print(tags_list)
        d.quit()
        


        Without Selenium I can only get 15 items:

        import requests
        from bs4 import BeautifulSoup
        
        user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"
        headers = {'User-Agent': user_agent}
        new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
        response = requests.get(new_url, headers = headers)
        soup = BeautifulSoup(response.content, "lxml")
        tag_elements = soup.select('#component_3 > div > div > div:nth-of-type(12) > div:nth-of-type(1) > div > div a')
        
        tags_list = [i.text for i in tag_elements]       
        print(tags_list)
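
A less position-dependent way to separate two lists that share a class is to locate the section heading by its text and search only inside that section. The following is a minimal sketch on made-up HTML: the markup, the `pill` class, and the helper name `links_under_heading` are all assumptions for illustration, not TripAdvisor's actual structure, so the heading tag and the parent lookup would need adjusting against the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page structure differs.
SAMPLE_HTML = """
<div class="section">
  <h3>Types of Things to Do in Miami</h3>
  <a class="pill" href="/Attractions-a">Tours (277)</a>
  <a class="pill" href="/Attractions-b">Day Trips (7)</a>
</div>
<div class="section">
  <h3>Commonly Searched For in Miami</h3>
  <a class="pill" href="/Attractions-c">Good for Couples (201)</a>
</div>
"""


def links_under_heading(html, heading_text):
    """Return the text of every <a> in the section whose heading contains heading_text."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the first h2/h3 whose text contains the wanted heading.
    heading = soup.find(
        lambda tag: tag.name in ("h2", "h3") and heading_text in tag.get_text()
    )
    if heading is None:
        return []
    # Restrict the search to the nearest enclosing <div> of that heading.
    section = heading.find_parent("div")
    return [a.get_text(strip=True) for a in section.find_all("a")]


tags_list = links_under_heading(SAMPLE_HTML, "Types of Things to Do")
print(tags_list)
```

Because the links are looked up relative to the heading, the identically classed links in the other section are never matched. This only sees what is in the HTML it is given, so items that appear only after clicking the dropdown still require a browser.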
        

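        【Comments】:

        • The line WebDriverWait(d,5).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".filter_list_0 div a"))) raises TimeoutException: Message: with no message shown. I changed the timeout to 10 and 20, but the result is the same.
        • Odd. What happens if you comment out that line and increase the wait on the next line to 10? You can always use execute_script on the dropdown, or otherwise move to the element.
        【Solution 4】:

        Getting only the content under the Types of Things to Do in Miami header is a bit tricky. To do so, you need to define the selectors in an organized manner, as I did below. The following script should click the See all button under that header. Once the click has been triggered, the script parses the relevant content you are looking for: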

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import ui
        from bs4 import BeautifulSoup
        
        driver = webdriver.Chrome()
        wait = ui.WebDriverWait(driver, 10)
        driver.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")
        
        # Wait until the "See all" caret under the first section can be found
        show_more = wait.until(lambda driver: driver.find_element(By.CSS_SELECTOR, "[class='ui_container'] div:nth-of-type(1) .caret-down"))
        driver.execute_script("arguments[0].click();",show_more)
        soup = BeautifulSoup(driver.page_source,"lxml")
        items = [item.text for item in soup.select("[class='ui_container'] div:nth-of-type(1) a[href^='/Attractions-']")]
        print(items)   
        driver.quit()
        

        Output it produces:

        ['Tours (277)', 'Outdoor Activities (255)', 'Boat Tours & Water Sports (184)', 'Shopping (126)', 'Nightlife (126)', 'Spas & Wellness (109)', 'Fun & Games (67)', 'Transportation (66)', 'Museums (61)', 'Sights & Landmarks (54)', 'Nature & Parks (54)', 'Food & Drink (27)', 'Concerts & Shows (25)', 'Classes & Workshops (22)', 'Zoos & Aquariums (7)', 'Traveler Resources (7)', 'Day Trips (7)', 'Water & Amusement Parks (5)', 'Casinos & Gambling (3)', 'Events (2)']
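
If the category name and the count are needed separately, each label in a list like the one above can be split with a regular expression. This is a small sketch that assumes every label has the form `Name (count)`; the helper name `split_label` is hypothetical and not part of the answer's script.

```python
import re

# Matches "<name> (<count>)", e.g. "Tours (277)".
LABEL_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<count>\d+)\)$")


def split_label(label):
    """Split 'Name (123)' into ('Name', 123); return (label, None) if it does not match."""
    label = label.strip()
    match = LABEL_RE.match(label)
    if match is None:
        return label, None
    return match.group("name"), int(match.group("count"))


items = ['Tours (277)', 'Boat Tours & Water Sports (184)', 'Events (2)']
print([split_label(item) for item in items])
# [('Tours', 277), ('Boat Tours & Water Sports', 184), ('Events', 2)]
```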
        

        【Comments】:
