【Posted】: 2018-11-23 20:58:45
【Question】:
From this page, I want to scrape the "Types of Things to Do in Miami" list (you can find it near the end of the page). Here is what I have so far:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header so the request is not rejected
user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"
headers = {'User-Agent': user_agent}
new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"

# Get the response from the URL
response = requests.get(new_url, headers=headers)

# Parse the decoded response text (no manual re-encoding is needed)
soup = BeautifulSoup(response.text, "lxml")

# Select every link that carries the pill class
tag_elements = soup.find_all("a", {"class": "attractions-attraction-overview-main-Pill__pill--23S2Q"})

# Iterate over tag_elements and extract their strings
tags_list = []
for i in tag_elements:
    tags_list.append(i.string)
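The problem is that I get values like 'Good for Couples (201)', 'Good for Big Groups (130)', 'Good for Kids (100)', which come from the "Commonly Searched For in Miami" area of the page, just below the "Types of Things..." section. I am also missing some of the values I need, such as "Traveler Resources (7)", "Day Trips (7)", etc. The two lists, "Things to Do..." and "Commonly Searched...", use the same class name, and I am passing that class to soup.findAll(), so I guess that may be the cause of the problem. What is the correct way to do this? Should I take a different approach?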
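To make the suspected cause concrete, here is a minimal sketch of one way to separate two lists that share a class: find the section heading first, then search for links only inside that heading's container. The sample markup, the "pill" class name, the use of h2 headings, and the helper pills_under_heading are all assumptions made for illustration; the real TripAdvisor markup is not reproduced here and may differ (for instance, some items may be loaded by JavaScript and not be present in the fetched HTML at all).

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup. The real page structure is an assumption.
SAMPLE_HTML = """
<div>
  <div class="section">
    <h2>Types of Things to Do in Miami</h2>
    <a class="pill">Traveler Resources (7)</a>
    <a class="pill">Day Trips (7)</a>
  </div>
  <div class="section">
    <h2>Commonly Searched For in Miami</h2>
    <a class="pill">Good for Couples (201)</a>
    <a class="pill">Good for Kids (100)</a>
  </div>
</div>
"""


def pills_under_heading(soup, heading_prefix, pill_class):
    """Return link texts found in the container of the matching heading.

    Assumes the heading is an <h2> and that the links are descendants
    of the heading's direct parent (both are illustrative assumptions).
    """
    for heading in soup.find_all("h2"):
        if heading.get_text(strip=True).startswith(heading_prefix):
            container = heading.parent
            links = container.find_all("a", class_=pill_class)
            return [link.get_text(strip=True) for link in links]
    # No heading matched: return an empty list instead of raising
    return []


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
types_list = pills_under_heading(soup, "Types of Things to Do", "pill")
print(types_list)
```

Scoping the search to a container, rather than calling find_all on the whole document, is what keeps the second list's identically classed links out of the result.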
【Comments】:
Tags: python web-scraping beautifulsoup tripadvisor