【Question title】: Python html parsing partial class names
【Posted】: 2020-04-10 10:45:20
【Question description】:

I'm trying to parse a web page with bs4, but the elements I want to access all have different class names. Example: class='list-item Listing ... id-12984' and class='list-item Listing ... id-10359'

import requests
from bs4 import BeautifulSoup

def preownedaston(url):
    preownedaston_resp = requests.get(url)

    if preownedaston_resp.status_code == 200:
        bs = BeautifulSoup(preownedaston_resp.text, 'lxml')
        posts = bs.find_all('div', class_='') #don't know what to put here
        for p in posts:
            title_year = p.find('div', class_='inset').find('a').find('span', class_='model_year').text
            print(title_year)

preownedaston('https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760')

Is there a way to match on a partial class name, such as class_='list-item '?

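【Comments】:

  • I don't think your code even gets to the point where the elements you are looking for exist. Look at the page's source: view-source:https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760. I can't find any list-item in that source, and Beautifulsoup won't either.
  • @Tomalak I found two divs that represent the "posts" (one per car), and each div has a class like: 'list-item listing usedVehiclesSearch usedvehicles usedcars make-aston-martin model- v12-vantage reg-s00754 location-3d81f6e3a2cfd67ead2b23e36fab68948d711d43 h-3d81f6e3a2cfd67ead2b23e36fab68948d711d43 aston-martin-bordeaux franchise-628fa2b4b3ef528010bde94a132f98717eb30c45 h-628fa2b4b3ef528010bde94a132f98717eb30c45 id-10359'
  • Not when you view the page's source. You are looking at the live DOM in the browser's developer tools, which is a completely different thing. Beautifulsoup won't see that, because all of it is generated by Javascript and Beautifulsoup does not run any Javascript. Use the view-source: link (copy and paste it) to see what Beautifulsoup will see.
  • @Tomalak Thank you, I didn't know you needed the page source to parse a page directly, since all my previous scripts worked from the DOM. I managed to find where each car's details are, but how do I access them with bs4? (Sorry, I'm still a beginner.)
  • @Tomalak Never mind, the other answers explain it well. Thanks for your time.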

Tags: python parsing beautifulsoup


【Solution 1】:

The CSS selector for matching part of an attribute's value is:

div[class*='list-item'] # *= matches any div whose class attribute contains this substring
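
To illustrate how a substring match differs from a class-token match, here is a minimal sketch. The HTML is hypothetical: it is modelled on the class strings quoted in the question, and the old-list-item-archive div is an invented decoy, not something taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class strings quoted in the question.
html = """
<div class="list-item listing usedcars id-12984"><span class="model_year">2010</span></div>
<div class="list-item listing usedcars id-10359"><span class="model_year">2011</span></div>
<div class="old-list-item-archive">decoy: contains the substring only</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Substring match on the raw class attribute value (also catches the decoy).
by_substring = soup.select("div[class*='list-item']")

# Token match: class_ is compared against each individual class name.
by_token = soup.find_all("div", class_="list-item")

# CSS class selector, which is also a token match.
by_css_class = soup.select("div.list-item")

print(len(by_substring), len(by_token), len(by_css_class))
for div in by_token:
    print(div.find("span", class_="model_year").text)
```

Because class is a multi-valued attribute, class_='list-item' already matches elements that carry other classes as well, so a substring selector is only needed when the fragment is part of a longer class name.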

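However, if you look at the page's source you will see that the content you are trying to scrape is generated by Javascript, so you have three options:

  1. Use Selenium with a headless browser to render the Javascript.
  2. Find the Ajax calls and try to simulate them; for example, the website retrieves its data through an Ajax call (Ajax URL).
  3. Look for the data you are trying to scrape inside a script tag, as shown below.

I prefer the third option in cases like this, because you end up parsing JSON: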

import json
import requests
from bs4 import BeautifulSoup
URL = 'https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760'

page = requests.get(URL, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"})
soup = BeautifulSoup(page.text, 'html.parser')
json_obj = soup.find('script',{'type':"application/ld+json"}).text
#{"@context":"http://schema.org","@graph":[{"@type":"Brand","name":""},{"@type":"OfferCatalog","itemListElement":[{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€114,900.00","url":"https://preowned.astonmartin.com/preowned-cars/12984-aston-martin-v12-vantage-v8-volante/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage V8 Volante","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2010","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}},{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€99,900.00","url":"https://preowned.astonmartin.com/preowned-cars/10359-aston-martin-v12-vantage-carbon-edition-coupe/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage Carbon Edition Coupe","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2011","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}}]},{"@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":"1","item":{"@id":"https://preowned.astonmartin.com/","name":"Homepage"}},{"@type":"ListItem","position":"2","item":{"@id":"https://preowned.astonmartin.com/preowned-cars/","name":"Pre-Owned Cars"}},{"@type":"ListItem","position":"3","item":{"@id":"//preowned.astonmartin.com/preowned-cars/search/","name":"Pre-Owned By Aston Martin"}}]}]}
items = json.loads(json_obj)['@graph'][1]['itemListElement']
for item in items:
    print(item['itemOffered']['name'])

Output:

Aston Martin V12 Vantage V8 Volante
Aston Martin V12 Vantage Carbon Edition Coupe
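
The snippet above relies on the offers being at index 1 of '@graph' and on the first ld+json script tag being the right one. As a hedged alternative, the sketch below looks the catalogue up by its '@type' instead. The function name extract_offers and the trimmed-down sample page are assumptions made for illustration; only the JSON keys are taken from the payload quoted above.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page that mimics the JSON-LD structure quoted above.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@graph": [
  {"@type": "Brand", "name": ""},
  {"@type": "OfferCatalog", "itemListElement": [
    {"@type": "Offer", "price": "€114,900.00",
     "itemOffered": {"@type": "Car", "name": "Aston Martin V12 Vantage V8 Volante"}},
    {"@type": "Offer", "price": "€99,900.00",
     "itemOffered": {"@type": "Car", "name": "Aston Martin V12 Vantage Carbon Edition Coupe"}}
  ]}
]}
</script>
</head><body></body></html>
"""

def extract_offers(page_html):
    """Collect the offers of every OfferCatalog node in any JSON-LD script tag."""
    soup = BeautifulSoup(page_html, "html.parser")
    offers = []
    for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
        if not script.string:
            continue  # empty script tag
        try:
            payload = json.loads(script.string)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing
        if not isinstance(payload, dict):
            continue
        for node in payload.get("@graph", []):
            if isinstance(node, dict) and node.get("@type") == "OfferCatalog":
                offers.extend(node.get("itemListElement", []))
    return offers

offers = extract_offers(html)
for offer in offers:
    print(offer["itemOffered"]["name"], offer["price"])
```

Selecting by '@type' keeps working if the site reorders the graph, and a page without any JSON-LD simply yields an empty list.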

【Comments】:

    【Solution 2】:

    The information is actually returned as JSON from this URL, which means you can easily extract the fields you need. For example:

    import requests
    
    url = "https://preowned.astonmartin.com/ajax/stock-listing/get-items/pageId/3760/ratio/3_2/taxBandImageLink/aHR0cHM6Ly9kMnBwMTFwZ29wNWY2cC5jbG91ZGZyb250Lm5ldC9UYXhCYW5kLSV0YXhfYmFuZCUuanBn/taxBandImageHyperlink/JWRlYWxlcl9lbWFpbCU=/imgWidth/767/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760"
    
    r = requests.get(url)
    data = r.json()
    details = ['make', 'mileage', 'model', 'model_year', 'mpg', 'exterior_colour', 'price_now']
    
    for vehicle in data['vehicles']:
        print()
        for key in details:
            print(f"{key:18} : {vehicle[key]}")
    

    This displays the following:

    make               : Aston Martin
    mileage            : 42,000 km
    model              : V12 Vantage
    model_year         : 2011
    mpg                : 17.3
    exterior_colour    : Carbon Black
    price_now          : €114,900
    
    make               : Aston Martin
    mileage            : 42,000 km
    model              : V12 Vantage
    model_year         : 2011
    mpg                : 17.3
    exterior_colour    : Carbon Black
    price_now          : €99,900
    

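    Note: if no data is returned, you may need to add a User-Agent request header. If you print data, you can see all of the information available for each vehicle.

    This approach avoids the need for Javascript processing via Selenium, and also avoids the need to parse any HTML with BeautifulSoup. The URL was found using the browser's network tools while the page was loading.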
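
A minimal sketch of the two precautions mentioned in the note, under stated assumptions: the helper names fetch_vehicles and format_vehicle, the User-Agent string, and the sample record are all hypothetical. Only the 'vehicles' key and the field names come from the answer above. The demonstration at the bottom runs offline and does not contact the site.

```python
import requests

# Browser-like User-Agent; the exact string is an arbitrary example.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_vehicles(url, timeout=10):
    """Request the JSON endpoint and return its 'vehicles' list (empty if absent)."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json().get("vehicles", [])

def format_vehicle(vehicle, keys):
    """Build aligned 'key : value' lines, using 'n/a' for keys the record lacks."""
    return [f"{key:18} : {vehicle.get(key, 'n/a')}" for key in keys]

# Offline demonstration with a hypothetical record that has no 'mpg' field.
sample = {"make": "Aston Martin", "model": "V12 Vantage", "price_now": "€99,900"}
lines = format_vehicle(sample, ["make", "model", "mpg", "price_now"])
print("\n".join(lines))
```

Using vehicle.get(key, 'n/a') rather than vehicle[key] means a listing that lacks one field does not abort the whole loop with a KeyError.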

    【Comments】:

    • Thanks for the great answer!