How to scrape the page inside a result card using Bs4? (Answers)

【Question title】: How to scrape the page inside the result card using Bs4?
【Posted】: 2022-01-17 15:35:57
【Question description】:

<img class="no-img" data-src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium" alt="Biryani By Kilo" data-gatype="RestaurantImageClick" data-url="/delhi/biryani-by-kilo-connaught-place-central-delhi-40178" data-w-onclick="cardClickHandler" src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium">

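Page URL - https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p=1

This page contains a number of restaurant cards. While scraping the listing pages in a loop, I want to follow each restaurant card's URL, which is stored in the data-url attribute shown in the HTML above, and scrape the number of reviews from that restaurant's page. I don't know how to do this. My current code for scraping the listing pages is: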

import re

import requests
from bs4 import BeautifulSoup

def extract(page):
    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website 
    header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent
    r = requests.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup): # function to scrape the page
    divs = soup.find_all('div', class_ = 'restnt-card restaurant')
    for item in divs:
        title = item.find('a').text.strip() # restaurant name
        loc = item.find('div', class_ = 'restnt-loc ellipsis').text.strip() # restaurant location
        try:  # some restaurants are unrated, so the rating element may be missing
            rating = item.find('div', class_="img-wrap").text
            rating = re.sub("[^0-9,.]", "", rating)
        except AttributeError:  # find() returned None for an unrated restaurant
            rating = None
        price_text = item.find('span', class_="double-line-ellipsis").text.strip() # price for biryani
        price = re.sub("[^0-9]", "", price_text)[:-1]

        biry_del = {
            'name': title,
            'location': loc,
            'rating': rating,
            'price': price
        }
        rest_list.append(biry_del)

        
rest_list = []

for i in range(1,18):
    print(f'getting page, {i}')
    c = extract(i)
    transform(c)

I hope this makes sense; if anything is unclear, please ask in the comments.

【Question discussion】:

I don't know why, but for all 21 restaurants on the first page I get a rating of 4.3 based on 232 votes...
No, that is not what I see here.

Tags: python web-scraping beautifulsoup

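【Solution 1】:

It is not very fast, but if you hit this backend API endpoint, it looks like you can get all the details you want, including the review count (not 232!): https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main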

import requests
from bs4 import BeautifulSoup
import pandas as pd

rest_list = []
for page in range(1,3):
    print(f'getting page, {page}')

    s = requests.Session()

    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website
    header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent
    r = s.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')

    divs = soup.find_all('div', class_ = 'restnt-card restaurant')

    for item in divs:
        code = item.find('a')['href'].split('-')[-1] # restaurant code
        print(f'Getting details for {code}')
        data = s.get(f'https://www.dineout.co.in/get_rdp_data_main/delhi/{code}/restaurant_detail_main').json()

        info = data['header']
        info.pop('share', None)  # drop fields not wanted in the CSV; None avoids a KeyError if absent
        info.pop('options', None)
        rest_list.append(info)

df = pd.DataFrame(rest_list)
df.to_csv('dehli_rest.csv',index=False)
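
The restaurant code used above is the number at the end of each card's URL. As a rough sketch of that step in isolation, the snippet below reads the data-url attribute from a trimmed copy of the img tag in the question and derives both the full detail-page URL and the code. The names BASE_URL, CARD_HTML and card_links are made up for this illustration, and it assumes every card URL ends in "-<code>" as the one in the question does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.dineout.co.in"  # assumed site root for relative card URLs

# Trimmed copy of the <img> tag from the question.
CARD_HTML = """
<img class="no-img" alt="Biryani By Kilo"
     data-url="/delhi/biryani-by-kilo-connaught-place-central-delhi-40178">
"""


def card_links(html):
    """Return (detail_page_url, restaurant_code) for every tag that has data-url."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all(attrs={"data-url": True}):
        path = tag["data-url"]
        # Assumption: the code is whatever follows the last hyphen in the path.
        code = path.rsplit("-", 1)[-1]
        links.append((urljoin(BASE_URL, path), code))
    return links


links = card_links(CARD_HTML)
print(links)
```

The code from each pair can be fed into the get_rdp_data_main endpoint, while the full URL is what you would request if you preferred to parse the restaurant's HTML page instead.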

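【Discussion】:

Can you tell me how to do the same thing on other websites?
Open the developer tools in your browser, go to Network, then Fetch/XHR, and reload the site. Click around and watch what happens: you should see network requests going to a backend API to fetch the data. Not every website uses this technique, but most sites have something going on that can be recreated like this. In this example you need the right cookies to make requests to that endpoint, so I had to use requests.Session() to help with that.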
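
The cookie point can be sketched as a small reusable pattern: load the normal listing page once so the session collects whatever cookies the site sets, then call the JSON endpoint through the same session. The function names, the shortened User-Agent and the timeout values below are assumptions for illustration only, and whether the endpoint still behaves this way is not guaranteed.

```python
import requests

LISTING_URL = "https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p=1"
DETAIL_URL = "https://www.dineout.co.in/get_rdp_data_main/delhi/{code}/restaurant_detail_main"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder; the answer uses a full browser string


def fetch_detail(session, code):
    """Request the JSON detail endpoint for one restaurant code via the given session."""
    response = session.get(DETAIL_URL.format(code=code), headers=HEADERS, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def scrape_details(codes):
    """Visit the listing page first so the session holds its cookies, then fetch each code."""
    with requests.Session() as session:
        session.get(LISTING_URL, headers=HEADERS, timeout=30).raise_for_status()
        return [fetch_detail(session, code) for code in codes]


# Example call (performs real network requests, so it is left commented out):
# details = scrape_details(["69676"])
```

Keeping the session as a parameter of fetch_detail makes it easy to swap in a different site's listing and endpoint URLs once you have found them in the Network tab.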