【Question title】: Using API Endpoint to Scrape
【Posted】: 2021-03-23 20:50:04
【Question】:

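This website froze after I used Selenium to auto-scroll and load a few thousand results. Is there a way to use an API endpoint to scrape (1) the name of each wine along with its (2) rating, (3) price, and (4) the type of grape used? Thanks!

The code below only gets wines matching these 4 criteria, and only from one country. Is there a way to adjust it so that it returns wines of grape type 131 (the code for the grape type called 'bobal') from all countries?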

import requests
import math
import pandas as pd

s = requests.Session()
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
s.get('https://www.vivino.com/', headers=headers)

cookies = s.cookies.get_dict()

cookieStr = ''
for k,v in cookies.items():
    cookieStr += k+'='+v+';'

url = 'https://www.vivino.com/api/explore/explore'
payload = {
'grape_ids[]':'131',
'grape_filter': 'varietal',
'min_rating': '1',
'order_by': 'discount_percent',
'order': 'desc',
'page': '1',
'per_page': '100',
'price_range_max': '40',
'price_range_min': '5'}

headers.update({'cookie':cookieStr})

jsonData = requests.get(url, params=payload, headers=headers).json()
total_pages = math.ceil(jsonData['explore_vintage']['records_matched'] / 100)

rows = []
for page in range(1,total_pages+1):
    if page != 1:
        payload.update({'page':page})
        jsonData = requests.get(url, params=payload, headers=headers).json()
    for each in jsonData['explore_vintage']['records']:
        name = each['vintage']['name']
        rating =  each['vintage']['statistics']['ratings_average']
        price = each['price']['amount']
        
        row = {'name':name, 'rating':rating, 'price':price}
        rows.append(row)
    print('Acquired page: %s' %page)

df = pd.DataFrame(rows)
display(df)
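
The loop above hard-codes 100 twice: once as `per_page` in the payload and once as the divisor in `math.ceil`. If one changes without the other, pages are skipped or requested needlessly. Here is a minimal sketch of deriving both from the same value. `page_count` and `page_payloads` are hypothetical helper names (not part of any Vivino API), and the payload keys are the ones used in the question.

```python
import math


def page_count(records_matched, per_page):
    """Number of pages needed to cover records_matched results."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(records_matched / per_page)


def page_payloads(base_payload, records_matched):
    """Yield one query payload per page, reusing per_page from the payload."""
    per_page = int(base_payload["per_page"])
    for page in range(1, page_count(records_matched, per_page) + 1):
        payload = dict(base_payload)  # copy so the caller's dict is untouched
        payload["page"] = str(page)
        yield payload


if __name__ == "__main__":
    base = {"grape_ids[]": "131", "per_page": "100", "page": "1"}
    pages = list(page_payloads(base, 856))
    print(len(pages), pages[-1]["page"])
```

Each yielded dict can be passed as `params=` to the request, so the page size only has to be changed in one place.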

【Question discussion】:

    Tags: python python-3.x selenium web-scraping beautifulsoup


    【Solution 1】:

    I'm not entirely sure I figured it out, but this returns 856 wines.

    import requests
    import math
    import re
    import pandas as pd
    from bs4 import BeautifulSoup
    
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"}
    url = 'https://www.vivino.com/'
    
    # Get Cache key to get country codes and type of wines
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    script = soup.find('script', string = re.compile('var vivinoCacheKey'))
    vivinoCacheKey = str(script).split('vivinoCacheKey = ')[-1].split(';')[0].replace("'",'').strip()
    
    # Get countries
    api_url = 'https://www.vivino.com/api/countries'
    payload = {
        'cache_key':vivinoCacheKey}
    countryData = requests.get(api_url, headers=headers, params=payload).json()['countries']
    
    
    rows = []
    # Iterate through countries and wine types
    api_url = 'https://www.vivino.com/api/explore/explore'
    for country in countryData:
        payload = {
        "country_code": country['code'].upper(),
        "currency_code":country['currency']['code'],
        'grape_ids[]':'131',
        "grape_filter":"varietal",
        "min_rating":"1",
        "order_by":"ratings_count",
        "order":"desc",
        "page": '1',
        "price_range_max":"1000",
        "price_range_min":"1"}
    
        try:
            jsonData = requests.get(api_url, params=payload, headers=headers).json()
            total_pages = math.ceil(jsonData['explore_vintage']['records_matched'] / 100)
            #print('%s' %(country['code'].upper()))
            
            for page in range(1,total_pages+1):
                if page != 1:
                    # Page 1 was already fetched above; request only later pages
                    payload.update({'page':page})
                    jsonData = requests.get(api_url, params=payload, headers=headers).json()
                for each in jsonData['explore_vintage']['records']:
                    name = each['vintage']['name']
                    rating =  each['vintage']['statistics']['ratings_average']
                    price = each['price']['amount']
                    
                    row = {'name':name, 'rating':rating, 'price':price}
                    rows.append(row)
                print('Acquired page: %s - %s ' %(country['code'].upper(), page))
        except Exception:
            # Skip countries whose request or response parsing fails
            continue
    
    df = pd.DataFrame(rows)
    

    Output:

    print(df)
                                                  name  rating   price
    0                 Mustiguillo Finca Terrerazo 2017     4.2   30.83
    1              Beso de Rechenna Bobal Crianza 2016     3.6   10.16
    2       Bruno Murciano Cambio de Tercio Bobal 2019     3.8   12.70
    3                  Mustiguillo Quincha Corral 2016     4.4  106.35
    4     Finca Sandoval Signo Bobal de Manchuela 2008     3.7   48.91
    ..                                             ...     ...     ...
    851               Mustiguillo Finca Terrerazo 2016     4.1   20.88
    852                              Pasión Bobal 2017     3.8   12.00
    853  Chozas Carrascal Las 2 Ces Barrica Tinto 2012     3.3    8.00
    854               Mustiguillo Finca Terrerazo 2017     4.2   20.66
    855                           De Moya Justina 2018     3.9    6.48
    
    [856 rows x 3 columns]
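
In the output above, the same wine name appears more than once with different prices (rows 0 and 854, for example). This is possibly because each country query returns its own listing in its own currency, but the rows do not record which query they came from. Below is a sketch of a row builder that tags each row with its country and currency and tolerates a missing price. `record_to_row` is a hypothetical helper, the record layout is assumed from the fields the code above reads, and the country and currency codes in the sample are illustrative only.

```python
def record_to_row(record, country_code, currency_code):
    """Flatten one explore record into a dict, tagging where it came from."""
    vintage = record.get("vintage") or {}
    stats = vintage.get("statistics") or {}
    price = record.get("price") or {}  # assumed to be absent or None at times
    return {
        "name": vintage.get("name"),
        "rating": stats.get("ratings_average"),
        "price": price.get("amount"),
        "country": country_code,
        "currency": currency_code,
    }


if __name__ == "__main__":
    # Sample record shaped like the fields used above; values are illustrative.
    sample = {
        "vintage": {
            "name": "Mustiguillo Finca Terrerazo 2017",
            "statistics": {"ratings_average": 4.2},
        },
        "price": {"amount": 30.83},
    }
    print(record_to_row(sample, "ES", "EUR"))
    print(record_to_row({"vintage": sample["vintage"], "price": None}, "US", "USD"))
```

With the extra columns in place, repeated names can be grouped or de-duplicated per country instead of looking like conflicting prices.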
    

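    Another option: each time you select a country in the list, a new session cookie is created. I could get the first one, but it seems the only way to get a specific country is to use Selenium to simulate that selection and then grab the resulting cookie. Also, the site appears to be designed not to return a price if you set the minimum price to 0. I'm not sure why they do that.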

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import requests
    import time
    import math
    import pandas as pd
    
    url = "https://www.vivino.com/explore"
    driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
    driver.maximize_window()
    driver.get(url)
    
    # If the cookie notice pops up, click OK
    cookie_buttons = driver.find_elements(By.XPATH, '//div[contains(@class, "cookieNotice")]//button')
    if cookie_buttons:
        cookie_buttons[0].click()
    
    # Select the dropdown menu
    driver.find_element(By.XPATH, '//div[contains(@class, "simpleLabel__selectedKey")]').click()
    
    # Click on United States and wait for page to render
    driver.find_element(By.XPATH, "//a[@data-value='US']").click()
    time.sleep(5)
    
    cookies_list = driver.get_cookies()
    cookieStr = ''
    for each in cookies_list:
        cookieStr += each['name'] + '=' + each['value'] + ';'
    
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
               'cookie':cookieStr}
    rows = []
    # Query the explore API for the country selected above
    api_url = 'https://www.vivino.com/api/explore/explore'
    payload = {
        "country_code": 'US',
        "currency_code": 'USD',
        'grape_ids[]':'131',
        "grape_filter":"varietal",
        "min_rating":"1",
        "order_by":"ratings_count",
        "order":"desc",
        "page": '1',
        "price_range_max":"1000",
        "price_range_min":"1"}
    
    jsonData = requests.get(api_url, params=payload, headers=headers).json()
    total_pages = math.ceil(jsonData['explore_vintage']['records_matched'] / 100)
            
    for page in range(1,total_pages+1):
        if page != 1:
            # Page 1 was already fetched above; request only later pages
            payload.update({'page':page})
            jsonData = requests.get(api_url, params=payload, headers=headers).json()
        for each in jsonData['explore_vintage']['records']:
            name = each['vintage']['name']
            rating =  each['vintage']['statistics']['ratings_average']
            try:
                price = each['price']['amount']
            except (KeyError, TypeError):
                # No price in this record (missing key or null value)
                price = None
            
            row = {'name':name, 'rating':rating, 'price':price}
            rows.append(row)
        print('Acquired page %s of %s ' %(page, total_pages))
    
    
    df = pd.DataFrame(rows)
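
The cookie loop above concatenates `name=value;` pairs by hand, which leaves a trailing `;`. Here is a small sketch of building the same header value with `"; "` as the separator, which is the form the Cookie header normally takes. `cookie_header` is a hypothetical helper name, and the input shape (a list of dicts with `name` and `value` keys) is taken from how the loop above reads `driver.get_cookies()`.

```python
def cookie_header(cookies):
    """Join Selenium-style cookie dicts into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


if __name__ == "__main__":
    # Shape matches what the loop above reads; the values are made up.
    cookies_list = [
        {"name": "first_cookie", "value": "abc"},
        {"name": "second_cookie", "value": "123"},
    ]
    headers = {"cookie": cookie_header(cookies_list)}
    print(headers)
```

The resulting `headers` dict can be merged with the `User-Agent` entry and passed to `requests.get` exactly as in the answer.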
    

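    【Discussion】:

    • Hmm. Yes. Man, this site is killing me. I'm determined to crack it, though.
    • I wonder: if you set the currency code to USD, does everything come back in US dollars? "currency_code":'USD',
    【Solution 2】: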

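    At this link you can see 546349 wines. You can run your request once per country with a time.sleep between calls, then re-send the request with a different parameter: Germany, Argentina, Australia, Chile, Spain, United States, France, Italy, Portugal, Austria.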

    time.sleep(2.4)
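
One way to read the suggestion above is a loop that issues one request per country and pauses between calls. The sketch below shows only that throttling pattern. `fetch_per_country` and its `fetch` callback are hypothetical names, the country codes in the demo are illustrative, and the real request function (for example one wrapping `requests.get` with a `country_code` parameter) is left to the caller.

```python
import time


def fetch_per_country(country_codes, fetch, delay=2.4, sleep=time.sleep):
    """Call fetch(code) for each country, pausing between consecutive calls."""
    results = {}
    for i, code in enumerate(country_codes):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results[code] = fetch(code)
    return results


if __name__ == "__main__":
    # Stand-in for a real request function; it returns a fake payload.
    def fake_fetch(code):
        return {"country_code": code}

    print(fetch_per_country(["DE", "AR", "AU"], fake_fetch, delay=0.01))
```

Passing `sleep` as a parameter keeps the delay easy to shorten or stub out when trying the loop without hitting the site.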
    

    【Discussion】:
