【Question title】:Python Web Scraping / Beautiful Soup, with list of keywords at the end of URL
【Posted】:2020-10-07 23:20:45
【Question】:

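I'm trying to build a web scraper to collect wine reviews from Vivino.com. I have a long list of wines and want to search for each one with

url = ("https://www.vivino.com/search/wines?q=")

then loop through the list, scraping the rating text ("4.5 - 203 reviews"), the wine name, and the link to its page.

I found a 20-line example for building a web scraper at https://www.kashifaziz.me/web-scraping-python-beautifulsoup.html/ and tried to combine it with the following:
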
import requests

url = ("https://www.vivino.com/search/wines?q=")

# list of keywords (made by splitting the input on whitespace)
keywords = input().split()

#go through the keywords
for key in keywords :

   #everything else is same logic
   r = requests.get(url + key)

   print("URL :", url+key)
   if 'The specified profile could not be found.' in r.text:
        print("This is available")
   else :
        print('\nSorry that one is taken')

Also, where should I put the list of keywords?

I'd appreciate any help with this! I'm trying to get better at Python, but I'm not sure I'm at this level yet, lol.

Thanks.

【Comments】:

    Tags: python web-scraping


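    【Answer 1】:

    This script goes through all result pages for the chosen keyword and extracts the title, price, average rating, number of ratings, and link for each wine: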

    import re
    import requests
    from time import sleep
    from bs4 import BeautifulSoup
    
    url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
    prices_url = 'https://www.vivino.com/prices'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    
    def get_wines(kw):
        with requests.session() as s:
            page = 1
            while True:
                soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
    
                if not soup.select('.default-wine-card'):
                    break
    
                params = {'vintages[]': [wc['data-vintage'] for wc in soup.select('.default-wine-card')]}
                prices_js = s.get(prices_url, params=params, headers={
                    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
                    'X-Requested-With': 'XMLHttpRequest',
                    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01'
                    }).text
    
                wine_prices = dict(re.findall(r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);", prices_js))
    
                for wine_card in soup.select('.default-wine-card'):
                    title = wine_card.select_one('.header-smaller').get_text(strip=True, separator=' ')
                    price = wine_prices.get(wine_card['data-vintage'], '-')
    
                    average = wine_card.select_one('.average__number')
                    average = average.get_text(strip=True) if average else '-'
    
                    ratings = wine_card.select_one('.text-micro')
                    ratings = ratings.get_text(strip=True) if ratings else '-'
    
                    link = 'https://www.vivino.com' + wine_card.a['href']
    
                    yield title, price, average, ratings, link
    
                sleep(3)
                page +=1
    
    kw = 'angel'
    for title, price, average, ratings, link in get_wines(kw):
        print(title)
        print(price)
        print(average + ' / ' + ratings)
        print(link)
        print('-' * 80)
    

    Prints:

    Angél ica Zapata Malbec Alta
    -
    4,4 / 61369 ratings
    https://www.vivino.com/wines/1469874
    --------------------------------------------------------------------------------
    Château d'Esclans Whispering Angel Rosé
    16,66
    4,1 / 38949 ratings
    https://www.vivino.com/wines/1473981
    --------------------------------------------------------------------------------
    Angél ica Zapata Cabernet Sauvignon Alta
    -
    4,3 / 27699 ratings
    https://www.vivino.com/wines/1471376
    --------------------------------------------------------------------------------
    
    ... and so on.
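
To see what the `wine_prices` regex in the script captures, here is a minimal offline sketch. The sample JavaScript string and the vintage ids in it are made up for illustration; the real response from the prices endpoint may look different, and the helper name `parse_prices` is not part of the answer.

```python
import re

# Same pattern as in get_wines(): group 1 is the vintage id, group 2 the price text.
PRICE_RE = re.compile(
    r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);"
)

# Hypothetical sample of the JavaScript returned by the prices request
# (an assumption for illustration, not captured from the site).
sample_js = (
    "$('.vintage-price-id-111').find( '.wine-price-value' ).text( '16,66' );\n"
    "$('.vintage-price-id-222').find( '.wine-price-value' ).text( '25,00' );\n"
)


def parse_prices(js_text):
    """Return a dict mapping vintage id (str) to price text (str)."""
    return dict(PRICE_RE.findall(js_text))


wine_prices = parse_prices(sample_js)
print(wine_prices.get('111', '-'))  # price found for this id
print(wine_prices.get('999', '-'))  # unknown id falls back to '-'
```

This is why some wines in the output show `-` as the price: their `data-vintage` id simply has no entry in the dictionary.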
    

    EDIT: To select only one wine, you can put the keywords in a list and loop over each wine:

    import re
    import requests
    from time import sleep
    from bs4 import BeautifulSoup
    
    url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
    prices_url = 'https://www.vivino.com/prices'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    
    def get_wines(kw):
        with requests.session() as s:
            page = 1
            while True:
                soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
    
                if not soup.select('.default-wine-card'):
                    break
    
                params = {'vintages[]': [wc['data-vintage'] for wc in soup.select('.default-wine-card')]}
                prices_js = s.get(prices_url, params=params, headers={
                    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
                    'X-Requested-With': 'XMLHttpRequest',
                    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01'
                    }).text
    
                wine_prices = dict(re.findall(r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);", prices_js))
    
                no = 1
                for no, wine_card in enumerate(soup.select('.default-wine-card'), 1):
                    title = wine_card.select_one('.header-smaller').get_text(strip=True, separator=' ')
                    price = wine_prices.get(wine_card['data-vintage'], '-')
    
                    average = wine_card.select_one('.average__number')
                    average = average.get_text(strip=True) if average else '-'
    
                    ratings = wine_card.select_one('.text-micro')
                    ratings = ratings.get_text(strip=True) if ratings else '-'
    
                    link = 'https://www.vivino.com' + wine_card.a['href']
    
                    yield title, price, average, ratings, link
    
                # if no < 20:
                #     break
    
                # sleep(3)
                page +=1
    
    wines = ['10 SPAN VINEYARDS CABERNET SAUVIGNON CENTRAL COAST',
             '10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST']
    
    for wine in wines:
        for title, price, average, ratings, link in get_wines(wine):
            print(title)
            print(price)
            print(average + ' / ' + ratings)
            print(link)
            print('-' * 80)
    

    Prints:

    10 Span Vineyards Cabernet Sauvignon
    -
    3,7 / 557 ratings
    https://www.vivino.com/wines/4535453
    --------------------------------------------------------------------------------
    10 Span Vineyards Chardonnay
    -
    3,7 / 150 ratings
    https://www.vivino.com/wines/5815131
    --------------------------------------------------------------------------------
    

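    【Comments】:

    • Hey, thank you, this is awesome. What would I change to get only the first result for each keyword? I tested with kw = '10 SPAN VINEYARDS CABERNET SAUVIGNON CENTRAL COAST,10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST' and got what I wanted, but then it keeps going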
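
One possible way to keep only the first result per keyword, sketched under the assumption that `get_wines()` stays a generator as in the answer: take a single item with `next()` and stop. The names `first_result` and `fake_get_wines` are hypothetical; `fake_get_wines` is an offline stand-in with made-up data so the sketch runs without network access.

```python
def first_result(results, default=None):
    """Return the first item of an iterable, or `default` if it yields nothing."""
    return next(iter(results), default)


# Offline stand-in for get_wines(); the tuples are placeholder data.
def fake_get_wines(kw):
    yield (kw.title(), '-', '3,7', '557 ratings', 'https://www.vivino.com/wines/0')
    yield ('Some other match', '-', '-', '-', 'https://www.vivino.com/wines/1')


# Each wine is its own list element, not one comma-separated string.
wines = ['10 SPAN VINEYARDS CABERNET SAUVIGNON CENTRAL COAST',
         '10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST']

for wine in wines:
    hit = first_result(fake_get_wines(wine))
    if hit is None:
        print('No result for', wine)
        continue
    title, price, average, ratings, link = hit
    print(title, '|', average, '/', ratings, '|', link)
```

With the real `get_wines(wine)` in place of the stand-in, the generator is only advanced once, so the loop over further result pages never runs for that keyword.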
    【Answer 2】:
    import requests
    #list having the keywords (made by splitting input with space as its delimiter) 
    keywords = input().split()
    
    #go through the keywords
    for key in keywords :
       url = "https://www.vivino.com/search/wines?q={}".format(key)
       #everything else is same logic
       r = requests.get(url)
    
       print("URL :", url)
       if 'The specified profile could not be found.' in r.text:
            print("This is available")
       else :
            print('\nSorry that one is taken')
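
Note that `input().split()` breaks a multi-word wine name into separate words, so each word is searched on its own. If whole wine names are the keywords, they need to be URL-encoded before being appended to the query string. A minimal sketch using the standard library; the helper name `build_search_url` is an assumption for illustration:

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.vivino.com/search/wines?q="


def build_search_url(keyword):
    """Append a URL-encoded keyword (spaces become '+') to the search URL."""
    return BASE_URL + quote_plus(keyword)


print(build_search_url("10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST"))
# https://www.vivino.com/search/wines?q=10+SPAN+VINEYARDS+CHARDONNAY+CENTRAL+COAST
```

Alternatively, `requests.get("https://www.vivino.com/search/wines", params={"q": keyword})` lets `requests` do the encoding.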
    

    For the list of keywords, you can use a text file with one keyword per line.
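
A minimal sketch of reading such a file, assuming UTF-8 text with one wine name per line. The function name `load_keywords` and the file name `wines.txt` are hypothetical; the demo writes a throwaway file so that it runs on its own.

```python
import tempfile
from pathlib import Path


def load_keywords(path):
    """Read one keyword per line, skipping blank lines and trimming whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo with a temporary file; in practice, point load_keywords() at your own list.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "wines.txt"
    sample.write_text(
        "10 SPAN VINEYARDS CABERNET SAUVIGNON CENTRAL COAST\n"
        "\n"
        "  10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST  \n",
        encoding="utf-8",
    )
    keywords = load_keywords(sample)

print(keywords)
```

The resulting list can replace `input().split()` in the loop, so each full wine name is searched as a single keyword.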

    【Comments】:
