【Title】: Data Scraping from Vivino.com - wine information and reviews
【Posted】: 2021-09-22 15:31:54
【Question】:

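I need to collect data for my master's thesis. I would like to collect it from Vivino.com, but I have no web-scraping experience. I have already seen some questions about this, but I want to collect all the information about each wine (name, country, rating, description, price, etc.) together with its reviews.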

import requests
import pandas as pd

r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params = {
        "country_code": "FR",
        "country_codes[]":"pt",
        "currency_code":"EUR",
        "grape_filter":"varietal",
        "min_rating":"1",
        "order_by":"price",
        "order":"asc",
        "page": 1,
        "price_range_max":"500",
        "price_range_min":"0",
        "wine_type_ids[]":"1"
    },
    headers= {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    }
)
results = [
    (
        t["vintage"]["wine"]["winery"]["name"], 
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"]
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(results,columns=['Winery','Wine','Rating','num_review'])

print(dataframe)

With this code I can collect ['Winery', 'Wine', 'Rating', 'num_review'].
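The list comprehension above indexes straight into the nested JSON, so a single missing or null field would raise an exception and lose the whole page. A minimal sketch of a more tolerant lookup follows. The helper names `dig` and `match_to_row` are hypothetical, and the sample records are made up to mirror only the keys used above; they are not real Vivino data.

```python
from typing import Any


def dig(obj: Any, *keys: str, default: Any = None) -> Any:
    """Follow keys through nested dicts; return default if any step is missing or null."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return default if obj is None else obj


def match_to_row(match: dict) -> tuple:
    """Build one (Winery, Wine, Rating, num_review) row from an explore match."""
    vintage = match.get("vintage") or {}
    name = dig(vintage, "wine", "name", default="")
    year = dig(vintage, "year", default="")
    return (
        dig(vintage, "wine", "winery", "name"),
        f"{name} {year}".strip(),
        dig(vintage, "statistics", "ratings_average"),
        dig(vintage, "statistics", "ratings_count"),
    )


# Made-up sample records (assumption: same shape as the matches used above).
sample_matches = [
    {"vintage": {"year": 2017,
                 "wine": {"name": "Pauillac", "winery": {"name": "Dauprat"}},
                 "statistics": {"ratings_average": 3.8, "ratings_count": 120}}},
    # Second record has a null winery and empty statistics.
    {"vintage": {"year": 2019,
                 "wine": {"name": "Tinto", "winery": None},
                 "statistics": {}}},
]

rows = [match_to_row(m) for m in sample_matches]
for row in rows:
    print(row)
```

Other fields (price, country, wine type) can be added as further `dig(...)` calls once their key paths are known; printing one match with `json.dumps(match, indent=4)` is a simple way to find them.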

With the following code I can collect the reviews:

import re
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}


url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
api_url = (
    "https://www.vivino.com/api/wines/{id}/reviews?per_page=9999&year={year}"
) # <-- increased the number of reviews to 9999

id_ = re.search(r"/(\d{5,})", url).group(1)
year = re.search(r"year=(\d+)", url).group(1)

data = requests.get(api_url.format(id=id_, year=year), headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["reviews"]:
    print(r["note"])
    print("-" * 80)
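The ID and year extraction in the snippet above can be wrapped in a small helper so it fails with a clear message instead of an `AttributeError` when a URL has an unexpected shape. This is only a sketch: `parse_wine_url` is a hypothetical name, and it assumes wine pages always follow the `/w/<id>?year=<year>` pattern seen in the example URL.

```python
import re
from urllib.parse import parse_qs, urlparse


def parse_wine_url(url: str) -> tuple:
    """Return (wine_id, year) as strings; year is None if the URL has no year."""
    parts = urlparse(url)
    # Assumption: the wine id is the number that follows "/w/" in the path.
    id_match = re.search(r"/w/(\d+)", parts.path)
    if id_match is None:
        raise ValueError(f"no wine id found in {url!r}")
    years = parse_qs(parts.query).get("year")
    return id_match.group(1), (years[0] if years else None)


wine_id, year = parse_wine_url(
    "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
)
print(wine_id, year)
```

Reading the year from the parsed query string avoids matching the text `year=` by accident somewhere else in the URL.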

Can someone help me combine all of this, so that I get all the wine information together with the corresponding reviews?

Thank you in advance!!

【Comments】:

    Tags: python pandas web-scraping web-scraping-language


    【Solution 1】:

    To get all reviews for the wines in the first dataframe, you can use the following example:

    import requests
    import pandas as pd
    
    
    def get_wine_data(wine_id, year, page):
        headers = {
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
        }
    
        api_url = "https://www.vivino.com/api/wines/{id}/reviews?per_page=50&year={year}&page={page}"  # 50 reviews per request; the page parameter walks through the rest
    
        data = requests.get(
            api_url.format(id=wine_id, year=year, page=page), headers=headers
        ).json()
    
        return data
    
    
    r = requests.get(
        "https://www.vivino.com/api/explore/explore",
        params={
            "country_code": "FR",
            "country_codes[]": "pt",
            "currency_code": "EUR",
            "grape_filter": "varietal",
            "min_rating": "1",
            "order_by": "price",
            "order": "asc",
            "page": 1,
            "price_range_max": "500",
            "price_range_min": "0",
            "wine_type_ids[]": "1",
        },
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
        },
    )
    
    results = [
        (
            t["vintage"]["wine"]["winery"]["name"],
            t["vintage"]["year"],
            t["vintage"]["wine"]["id"],
            f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
            t["vintage"]["statistics"]["ratings_average"],
            t["vintage"]["statistics"]["ratings_count"],
        )
        for t in r.json()["explore_vintage"]["matches"]
    ]
    dataframe = pd.DataFrame(
        results,
        columns=["Winery", "Year", "Wine ID", "Wine", "Rating", "num_review"],
    )
    
    ratings = []
    for _, row in dataframe.iterrows():
        page = 1
        while True:
            print(
                f'Getting info about wine {row["Wine ID"]}-{row["Year"]} Page {page}'
            )
    
            d = get_wine_data(row["Wine ID"], row["Year"], page)
    
            if not d["reviews"]:
                break
    
            # "review" avoids shadowing the response object r created above
            for review in d["reviews"]:
                ratings.append(
                    [
                        row["Year"],
                        row["Wine ID"],
                        review["rating"],
                        review["note"],
                        review["created_at"],
                    ]
                )
    
            page += 1
    
    ratings = pd.DataFrame(
        ratings, columns=["Year", "Wine ID", "User Rating", "Note", "CreatedAt"]
    )
    
    df_out = ratings.merge(dataframe)
    df_out.to_csv("data.csv", index=False)
    

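    This creates data.csv with about 40,000 reviews. ratings.merge(dataframe) joins the two frames on the columns they share (Year and Wine ID), so every review row also carries the details of its wine.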
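The page-by-page loop in the answer can also be written as a generator that takes the fetch function as an argument, which makes it possible to try the stopping logic offline before sending any requests. The sketch below is an illustration under assumptions: `iter_reviews` and `fake_fetch` are hypothetical names, the `delay` and `max_pages` parameters are additions that are not in the answer, and the fake data is made up.

```python
import time
from typing import Callable, Iterator


def iter_reviews(
    fetch_page: Callable[[int, int, int], dict],
    wine_id: int,
    year: int,
    delay: float = 0.0,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield one flat record per review until a page comes back empty."""
    for page in range(1, max_pages + 1):
        data = fetch_page(wine_id, year, page)
        reviews = data.get("reviews") or []
        if not reviews:
            return  # an empty page means there is nothing left
        for review in reviews:
            yield {
                "Year": year,
                "Wine ID": wine_id,
                "User Rating": review.get("rating"),
                "Note": review.get("note"),
                "CreatedAt": review.get("created_at"),
            }
        if delay:
            time.sleep(delay)  # optional pause between requests


def fake_fetch(wine_id: int, year: int, page: int) -> dict:
    """Stand-in for a real request; returns made-up data for two pages."""
    pages = {
        1: [{"rating": 4.0, "note": "Great", "created_at": "2021-01-01"},
            {"rating": 3.5, "note": "Too young", "created_at": "2021-02-01"}],
        2: [{"rating": 4.5, "note": "Good value", "created_at": "2021-03-01"}],
    }
    return {"reviews": pages.get(page, [])}


collected = list(iter_reviews(fake_fetch, 3823873, 2017))
print(len(collected), "reviews collected")
```

Because `get_wine_data(wine_id, year, page)` in the answer takes the same three arguments, it could be passed in place of `fake_fetch`, and the resulting list of dicts could go straight into `pd.DataFrame(...)` with the same column names the answer uses.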

    【Comments】:

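    • Thank you very much for your help!! Amazing! I have one more question: I would like to add the price of the wine, the country or region of the winery, the wine type (red, white, etc.) and, if possible, the characteristics of each wine. How can I add that to the code? And is there an easy way to get only reviews written in English?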
    • Did you ever figure this out, @BMoeskops?