Data Scraping from Vivino.com - wine information and reviews: answers

【Question title】: Data Scraping from Vivino.com - wine information and reviews
【Posted】: 2021-09-22 15:31:54
【Question】:

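I need to collect data for my master's thesis. I would like to collect it from Vivino.com, but I have no experience with web scraping. I have seen a few questions about this already, but I want to collect all the information about each wine (name, country, rating, description, price, etc.) as well as its reviews.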

import requests
import pandas as pd

r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params = {
        "country_code": "FR",
        "country_codes[]":"pt",
        "currency_code":"EUR",
        "grape_filter":"varietal",
        "min_rating":"1",
        "order_by":"price",
        "order":"asc",
        "page": 1,
        "price_range_max":"500",
        "price_range_min":"0",
        "wine_type_ids[]":"1"
    },
    headers= {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    }
)
results = [
    (
        t["vintage"]["wine"]["winery"]["name"], 
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"]
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(results,columns=['Winery','Wine','Rating','num_review'])

print(dataframe)

With this code I can collect the columns ['Winery', 'Wine', 'Rating', 'num_review'].

With the following code I can collect the reviews:

import re
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}


url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
api_url = (
    "https://www.vivino.com/api/wines/{id}/reviews?per_page=9999&year={year}"
) # <-- increased the number of reviews to 9999

id_ = re.search(r"/(\d{5,})", url).group(1)
year = re.search(r"year=(\d+)", url).group(1)

data = requests.get(api_url.format(id=id_, year=year), headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["reviews"]:
    print(r["note"])
    print("-" * 80)
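
The two regular expressions above can be wrapped in a small helper, which makes the ID and year extraction easier to reuse and to check. This is only a sketch: `parse_wine_url` is a hypothetical name, and it assumes the wine page URL always has a `/w/<digits>` path segment and a numeric `year` query parameter, as the example URL does.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_wine_url(url):
    """Return (wine_id, year) from a wine page URL.

    Assumes a "/w/<digits>" path segment and a numeric "year" query
    parameter, as in the example URL from the question.
    """
    parsed = urlparse(url)
    match = re.search(r"/w/(\d+)", parsed.path)
    if match is None:
        raise ValueError(f"no wine id found in {url!r}")
    years = parse_qs(parsed.query).get("year")
    if not years:
        raise ValueError(f"no year parameter found in {url!r}")
    return int(match.group(1)), int(years[0])


wine_id, year = parse_wine_url(
    "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
)
print(wine_id, year)  # 3823873 2017
```

Anchoring the pattern on `/w/` instead of "five or more digits anywhere" avoids picking up other numbers in the URL, such as `price_id`.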

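Can someone help me combine all of this information, so that I end up with all the wine information together with the corresponding reviews?

Thanks in advance!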

【Question comments】:

Tags: python pandas web-scraping web-scraping-language

【Solution 1】:

To get all the reviews for the wines in the first dataframe, you can use the following example:

import requests
import pandas as pd


def get_wine_data(wine_id, year, page):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
    }

    api_url = "https://www.vivino.com/api/wines/{id}/reviews?per_page=50&year={year}&page={page}"  # <-- 50 reviews per request, paged with the page parameter

    data = requests.get(
        api_url.format(id=wine_id, year=year, page=page), headers=headers
    ).json()

    return data


r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params={
        "country_code": "FR",
        "country_codes[]": "pt",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "price",
        "order": "asc",
        "page": 1,
        "price_range_max": "500",
        "price_range_min": "0",
        "wine_type_ids[]": "1",
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    },
)

results = [
    (
        t["vintage"]["wine"]["winery"]["name"],
        t["vintage"]["year"],
        t["vintage"]["wine"]["id"],
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"],
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(
    results,
    columns=["Winery", "Year", "Wine ID", "Wine", "Rating", "num_review"],
)

ratings = []
for _, row in dataframe.iterrows():
    page = 1
    while True:
        print(
            f'Getting info about wine {row["Wine ID"]}-{row["Year"]} Page {page}'
        )

        d = get_wine_data(row["Wine ID"], row["Year"], page)

        if not d["reviews"]:
            break

        # Use a distinct loop variable so the explore response `r` is not shadowed.
        for review in d["reviews"]:
            ratings.append(
                [
                    row["Year"],
                    row["Wine ID"],
                    review["rating"],
                    review["note"],
                    review["created_at"],
                ]
            )

        page += 1

ratings = pd.DataFrame(
    ratings, columns=["Year", "Wine ID", "User Rating", "Note", "CreatedAt"]
)

# Join each review to its wine row on the two shared key columns.
df_out = ratings.merge(dataframe, on=["Year", "Wine ID"])
df_out.to_csv("data.csv", index=False)

This creates data.csv (about 40,000 reviews).
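
The paging loop in the answer can also be written as a standalone function that takes the page fetcher as an argument, which lets you try the logic without hitting the network. This is a sketch under assumptions: `collect_reviews`, `fake_fetch`, `max_pages` and `delay` are hypothetical names introduced here, and the fetcher is only assumed to return a dict with a `"reviews"` list, as `get_wine_data` does.

```python
import time


def collect_reviews(fetch_page, wine_id, year, max_pages=1000, delay=0.0):
    """Collect reviews page by page until an empty page comes back.

    fetch_page(wine_id, year, page) must return a dict with a "reviews"
    list. max_pages and delay are illustrative safeguards, not API limits.
    """
    rows = []
    for page in range(1, max_pages + 1):
        data = fetch_page(wine_id, year, page)
        reviews = data.get("reviews") or []
        if not reviews:
            break  # an empty page means there is nothing left to fetch
        for review in reviews:
            rows.append(
                {
                    "Year": year,
                    "Wine ID": wine_id,
                    "User Rating": review.get("rating"),
                    "Note": review.get("note"),
                    "CreatedAt": review.get("created_at"),
                }
            )
        if delay:
            time.sleep(delay)  # optional pause between requests
    return rows


# Stand-in for the network call so the loop can be exercised offline.
def fake_fetch(wine_id, year, page):
    pages = {
        1: [{"rating": 4.0, "note": "Fruity", "created_at": "2021-01-01"}],
        2: [{"rating": 3.5, "note": "Dry", "created_at": "2021-02-01"}],
    }
    return {"reviews": pages.get(page, [])}


rows = collect_reviews(fake_fetch, wine_id=3823873, year=2017)
print(len(rows))  # 2
```

In the real script you would pass `get_wine_data` instead of `fake_fetch`; the list of dicts can be handed directly to `pd.DataFrame`.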

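【Comments】:

Thank you very much for your help! Amazing! I have one more question: I would also like to add the price of the wine, the country or region of the winery, the wine type (red, white, etc.) and, if possible, the characteristics of each wine. How can I add that to the code? Is there also an easy way to get all the reviews in English?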
Did you ever work out how to return those, @BMoeskops?
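
On the follow-up question about price, country, region and wine type: one possible approach is to read a few more keys from each entry in `r.json()["explore_vintage"]["matches"]`, using a lookup that tolerates missing keys. The sketch below is not a confirmed description of the API. The field paths in `EXTRA_FIELDS` are assumptions that should be checked against the real JSON (for example with `print(json.dumps(match, indent=4))`), and `dig`, `extra_columns` and the sample record are made up for illustration.

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Hypothetical field paths: verify each one against the actual response.
EXTRA_FIELDS = {
    "Price": ("price", "amount"),
    "Country": ("vintage", "wine", "region", "country", "name"),
    "Region": ("vintage", "wine", "region", "name"),
    "Type ID": ("vintage", "wine", "type_id"),
}


def extra_columns(match):
    """Build a flat dict of extra columns for one entry of "matches"."""
    return {name: dig(match, *path) for name, path in EXTRA_FIELDS.items()}


# Made-up record shaped like the assumed response, for illustration only.
sample = {
    "price": {"amount": 12.5},
    "vintage": {"wine": {"type_id": 1, "region": {"name": "Douro"}}},
}
print(extra_columns(sample))
# {'Price': 12.5, 'Country': None, 'Region': 'Douro', 'Type ID': 1}
```

Missing values come back as `None` instead of raising `KeyError`, so one incomplete wine record does not stop the whole run. The resulting dicts can be added as extra columns to the wine dataframe before the merge.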