Scrape all URLs of a webpage

【Question Title】: Scrape all URLs of a webpage
【Posted】: 2021-10-10 10:42:38
【Question】:

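I have the following URL, https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801, where the last 6 digits are the unique identifier of a particular runner. I want to find all of the 6-digit unique identifiers on this page.

I tried to scrape every URL on the page (code shown below), but unfortunately I only get a high-level summary rather than the in-depth list, which should contain more than 5000 runners. I would like to end up with a list/dataframe that shows:

etc.

This is what I have managed so far.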

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request("https://www.gbgb.org.uk//greyhound-profile//")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)

Thanks in advance for your help!

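【Comments】:

The page you are requesting (https://www.gbgb.org.uk//greyhound-profile//) does not contain any ...?greyhoundId=xxxxxx URLs.
That's strange, because gbgb.org.uk/greyhound-profile/?greyhoundId=517801 is definitely a page. Also, when I run my code it picks up all the high-level URLs, i.e. 'gbgb.org.uk/about' and 'gbgb.org.uk/welfare-care'. Any idea what I need to do to drill down to gbgb.org.uk/greyhound-profile/?greyhoundId=xxxxxx?
What is a "high-level summary"? Are you 100% sure that you are getting the real rendered page with requests?
Here is a snippet of my results in list form: 'gbgb.org.uk', 'gbgb.org.uk/about', 'gbgb.org.uk/welfare-care', 'gbgb.org.uk/racing', 'gbgb.org.uk/rules-regulation', '#search', 'gbgb.org.uk/my-kennel', 'gbgb.org.uk/about/about-us'. I need to get every gbgb.org.uk/greyhound-profile/?greyhoundId=xxxxxx, where 'xxxxxx' is the 6-digit integer unique identifier. Thanks.
Why not just try every 6-digit integer in a for loop?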
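
As a rough sketch of the ID-extraction step discussed in these comments: once a list of hrefs is available, any greyhoundId values can be pulled out with urllib.parse instead of string slicing. The function name extract_greyhound_ids and the sample list are illustrative assumptions, not part of the original post. As noted above, the page fetched in the question contains no such links, so this only helps once the links come from somewhere else.

```python
from urllib.parse import urlparse, parse_qs


def extract_greyhound_ids(hrefs):
    """Return the unique 6-digit greyhoundId values found in hrefs, in order."""
    ids = []
    for href in hrefs:
        if not href:  # link.get('href') is None for <a> tags without an href
            continue
        query = parse_qs(urlparse(href).query)
        for value in query.get("greyhoundId", []):
            if value.isdigit() and len(value) == 6 and value not in ids:
                ids.append(value)
    return ids


# Hypothetical sample: three entries that carry no ID plus the URL from the question.
sample = [
    "https://www.gbgb.org.uk/about",
    "#search",
    None,
    "https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801",
]
print(extract_greyhound_ids(sample))
```

Parsing the query string rather than taking the last six characters keeps the sketch working if the URL gains extra parameters.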

Tags: python selenium url web-scraping beautifulsoup

【Solution 1】:

You can convert the response content into a pandas dataframe and then simply use the winnerOr2ndName and winnerOr2ndId columns.

Example

import requests
import pandas as pd

COLUMNS = ["winnerOr2ndName", "winnerOr2ndId"]

def _to_frame(items):
    # Keep only the name/ID columns and drop rows where either value is missing.
    frame = pd.DataFrame(items).loc[:, COLUMNS].dropna()
    frame["winnerOr2ndId"] = frame["winnerOr2ndId"].astype(int)
    return frame

def get_items(dog_id):
    url = f"https://api.gbgb.org.uk/api/results/dog/{dog_id}?page=-1"
    params = {"page": "-1", "itemsPerPage": "20", "race_type": "race"}
    response = requests.get(url, params=params).json()
    max_pages = response["meta"]["pageCount"]
    result = _to_frame(response["items"])

    # Request the following pages until the reported page count is reached.
    while int(params["page"]) < max_pages:
        params["page"] = str(int(params["page"]) + 1)
        response = requests.get(url, params=params).json()
        result = pd.concat([result, _to_frame(response["items"])])

    return result.drop_duplicates()

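It produces a dataframe like the following:

【Comments】:

【Solution 2】:

The data is loaded dynamically from an external API URL. The following example shows how to load the data using the ID: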

import json
import requests


api_url = "https://api.gbgb.org.uk/api/results/dog/517801"  # <-- 517801 is the ID from your URL in the question
params = {"page": "1", "itemsPerPage": "20", "race_type": "race"}

page = 1
while True:
    params["page"] = page
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    if not data["items"]:
        break

    for i in data["items"]:
        print(
            "{:<30} {}".format(
                i.get("winnerOr2ndName", ""), i.get("winnerOr2ndId", "")
            )
        )

    page += 1

Prints:

Ferndale Boom                  534358
Laganore Mustang               543937
Tickity Kara                   535237
Thor                           511842
Ballyboughlewiss               519556
Beef Cakes                     551323
Distant Millie                 546674
Lissan Kels                    525148
Rosstemple Marko               534276
Happy Harry                    550042
Porthall Ella                  550841
Southlodge Eden                531677
Effernogue Beef                547416
Faydas Truffle                 528780
Johns Lass                     538763
Faydas Truffle                 528780
Toms Hero                      543659
Affane Buzz                    547555
Emkay Flyer                    531456
Ballymac Tilly                 492923
Kilcrea Duke                   542178
Sporting Sultan                541880
Droopys Poet                   542020
Shortwood Elle                 527241
Rosstemple Marko               534276
Erics Bozo                     541863
Swift Launch                   536667
Longsearch                     523017
Swift Launch                   536667
Takemyhand                     535023
Floral Print                   527192
Rustys Aero                    497270
Autumn Dapper                  519528
Droopys Kiwi                   511989
Deep Chest                     520634
Newtack Henry                  525511
Indian Nightmare               524636
Lady Mascara                   528399
Tarsna Yankee                  517373
                               
Leathems Act                   516918
Final Star                     514015
Ascot Faye                     500812
Ballymac Ernie                 503569
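
The printed list contains repeated entries (for example Faydas Truffle and Swift Launch) and one blank row. A minimal sketch of how the items could be deduplicated and mapped back to profile URLs of the form given in the question follows. The helper name unique_profiles is an assumption, the sample items are hand-copied from the output above rather than fetched, and the blank row is assumed to be an item with no name or ID.

```python
PROFILE_URL = "https://www.gbgb.org.uk/greyhound-profile/?greyhoundId={}"


def unique_profiles(items):
    """Map each distinct winnerOr2ndId to a (name, profile URL) pair."""
    seen = {}
    for item in items:
        dog_id = item.get("winnerOr2ndId")
        if not dog_id or dog_id in seen:  # skip blank rows and repeats
            continue
        seen[dog_id] = (item.get("winnerOr2ndName", ""), PROFILE_URL.format(dog_id))
    return seen


# Hand-made sample mirroring a few rows of the printed output.
sample_items = [
    {"winnerOr2ndName": "Ferndale Boom", "winnerOr2ndId": 534358},
    {"winnerOr2ndName": "Faydas Truffle", "winnerOr2ndId": 528780},
    {"winnerOr2ndName": "Faydas Truffle", "winnerOr2ndId": 528780},
    {},  # assumed shape of the blank row
]

for dog_id, (name, url) in unique_profiles(sample_items).items():
    print(dog_id, name, url)
```

In the loop from the answer, the same helper could be fed the accumulated data["items"] from every page instead of the sample list.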

【Comments】: