How can I loop through scraped URLs one by one from BeautifulSoup? (Answers)

【Question title】: How can I loop through scraped URLs one by one from BeautifulSoup?
【Posted】: 2022-12-03 13:43:39
【Question description】:

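I'm not sure whether there is a dictionary-based approach or some other method, but I'm trying to scrape all the URLs on a page, then take those URLs and parse them one by one to find the relevant data...

To find all the URLs I used...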

from bs4 import BeautifulSoup

with open("Movies.html", "r") as page:
    soup = BeautifulSoup(page, "lxml")

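# each movie sits in its own <div class="movie-item">
for movie_list in soup.find_all('div', class_='movie-item'):
    movie_id = movie_list.div.button['data-movie-id']

    # the href is site-relative, so prepend the domain
    link = movie_list.find('a')['href']
    print('https://test.com' + link)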

This gives me the output...

https://test.com/movie/the-godfather
https://test.com/movie/titanic
https://test.com/movie/interstellar
...

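After retrieving all the URLs, I'm confused about how to request them one at a time.

For example, request https://test.com/movie/the-godfather and look for the synopsis, then do the same for https://test.com/movie/titanic.

Hope you get the gist :) Thanks in advance!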

【Question comments】:

Please confirm the URL you want to scrape.
Have you tried using the requests library? pypi.org/project/requests

Tags: python web-scraping beautifulsoup

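【Solution 1】:

Below is an example of how to use Python's requests library and BeautifulSoup to scrape data from multiple URLs.

First, use the requests library to make a GET request to each URL and retrieve the page's HTML content. Then use BeautifulSoup to parse the HTML and extract the data you are interested in. Here is an example: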

import requests
from bs4 import BeautifulSoup

# define a function to scrape the data from a single URL
def scrape_data(url):
    # make a GET request to the URL; fail fast on timeouts and HTTP errors
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'lxml')

    # extract the data you are interested in from the page
    # ('p.synopsis' is a placeholder selector; adjust it to the real markup)
    synopsis = soup.find('p', class_='synopsis')
    # find() returns None when nothing matches, so guard before reading .text
    return synopsis.text if synopsis is not None else None

# define a list of URLs to scrape
urls = [
    'https://test.com/movie/the-godfather',
    'https://test.com/movie/titanic',
    'https://test.com/movie/interstellar',
]

# loop through the URLs and scrape the data from each one
for url in urls:
    synopsis = scrape_data(url)
    print(synopsis)

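This code makes a GET request to each URL in the urls list, uses BeautifulSoup to extract the synopsis from each page, and prints it to the console. You can modify the code to suit your needs and extract whatever data you are interested in.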
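The example hard-codes the URL list, whereas the question collects the links from Movies.html first. Below is a minimal sketch of one way to connect the two steps, not a definitive implementation. All function names (extract_movie_urls, parse_synopsis, fetch_html, scrape_all, main) are hypothetical, the p.synopsis selector is the same assumption the answer makes, the one-second delay is an arbitrary choice, and the standard-library html.parser is swapped in for lxml only to avoid an extra dependency.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://test.com"  # placeholder domain taken from the question


def extract_movie_urls(html, base_url=BASE_URL):
    """Return absolute URLs for the first link inside each div.movie-item."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for item in soup.find_all("div", class_="movie-item"):
        link = item.find("a", href=True)
        if link is not None:
            # urljoin handles both relative and already-absolute hrefs
            urls.append(urljoin(base_url, link["href"]))
    return urls


def parse_synopsis(html):
    """Return the synopsis text, or None if the page has no p.synopsis."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", class_="synopsis")  # assumed selector
    return tag.get_text(strip=True) if tag is not None else None


def fetch_html(url, session):
    """Download one page; raises requests.HTTPError on a 4xx/5xx status."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_all(urls, fetch, delay=1.0):
    """Visit each URL in turn and map it to its synopsis (or None)."""
    results = {}
    for url in urls:
        results[url] = parse_synopsis(fetch(url))
        time.sleep(delay)  # pause between requests to go easy on the server
    return results


def main():
    # Step 1: collect the links from the saved listing page.
    with open("Movies.html", "r") as page:
        urls = extract_movie_urls(page.read())
    # Step 2: request each collected URL, one at a time, over one session.
    with requests.Session() as session:
        results = scrape_all(urls, lambda url: fetch_html(url, session))
    for url, synopsis in results.items():
        print(url, synopsis)
```

Calling main() runs both steps against the real site. Passing the download function into scrape_all as the fetch argument keeps the loop independent of the network, so the parsing logic can be tried on saved HTML before any request is sent.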

【Comments】: