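【Question title】: How to get the value of a "hidden" href?
【Posted】: 2021-12-27 19:22:14
【Question】:

I am working on a web scraper that first collects the total number of pages. I adapted code that I had written and tested for another website, but here I am having trouble getting the link (href) to the next page.

Here is the code: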

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

userName = 'brendanm1975' # just for testing

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

pages = []

with requests.Session() as session:
    page_number = 1
    url = "https://www.last.fm/user/"+userName+"/library/artists?page="
    while True:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        pages.append(url)

        next_link = soup.find("li", class_="pagination-next")
        if next_link is None:
            break

        url = urljoin(url, next_link["href"])
        page_number += 1

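As you can see, the href on this site gives the link as "?page=2", which does not let me fetch its content (https://www.last.fm/user/brendanm1975/library/artists?page=2).

I have inspected the variables, and the values are being retrieved: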

print(url) # output: https://www.last.fm/user/brendanm1975/library/artists?page=
next_link.find('a').get('href') # output: '?page=2'

Does anyone know how to solve this?

【Comments】:

  • Maybe use their API instead.
  • What is wrong with getting the href via next_link.find('a').get('href')?

Tags: python web-scraping python-requests


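【Solution 1】:

What happens?

You call urljoin(url, next_link["href"]), but next_link has no href attribute, because you selected the <li> rather than the <a> inside it. Indexing the <li> with "href" therefore raises a KeyError.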
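
To see the difference in isolation, here is a minimal sketch. The HTML string is a simplified, hypothetical stand-in for the pagination markup (the real page will contain more), so treat it as an illustration rather than a copy of the site's source:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the pagination markup on the page.
html = '<ul><li class="pagination-next"><a href="?page=2">Next</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li", class_="pagination-next")

# The <li> carries only a class attribute, so indexing it with "href" fails.
try:
    li["href"]
    li_has_href = True
except KeyError:
    li_has_href = False

# The href lives on the nested <a>.
href = li.a["href"]
print(li_has_href, href)
```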

How to fix?

Option #1 - select the <a> inside your urljoin() call:

url = urljoin(url, next_link.a["href"])
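
Note that the relative href itself is not the obstacle: urljoin() resolves a query-only reference such as "?page=2" against the current URL, keeping the path and replacing the query string. A quick check, using the URL from the question:

```python
from urllib.parse import urljoin

# Base URL exactly as built in the question (empty page number).
base = "https://www.last.fm/user/brendanm1975/library/artists?page="

# A reference consisting only of a query string keeps the base path
# and swaps in the new query.
resolved = urljoin(base, "?page=2")
print(resolved)
```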

Option #2 - select the <a> directly:

next_link = soup.select_one('li.pagination-next a')
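
With this variant next_link is already the <a>, so next_link["href"] works as written in the question, and select_one() returns None when nothing matches, so the existing "is None" check still ends the loop. A small sketch, again with hypothetical markup and a hypothetical helper name (next_href):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one page with a next link, one without.
middle_page = '<ul><li class="pagination-next"><a href="?page=2">Next</a></li></ul>'
last_page = '<ul><li class="pagination-page">18</li></ul>'

def next_href(html):
    """Return the next page's href, or None when there is no next link."""
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("li.pagination-next a")
    return None if next_link is None else next_link["href"]

print(next_href(middle_page))
print(next_href(last_page))
```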

Example

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

userName = 'brendanm1975' # just for testing

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

pages = []  # every page URL visited, in order

with requests.Session() as session:
    # Start at page 1 and follow the "next" link until there is none.
    url = "https://www.last.fm/user/"+userName+"/library/artists?page=1"
    while True:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        pages.append(url)

        # The <li> only wraps the link; if it is missing, stop.
        next_link = soup.find("li", class_="pagination-next")
        if next_link is None:
            break

        # The href (e.g. "?page=2") is on the nested <a>;
        # urljoin resolves it against the current URL.
        url = urljoin(url, next_link.a["href"])

Output

['https://www.last.fm/user/brendanm1975/library/artists?page=1',
 'https://www.last.fm/user/brendanm1975/library/artists?page=2',
 'https://www.last.fm/user/brendanm1975/library/artists?page=3',
 'https://www.last.fm/user/brendanm1975/library/artists?page=4',
 'https://www.last.fm/user/brendanm1975/library/artists?page=5',
 'https://www.last.fm/user/brendanm1975/library/artists?page=6',
 'https://www.last.fm/user/brendanm1975/library/artists?page=7',
 'https://www.last.fm/user/brendanm1975/library/artists?page=8',
 'https://www.last.fm/user/brendanm1975/library/artists?page=9',
 'https://www.last.fm/user/brendanm1975/library/artists?page=10',
 'https://www.last.fm/user/brendanm1975/library/artists?page=11',
 'https://www.last.fm/user/brendanm1975/library/artists?page=12',
 'https://www.last.fm/user/brendanm1975/library/artists?page=13',
 'https://www.last.fm/user/brendanm1975/library/artists?page=14',
 'https://www.last.fm/user/brendanm1975/library/artists?page=15',
 'https://www.last.fm/user/brendanm1975/library/artists?page=16',
 'https://www.last.fm/user/brendanm1975/library/artists?page=17',
 'https://www.last.fm/user/brendanm1975/library/artists?page=18',...]
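
If you want to try the loop without hitting the site, one possible approach is to pass the page fetcher in as a parameter. The sketch below is only an illustration: collect_pages, fetch, max_pages, and the fake three-page site on example.com are all hypothetical names and data, not part of the original code.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_pages(start_url, fetch, max_pages=1000):
    """Follow 'next' links from start_url and return every URL visited.

    fetch is any callable mapping a URL to HTML text; max_pages is a
    safety cap so a malformed page cannot cause an endless loop.
    """
    pages = []
    url = start_url
    while url is not None and len(pages) < max_pages:
        pages.append(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        next_link = soup.select_one("li.pagination-next a")
        # Resolve the relative href, or stop when there is no next link.
        url = urljoin(url, next_link["href"]) if next_link is not None else None
    return pages

# Hypothetical three-page site used in place of real HTTP requests.
BASE = "https://example.com/library/artists"
FAKE_SITE = {
    BASE + "?page=1": '<li class="pagination-next"><a href="?page=2">Next</a></li>',
    BASE + "?page=2": '<li class="pagination-next"><a href="?page=3">Next</a></li>',
    BASE + "?page=3": "<p>last page, no next link</p>",
}

pages = collect_pages(BASE + "?page=1", FAKE_SITE.__getitem__)
print(pages)
```

Against the real site, fetch could be something like lambda u: session.get(u, headers=headers).text, using the session and headers from the example above.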
