【Posted】: 2021-12-12 06:12:20
【Question】:
# Import needed libraries
import pprint

import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titlelink')
subtext = soup.select('.subtext')


def sort_stories_by_votes(hnlist):
    # Sort the list built by create_custom_hn by votes, highest first
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)


def create_custom_hn(links, subtext):
    # Build a list of dicts (title, link, votes) from the links and subtext rows
    hn = []
    for idx, item in enumerate(links):
        title = item.getText()
        href = item.get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):  # Needed because not every item has a score
            # Take the leading number so both "1 point" and "120 points" parse
            points = int(vote[0].getText().split()[0])
            if points > 99:  # Only append stories with at least 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)


pprint.pprint(create_custom_hn(links, subtext))
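My problem is that this only covers the first page, which has just 30 stories.
How would I apply my scraping approach across several pages, say the next 10, while keeping the structure of the code above?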
【Comments】:
-
Do I need to put the whole code inside a for loop with a range of 1-20? And what about using the .format method?
-
Have you tried putting it in a loop over the range 1-20 and building the URL with the .format method? I tried it and it worked for me.
-
For example, wrap your code in
for i in range(1, 21): res = requests.get('https://news.ycombinator.com/news?p={page}'.format(page=i)), the same approach as in How can I loop scraping data for multiple pages in a website using python and beautifulsoup4
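The loop suggested in these comments could be sketched roughly as follows. This is only a sketch under a few assumptions: the helper names (fetch_page, parse_stories, scrape_pages) and the delay parameter are made up for illustration, the CSS selectors are copied from the question as-is and only work while the site's markup still matches them, and page numbers are assumed to start at 1. The stories from all pages are collected first and sorted once at the end, so the output keeps the same shape as in the question.

```python
import time

import requests
from bs4 import BeautifulSoup

# URL pattern taken from the comments; numbering is assumed to start at 1.
PAGE_URL = 'https://news.ycombinator.com/news?p={page}'


def fetch_page(page):
    """Download one listing page and return its HTML (illustrative helper)."""
    res = requests.get(PAGE_URL.format(page=page), timeout=10)
    res.raise_for_status()  # fail loudly instead of parsing an error page
    return res.text


def parse_stories(html, min_points=100):
    """Return the stories on one page that have at least min_points votes."""
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('.titlelink')  # selectors copied from the question
    subtext = soup.select('.subtext')
    stories = []
    # zip pairs each title link with its subtext row by position
    for link, sub in zip(links, subtext):
        vote = sub.select('.score')
        if vote:  # some rows carry no score at all
            # leading number only, so "1 point" and "120 points" both parse
            points = int(vote[0].getText().split()[0])
            if points >= min_points:
                stories.append({'title': link.getText(),
                                'link': link.get('href', None),
                                'votes': points})
    return stories


def scrape_pages(num_pages=10, fetch=fetch_page, delay=1.0):
    """Collect stories from pages 1..num_pages, sorted by votes (descending)."""
    hn = []
    for page in range(1, num_pages + 1):
        hn.extend(parse_stories(fetch(page)))
        if page < num_pages:
            time.sleep(delay)  # assumed courtesy pause between requests
    return sorted(hn, key=lambda k: k['votes'], reverse=True)


if __name__ == '__main__':
    import pprint
    pprint.pprint(scrape_pages(10))
```

Passing the fetch function as a parameter is a design choice for this sketch: it lets you try the parsing and sorting logic on saved HTML without hitting the site on every run.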
Tags: python-3.x web-scraping beautifulsoup python-requests