我将如何遍历每个页面并抓取特定参数？答案

【问题标题】：How would I loop through each page and scrape the specific parameters?我将如何遍历每个页面并抓取特定参数？
【发布时间】：2021-12-12 06:12:20
【问题描述】：

#Import Needed Libraries    
import requests
from bs4 import BeautifulSoup 
import pprint


res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titlelink')
subtext = soup.select('.subtext')


def sort_stories_by_votes(hnlist):  #Sorting your create_custom_hn dict by votes(if)
    return sorted(hnlist, key= lambda k:k['votes'], reverse=True)


def create_custom_hn(links, subtext): #Creates a list of links and subtext
    hn = []
    for idx, item in enumerate(links): #Need to use this because not every link has a lot of votes
        title = links[idx].getText()
        href = links[idx].get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:  #Only appends stories that are over 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)


pprint.pprint(create_custom_hn(links, subtext))

我的问题是，这只是第一页，只有 30 个故事。

我将如何通过浏览每个页面来应用我的网络抓取方法......假设接下来的 10 个页面并保留上面的格式化代码？

【问题讨论】：

我是否需要将整个代码放入一个范围为 1-20 的 for 循环中？那么使用.format方法呢？
您是否尝试过使用 .format 方法将其放入循环中，范围为 1-20？我试过了，它对我有用
例如将您的代码包装在 for i in range(20): res = requests.get('https://news.ycombinator.com/news?p={page}'.format(page=i)) 中，与 How can I loop scraping data for multiple pages in a website using python and beautifulsoup4 相同

标签： python-3.x web-scraping beautifulsoup python-requests

【解决方案1】：

每个页面的网址是这样的

https://news.ycombinator.com/news?p=<page_number>

使用for-loop 从每个页面抓取内容。请参阅下面的代码。

这是打印前两页内容的代码。您可以根据需要更改 page_no。

import requests
from bs4 import BeautifulSoup 
import pprint

def sort_stories_by_votes(hnlist):  #Sorting your create_custom_hn dict by votes(if)
    return sorted(hnlist, key= lambda k:k['votes'], reverse=True)


def create_custom_hn(links, subtext, page_no): #Creates a list of links and subtext
    hn = []
    for idx, item in enumerate(links): #Need to use this because not every link has a lot of votes
        title = links[idx].getText()
        href = links[idx].get('href', None)
        vote = subtext[idx].select('.score')
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:  #Only appends stories that are over 100 points
                hn.append({'title': title, 'link': href, 'votes': points})
    return sort_stories_by_votes(hn)


for page_no in range(1,3):
    print(f'Page: {page_no}')
    url = f'https://news.ycombinator.com/news?p={page_no}'
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    links = soup.select('.titlelink')
    subtext = soup.select('.subtext')
    pprint.pprint(create_custom_hn(links, subtext, page_no))

Page: 1

[{'link': 'https://www.thisworddoesnotexist.com/',
  'title': 'This word does not exist',
  'votes': 904},
 {'link': 'https://www.sparkfun.com/news/3970',
  'title': 'A patent troll backs off',
  'votes': 662},
.
.

Page: 2

[{'link': 'https://www.vice.com/en/article/m7vqkv/how-fbi-gets-phone-data-att-tmobile-verizon',
  'title': "The FBI's internal guide for getting data from AT&T, T-Mobile, "
           'Verizon',
  'votes': 802},
 {'link': 'https://www.dailymail.co.uk/news/article-10063665/Government-orders-Google-track-searching-certain-names-addresses-phone-numbers.html',
  'title': 'Feds order Google to track people searching certain names or '
           'details',
  'votes': 733},
.
.

【讨论】：