Loop through pages and crawl the contents in Python

【Question title】: Loop through pages and crawl the contents in Python
【Posted】: 2021-03-11 07:43:47
【Question description】:

I want to crawl the contents of this link:

How can I loop through all the pages and scrape all the elements circled in red? Thanks.

Code:

from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'http://www.eoechina.com.cn/cn2019/gonggaoxinxi.html?classID=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)

【Question discussion】:

Tags: python-3.x web-scraping beautifulsoup python-requests web-crawler

【Solution 1】:

The page loads its data from an endpoint that you can query directly to loop through the pages.

Here's how:

from urllib.parse import urlencode
import requests
import pandas as pd

end_point = "http://www.eoechina.com.cn/cn2016/mobile/GetArticleList.ashx"

payload = {
    "pageNumber": 1,
    "classID": 1,
    "searchKey": "",
    "selectItemID": "72,"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:86.0) "
                  "Gecko/20100101 Firefox/86.0",
    "X-Requested-With": "XMLHttpRequest",
}

for page in range(1, 5):
    payload["pageNumber"] = page
    response = requests.post(
        end_point,
        data=urlencode(payload),
        headers=headers,
    ).json()
    # print("\n".join(item["title"] for item in response))
    df = pd.DataFrame(response)
    print(df)

Sample output: (this is a screenshot, because SO thinks the output is spam...)
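The loop in the answer handles a fixed range of pages and prints each one separately. Here is a minimal sketch of gathering every page into one list. It assumes (the source does not confirm this) that the endpoint returns an empty list once the page number passes the last page. `collect_items` and `fetch_page` are hypothetical names, and `max_pages` is an arbitrary safety cap.

```python
def collect_items(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... and merge the results.

    fetch_page is any callable returning the list of dicts for one page.
    Stopping on an empty page is an assumption about the endpoint.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # assumed end-of-data signal
            break
        items.extend(batch)
    return items


# Possible use with the request from the answer (not run here):
# def fetch_page(page):
#     payload["pageNumber"] = page
#     return requests.post(
#         end_point, data=urlencode(payload), headers=headers
#     ).json()
#
# df = pd.DataFrame(collect_items(fetch_page))
```

Passing the fetching step in as a callable keeps the paging logic separate from the HTTP request, so it can be tried out with a stub before hitting the real site.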

【Discussion】:

Thanks. How can I extract all the items from response and collect "articleID", "title", "addTime", "qdPrice", "states" into a dataframe?
The response is a list of dictionaries, so just dump it into a pandas DataFrame.
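A minimal sketch of that step: passing `columns=` to `pd.DataFrame` keeps only the named keys from each dictionary. The records below are made-up placeholders standing in for the real JSON response. Only the five field names come from the question, and the `other` key is invented to show that extra fields are dropped.

```python
import pandas as pd

# Made-up placeholder records; the real ones come from response.json().
response = [
    {"articleID": 1, "title": "first", "addTime": "2021-01-01",
     "qdPrice": "10", "states": "open", "other": "ignored"},
    {"articleID": 2, "title": "second", "addTime": "2021-01-02",
     "qdPrice": "20", "states": "closed", "other": "ignored"},
]

wanted = ["articleID", "title", "addTime", "qdPrice", "states"]

# columns= selects (and orders) the keys to keep from each dict.
df = pd.DataFrame(response, columns=wanted)
print(df)
```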
One more question: how can I save the link for each item, for example eoechina.com.cn/cn2019/articleDetails1.html?articleID=22695?
Use the article ID from the response and build the URL yourself. For example, http://www.eoechina.com.cn/cn2019/articleDetails1.html?articleID=THE_ARTICLE_ID_FROM_THE_RESPONSE
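A minimal sketch of building those links, using only the URL pattern quoted in this thread. `build_article_url` is a hypothetical helper name, and 22695 is the ID from the comment above.

```python
from urllib.parse import urlencode

DETAILS_URL = "http://www.eoechina.com.cn/cn2019/articleDetails1.html"


def build_article_url(article_id):
    """Return the details-page URL for one article ID."""
    query = urlencode({"articleID": article_id})
    return f"{DETAILS_URL}?{query}"


# In practice the ID would come from item["articleID"] in the response.
print(build_article_url(22695))
```

If the items are already in a DataFrame `df`, something like `df["url"] = df["articleID"].map(build_article_url)` would add the links as a new column.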
I don't know. But give it a try and experiment! See where it takes you. :)