Scraping paginated sites and appending output in Python

【Question title】: Scraping paginated sites and appending output in Python
【Posted】: 2014-12-02 11:30:34
【Question】:

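I have a simple scraping task. I would like to make its pagination handling more efficient and to append the results to a list, so that everything scraped can be written to a single, common file.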

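The current task is scraping the municipal laws of the city of São Paulo, iterating over the first 10 pages. I would like to find a way to determine the total number of pages and have the script loop through all of them automatically, similar to this: Handling pagination in lxml.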

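At the moment the xpath for the pagination links is too poorly defined for me to see how to do this effectively. For example, the first and last pages (1 and 1608) have only three li nodes, while page 1605 has six.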

/html/body/div/section/ul[2]/li/a

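How can I account for this pagination efficiently, determining the number of pages automatically rather than by hand, and how do I specify the xpath correctly so that the script loops through all the appropriate pages without repeating any?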

The existing code is as follows:

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import requests  
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o" 
for url in [base_url % i for i in xrange(10)]:
    page = requests.get(url)
    tree = html.fromstring(page.text)

    #This will create a list of titles:
    titles = tree.xpath('/html/body/div/section/ul/li/a/strong/text()')
    #This will create a list of descriptions:
    desc = tree.xpath('/html/body/div/section/ul/li/a/text()')
    #This will create a list of URLs
    url = tree.xpath('/html/body/div/section/ul/li/a/@href')

    print 'Titles: ', titles
    print 'Description: ', desc
    print 'URL: ', url

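Secondly, how do I compile/append these results and write them to JSON, SQL, etc.? I prefer JSON because I am familiar with it, but I am unsure how to go about it at the moment.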

【Comments】:

Tags: python pagination web-scraping lxml python-requests

【Solution 1】:

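You need to inspect the data layout of the page/site. Every site is different. Look for "pagination", "next", or some kind of slider. Extract the details/count and use it in a loop.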
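As a rough sketch of "extract the count": the question's URLs carry a `page=` query parameter, so one option is to read that parameter from every link on the first page and take the largest value. The sample markup below is made up for illustration, and `PageLinkParser` and `last_page_number` are hypothetical helper names. The sketch uses only the standard library so it runs as-is; the same hrefs could be collected with `tree.xpath('//a/@href')` in lxml.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class PageLinkParser(HTMLParser):
    """Collect the numeric value of every `page=` parameter found in <a href>."""

    def __init__(self):
        super().__init__()
        self.page_numbers = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # parse_qs maps each query parameter to a list of its values.
        for value in parse_qs(urlparse(href).query).get("page", []):
            if value.isdigit():
                self.page_numbers.append(int(value))


def last_page_number(html_text):
    """Return the highest page number linked from the page, or 1 if none."""
    parser = PageLinkParser()
    parser.feed(html_text)
    return max(parser.page_numbers, default=1)


# Hypothetical pagination markup; the real site's markup may differ.
sample = """
<ul class="pagination">
  <li><a href="?q=&amp;page=1&amp;types=o">first</a></li>
  <li><a href="?q=&amp;page=1604&amp;types=o">previous</a></li>
  <li><a href="?q=&amp;page=1606&amp;types=o">next</a></li>
  <li><a href="?q=&amp;page=1608&amp;types=o">last</a></li>
</ul>
"""
print(last_page_number(sample))
```

Taking the maximum avoids depending on how many li nodes a given page happens to show, which is the part that varies between page 1 and page 1605 in the question. It does assume the page links to the last page somewhere.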
Import the json library. It has a json dump function...
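A minimal sketch of the appending-and-dumping part, assuming the three xpath lists from the question (titles, descriptions, URLs) line up one-to-one on each page. `collect_records` and `save_as_json` are hypothetical helper names, and the sample values are invented stand-ins for real scraped data.

```python
import json


def collect_records(titles, descriptions, urls):
    """Pair up the three parallel lists from one page into a list of dicts."""
    return [
        {"title": t.strip(), "description": d.strip(), "url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]


def save_as_json(records, path):
    """Write all records to a single UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


# Made-up values standing in for the xpath results of two pages.
scraped_pages = [
    (["Lei 1"], [" first law "], ["/a/1"]),
    (["Lei 2", "Lei 3"], ["second law", "third law"], ["/a/2", "/a/3"]),
]

all_records = []
for titles, descriptions, urls in scraped_pages:
    # extend() keeps one flat list across pages instead of a list per page.
    all_records.extend(collect_records(titles, descriptions, urls))

print(json.dumps(all_records, ensure_ascii=False, indent=2))
```

Calling `save_as_json(all_records, "leis.json")` once after the loop would then produce the single output file. Note that `zip` silently stops at the shortest list, so if the lists can differ in length it may be safer to extract all three fields per li node instead.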

【Discussion】:

【Solution 2】:

I could not fully understand your question, but this code should help you get a fresh attempt started. It is compatible with Python 3 and later.

import requests
from lxml import html  # .cssselect() also needs the "cssselect" package installed

result = {}
base_url = "https://leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page={0}&types=28&types=5"
# Fetch pages 1 and 2; widen the range to cover more pages.
for page_url in [base_url.format(i) for i in range(1, 3)]:
    tree = html.fromstring(requests.get(page_url).text)
    for item in tree.cssselect(".item-result"):
        # Collapse runs of whitespace; fall back to "" when the element is missing.
        try:
            name = ' '.join(item.cssselect(".title a")[0].text.split())
        except Exception:
            name = ""

        try:
            domain = ' '.join(item.cssselect(".domain")[0].text.split())
        except Exception:
            domain = ""
        result[name] = domain

print(result)

Partial output:

{'Decreto 57998/2017': 'http://leismunicipa.is', 'Decreto 58009/2017': 'http://leismunicipa.is'}
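If the total page count is not easy to read from the page, another option is to keep requesting pages until one comes back with no items. The sketch below assumes the site returns an empty result list past the last page, which has not been verified here. `scrape_all` and `fake_fetch` are hypothetical names; in practice the fetch function would wrap the requests/lxml code above and return the items found on one page.

```python
def scrape_all(fetch_items, first_page=1, max_pages=5000):
    """Call fetch_items(page) for consecutive pages until one returns nothing.

    max_pages is a safety cap so a misbehaving site cannot cause an endless loop.
    """
    results = []
    for page in range(first_page, first_page + max_pages):
        items = fetch_items(page)
        if not items:
            break  # assumed to mean we have walked past the last page
        results.extend(items)
    return results


# Stand-in for a real fetcher: a tiny in-memory "site" with two pages.
fake_site = {1: ["Lei 1", "Lei 2"], 2: ["Lei 3"]}


def fake_fetch(page):
    return fake_site.get(page, [])


print(scrape_all(fake_fetch))
```

Passing the fetcher in as a function keeps the loop testable without network access, and collecting into a list rather than a dict keyed by name avoids silently overwriting entries that share a title.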

【Discussion】: