【Question title】: Python - scraping a paginated site and writing the results to a file
【Posted】: 2016-12-14 13:06:59
【Question】:

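I'm a complete beginner at programming, so please forgive me if I don't express my problem well. I'm trying to write a script that goes through a series of news pages and records each article's title and link. I've managed to do this for the first page; the problem is getting the content of the subsequent pages. By searching Stack Overflow, I think I found a solution that makes the script visit more than one URL, but it seems to overwrite the content extracted from each page it visits, so I always end up with the same number of articles recorded in the file. Something that might help: I know the URLs follow this pattern: "/ultimas/?page=1", "/ultimas/?page=2", and so on, and the site appears to use AJAX to request new articles.

Here is my code: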

import csv
import requests
from bs4 import BeautifulSoup as Soup
import urllib
r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="

for page in range(1, 4):
    url =  "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))



letters = soup.find_all("div", class_="titulo-noticia")

letters[0]

lobbying = {}
for element in letters:
    lobbying[element.a.get_text()] = {}

letters[0].a["href"]
prefix = "http://agenciabrasil.ebc.com.br"

for element in letters:
    lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]



for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

import os, csv
os.chdir("...")

with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link",])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

        import json

with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)

print "Fim"

Any help on how to add the content of every page to the final file would be much appreciated. Thanks!

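【Comments】:

  • It might also be a good idea to look at a tool like scrapy.
  • My problem has already been solved by another poster, but I'll look into it anyway. Thanks for the suggestion!

Tags: python ajax web-scraping beautifulsoup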


【Answer 1】:

How about this, if it serves the same purpose:

import csv
import requests
from lxml import html

base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page={0}"

# Open the output once; the with-block closes (and flushes) the file when done.
with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Caption", "Link"])
    for page in range(1, 4):
        response = requests.get(program_url.format(page))
        tree = html.fromstring(response.text)
        # One row per <div class="noticia"> block found on the page.
        for title in tree.xpath("//div[@class='noticia']"):
            caption = title.xpath('.//span[@class="field-content"]/a/text()')[0]
            link = title.xpath('.//span[@class="field-content"]/a/@href')[0]
            writer.writerow([caption, base_url + link])
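
A side note on building the URLs: gluing strings together by hand makes it easy to end up with a doubled slash when the base ends in "/" and the path also starts with one. Below is a minimal Python 3 sketch that uses the standard library's urllib.parse.urljoin to resolve the path against the base instead. The helper names `page_urls` and `absolute_link` are hypothetical, and the article path in the last line is made up purely for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://agenciabrasil.ebc.com.br/"


def page_urls(first, last):
    """Return the listing URLs for pages first..last (inclusive)."""
    return [urljoin(BASE_URL, "/ultimas/?page={0}".format(n))
            for n in range(first, last + 1)]


def absolute_link(href):
    """Turn a site-relative href into an absolute URL."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    for url in page_urls(1, 3):
        print(url)
    # "/geral/noticia/exemplo" is a made-up path, used only for illustration.
    print(absolute_link("/geral/noticia/exemplo"))
```

Because urljoin treats a path that starts with "/" as relative to the site root, it should not matter whether the base URL is written with or without a trailing slash.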

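【Comments】:

  • Oh, I ended up modifying this script so many times that it's almost unrecognizable today. But thanks for the input! Your approach does indeed seem more efficient. I'll try it later.

【Answer 2】: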

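Because your file isn't indented correctly, the code that belongs inside your for loop (for page in range(1, 4):) doesn't seem to be run on each pass.

If you tidy up your code, it works: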

# Python 2 code (print statement, urllib.urlopen), as in the question.
import csv, json, os, urllib
from bs4 import BeautifulSoup as Soup

# No trailing slash here, so the page URL does not end up with "//".
base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page="

# Create the dict once, before the loop, so entries from every page accumulate.
lobbying = {}

for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))

    letters = soup.find_all("div", class_="titulo-noticia")

    # Everything that uses `letters` must stay indented inside the loop.
    for element in letters:
        lobbying[element.a.get_text()] = {"link": base_url + element.a["href"]}

for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

#os.chdir("...")

with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link"])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)

print "Fim"
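
For anyone on Python 3, the same collect-then-write pattern can be sketched without touching the network, which makes it easy to try out. This is only an illustration under stated assumptions: `fetch_titles` is a hypothetical stand-in for the download-and-parse step and returns canned data, and `collect` and `write_csv` are made-up helper names. What it shows is that the dictionary is created once before the loop, so every page adds to it, and the file is written a single time at the end.

```python
import csv
import io


def fetch_titles(page):
    """Hypothetical stand-in for downloading and parsing one listing page.

    Returns (title, href) pairs; the data is canned, not scraped.
    """
    return [("Titulo %d-%d" % (page, i), "/noticia/%d-%d" % (page, i))
            for i in range(1, 3)]


def collect(pages, prefix="http://agenciabrasil.ebc.com.br"):
    """Accumulate the titles and links of all pages in one dict."""
    results = {}  # created once, outside the loop
    for page in pages:
        for title, href in fetch_titles(page):
            results[title] = {"link": prefix + href}
    return results


def write_csv(results, handle):
    """Write a header plus one row per article to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(["name", "link"])
    for title, info in results.items():
        writer.writerow([title, info["link"]])


if __name__ == "__main__":
    articles = collect(range(1, 4))
    # In-memory buffer for the demo; a real run would use
    # open("lobbying.csv", "w", newline="", encoding="utf-8") instead.
    buffer = io.StringIO()
    write_csv(articles, buffer)
    print(buffer.getvalue())
```

With a real file in Python 3, opening it with newline="" and encoding="utf-8" lets the csv module manage line endings and keeps accented titles intact, so the manual .encode("utf-8") call is no longer needed.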

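【Comments】:

  • Man, thank you so much, it works just the way I wanted. I'm still getting used to how disciplined you have to be when coding.
  • Practice makes perfect! Python is a great language for scraping. Happy coding!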