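【Posted】: 2015-11-30 02:26:24
【Question】:
I'm building a simple program that fetches the page titles for a list of URLs and writes them to a CSV file. I have most of it working and understand most of it, except for one thing: no matter how I change the code, I keep getting a KeyError. Please take a look and tell me what is wrong with this code: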
import requests
import json
import urllib2
import csv
from BeautifulSoup import BeautifulSoup

def getsnapshot(domain):
    base = 'http://archive.org/wayback/available?url='
    r = requests.get(base+domain, verify=False)
    j = json.loads(r.text)
    if j['archived_snapshots'] == {}:
        pass
    else:
        archive_url = j['archived_snapshots']['closest']['url']
        return archive_url

def gettitle(url):
    soup = BeautifulSoup(urllib2.urlopen(getsnapshot(url)))
    return soup.title.string

def writecsv(domain):
    c = csv.writer(open("output.csv", "wb"))
    snapshoturl = getsnapshot(domain)
    title = gettitle(snapshoturl)
    c.writerow([domain,title])

with open('input.txt', 'r') as f:
    for line in f.read().splitlines():
        writecsv(line)
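My input is just a list of URLs, specifically domain names. I'm checking each domain's history to see whether it was used for spam in the past.
Here is the JSON: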
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20050408030822/http://www.001music.net:80/",
            "timestamp": "20050408030822",
            "status": "200"
        }
    }
}
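For illustration only, here is a minimal sketch of reading that response without indexing keys directly, so a missing `archived_snapshots` or `closest` entry yields `None` instead of raising a KeyError. The helper name `closest_snapshot_url` is hypothetical and not part of the code above, and the "empty" and "missing" payloads are assumed shapes used only to exercise the fallback. It may also be worth noting that `writecsv` already calls `getsnapshot(domain)` and then `gettitle` calls `getsnapshot` again on the resulting archive URL, so the second lookup could be one place where an unexpected response shows up.

```python
import json


def closest_snapshot_url(payload):
    """Return the closest snapshot URL from a parsed response dict, or None.

    Hypothetical helper: uses dict.get at every level so that absent
    keys fall through to None rather than raising KeyError.
    """
    snapshots = payload.get("archived_snapshots") or {}
    closest = snapshots.get("closest") or {}
    if not closest.get("available"):
        return None
    return closest.get("url")


# The response shown in the question.
sample = json.loads('''
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20050408030822/http://www.001music.net:80/",
            "timestamp": "20050408030822",
            "status": "200"
        }
    }
}
''')

# Assumed shapes for "nothing archived" and "key absent" responses.
empty = json.loads('{"archived_snapshots": {}}')
missing = json.loads('{}')

print(closest_snapshot_url(sample))
print(closest_snapshot_url(empty))
print(closest_snapshot_url(missing))
```

With a helper like this, the caller can check for `None` and skip the domain before trying to open the URL or write a CSV row.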
【Discussion】:
Tags: python json beautifulsoup