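[Posted]: 2011-11-30 23:57:38
[Question]:
I'm trying to scrape multiple pages from a website so that BeautifulSoup can parse them. So far I've tried doing this with urllib2, but I've run into some problems. Here is what I tried: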
import urllib2
from BeautifulSoup import BeautifulSoup

# Build the URL for each document id
for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)

# Fetch and parse the page (note: this runs after the loop, not inside it)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

# Locate the title, the date, and the span that opens the document text
title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]

first = title.string
second = date.string
start = span.string
# Join the text of every paragraph except the last one
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
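This only gives me results for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I also tried using mechanize, without success. Ideally, what I'd like is a page with a list of links: automatically pick one link, pass its HTML to BeautifulSoup, and then move on to the next link in the list.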
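The workflow described above (take a page of links, then fetch and parse each linked page in turn) might be sketched roughly as follows. This is only a sketch under stated assumptions, not a drop-in replacement: it uses the Python 3 standard library (`urllib.request` and `html.parser`) in place of urllib2 and BeautifulSoup, and the names `LinkCollector`, `fetch`, `extract_links`, `scrape_all`, and `parse_page` are hypothetical helpers introduced for illustration. The key point is that the download and the parsing both happen inside the loop body, once per link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and return it as text (assumes UTF-8; adjust as needed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(index_html, base_url):
    """Return absolute URLs for all links found in the index page's HTML."""
    collector = LinkCollector()
    collector.feed(index_html)
    return [urljoin(base_url, href) for href in collector.links]


def scrape_all(index_url, parse_page, fetch=fetch):
    """Fetch the index page, then fetch and parse each linked page in turn."""
    links = extract_links(fetch(index_url), index_url)
    results = []
    for link in links:
        # Everything that must happen once per page stays inside the loop body.
        html = fetch(link)
        results.append(parse_page(html))
    return results
```

Here `parse_page` is whatever function turns one page's HTML into the data you want; the BeautifulSoup calls from the question (finding the `paperstitle`, `docdate`, and `displaytext` spans) could be wrapped in such a function. Passing `fetch` as a parameter is a design choice that lets the loop be tried against canned HTML without touching the network.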
[Discussion]:
Tags: python web-scraping urllib2