【Posted】: 2017-10-03 06:33:56
【Question】:
When I run the script I wrote in Python, I see a lot of duplicate results. Is there a way to get rid of the duplicates? Here is my script:
import requests
from lxml import html

def Startpoint():
    default = "http://tennishub.co.uk"
    link = "http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    # Each "countylist" block holds the links to the county pages.
    titles = tree.xpath('//div[@class="countylist"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            page = default + link
            Midpoint(page)

def Midpoint(address):
    default = "http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    # Each "pagination" block holds the links to the result pages.
    titles = tree.xpath('//div[@class="pagination"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            mlink = default + link
            print(mlink)

Startpoint()
Here is a screenshot of what I get:
【Comments】:
- As you crawl links, add each URL to a set. Before crawling a link, check whether it is already in the set.
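The set-based check suggested in the comment could look roughly like the sketch below. It is only an illustration, not the asker's final code: the helper name `collect_new_links` and the sample hrefs are hypothetical, and the helper works on plain lists of hrefs so it can be tried without any network access.

```python
from urllib.parse import urljoin

def collect_new_links(base_url, hrefs, seen):
    """Return absolute URLs built from hrefs that are not yet in `seen`.

    `seen` is a set shared across calls; it is updated in place so a URL
    found on one page is skipped when it shows up again on another page.
    """
    fresh = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url in seen:
            continue  # already printed or crawled, skip the duplicate
        seen.add(url)
        fresh.append(url)
    return fresh

# Hypothetical hrefs, standing in for what the pagination XPath might return.
seen = set()
first_page = ["/county/a?page=1", "/county/a?page=2", "/county/a?page=2"]
second_page = ["/county/a?page=1", "/county/a?page=3"]

for hrefs in (first_page, second_page):
    for url in collect_new_links("http://tennishub.co.uk", hrefs, seen):
        print(url)
```

In the script from the question, one way to apply this would be to keep a single module-level set and have Midpoint print only the URLs returned by such a helper. Duplicates probably arise because a pagination block repeats the same href and neighbouring pages link to each other, but that is an assumption about the site's markup.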
Tags: python web-scraping web-crawler