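【Posted】: 2017-06-01 07:16:09
【Question】:
I have written a crawler that parses certain content from a website.
First, it scrapes the links to each category from the left-hand sidebar.
Second, it collects all the links spread across the pagination that lead to the profile pages.
Finally, it visits each profile page and scrapes the name, phone number, and website URL.
So far it works well. The only problem I see is that it always starts scraping from the second page and skips the first. I suppose there is some way to fix this. Here is the complete code I am using: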
import requests
from lxml import html

url="https://www.houzz.com/professionals/"

def category_links(mainurl):
    req=requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)   # links to the category from left-sided bar

def next_pagelink(process_links):
    req=requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the whole links spread through pagination connected to the profile page

def profile_pagelink(procured_links):
    req=requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req=requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles,xpath):
        info=titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles,".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
        print(name,phone,web)

category_links(url)
【Discussion】:
Tags: python-3.x web-scraping web-crawler
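One plausible cause, assuming the site renders the current page number as plain text and not as an `a.pageNumber` link: the pagination XPath in next_pagelink would then match only the links to the other pages, so the category page itself (page one) is never passed to profile_pagelink. The sketch below is not taken from the question. It only illustrates how to build the list of pages to visit with the starting page first, followed by the pagination links. The helper name pages_to_visit and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def pages_to_visit(start_url, pagination_hrefs):
    """Return start_url followed by each pagination link, without duplicates.

    pagination_hrefs stands for whatever the pagination XPath returns for
    the page at start_url. Relative links are resolved against start_url.
    """
    ordered = [start_url]   # page one is the page we are already on
    seen = {start_url}
    for href in pagination_hrefs:
        absolute = urljoin(start_url, href)
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


if __name__ == "__main__":
    # Hypothetical hrefs, shaped like what a pagination XPath might return.
    hrefs = ["/pros/p/15", "/pros/p/30", "/pros/p/15"]
    for page in pages_to_visit("https://example.com/pros/", hrefs):
        print(page)
```

Under that assumption, next_pagelink could loop over pages_to_visit(process_links, links_from_xpath) and call profile_pagelink on each result, so the first page is scraped as well and a page that appears twice in the pagination bar is fetched only once.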