Paginating with a Python 2.7.9 Web Crawler: Answers

【Question Title】: Paginating with Python 2.7.9 Web Crawler
【Posted】: 2015-10-06 16:46:48
【Question Description】:

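I am trying to write a program in Python 2.7.9 to crawl the website http://tennishub.co.uk/ and collect club names, addresses and phone numbers.

The following code does the job, except that it does not move on to the subsequent pages for each location, such as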

/Berkshire/1
/Berkshire/2
/Berkshire/3

...and so on.

import requests
from bs4 import BeautifulSoup


def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)


def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')

    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td',{'class':'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td',{'class':'emailrow'})[0].text
        except:
            pass
        print item_url


tennis_club()

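I have tried to adapt the code as far as my understanding goes, but it does not work at all.

Can anyone tell me what I need to do so that the program goes through all the pages of a location, collects the data, then moves on to the next location, and so on?

【Question Discussion】:

Tags: python-2.7 pagination beautifulsoup web-crawler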

【Solution 1】:

You need to add another for loop to this code:

for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop over the county's page numbers goes here,
    # with the call below moved inside it
    pages_data(href)
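
As a rough illustration of what that extra loop has to produce, the sketch below builds the page URLs for one county, following the /Berkshire/1, /Berkshire/2 pattern from the question. The helper name `county_page_urls` and its `num_pages` argument are hypothetical; how to find the real page count is a separate problem.

```python
def county_page_urls(county_href, num_pages, base='http://tennishub.co.uk'):
    """Return the URL of every results page for one county.

    county_href is the link target from the county list, e.g. '/Berkshire'.
    num_pages is assumed to be known already (hypothetical parameter).
    """
    county_href = county_href.rstrip('/')  # tolerate a trailing slash
    return [base + county_href + '/' + str(page)
            for page in range(1, num_pages + 1)]


urls = county_page_urls('/Berkshire', 3)
for url in urls:
    print(url)
```

Inside the crawler, each URL returned by the helper would be passed to pages_data() in place of the single county URL.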

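If you want to brute-force it, you can simply run the for loop as many times as the region with the most clubs (Surrey) has pages, but then for many regions you will collect the clubs on the last page two, three, four or more times over. This is ugly, but you can get away with it if you are using a database that does not insert duplicates. If you are writing to a file, however, it is not acceptable. In that case you need to extract the number shown in parentheses after the region name, as in Berkshire (39). To get that number you can call get_text() on div.countylist, which changes the above to: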

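import re

for county in soup.select('div.countylist'):
    for link in county.find_all('a'):
        # Assumes the club count, e.g. " (39)", is the text right after the link
        count_text = link.next_sibling or ''
        match = re.search(r'\d+', count_text)
        num_clubs = int(match.group()) if match else 0
        # Ceiling division, assuming 10 clubs per page
        num_pages = (num_clubs + 9) // 10
        # Visit every page of the county, not just the last one
        for page in range(1, num_pages + 1):
            href = 'http://tennishub.co.uk' + link.get('href') + '/' + str(page)
            pages_data(href)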

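(Disclaimer: I have not run this through bs4, so there may be syntax errors, and you may need a different accessor than the one shown to reach the count, but the logic should help you.)

【Discussion】: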
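
The two steps that are easiest to get wrong, cleaning the count out of text such as "(39)" and turning it into a page count, can be checked without touching the site. The sketch below is only an illustration: the function names `parse_club_count` and `pages_needed` are hypothetical, and the page size of 10 is an assumption taken from the division by 10 in the answer's snippet.

```python
import re


def parse_club_count(text):
    """Pull the first integer out of a string such as ' (39)'.

    Returns 0 when the text contains no digits.
    """
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0


def pages_needed(num_clubs, per_page=10):
    """Ceiling division: how many pages are needed to list num_clubs.

    per_page=10 is an assumption, not something confirmed by the site.
    """
    return (num_clubs + per_page - 1) // per_page


print(pages_needed(parse_club_count('Berkshire (39)')))  # 4
```

Plain floor division plus one would give one page too many whenever the count is an exact multiple of the page size (40 clubs would yield 5 pages rather than 4), which is why the sketch rounds up instead.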