Scrape page with generator: answers

【Question title】: Scrape page with generator
【Posted】: 2014-06-24 11:16:50
【Question description】:

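I scraped a website with Beautiful Soup. The problem is that some parts of the site are paginated with JS, and the number of pages to scrape is unknown (it varies).
I tried to solve this with a generator, but it is the first one I have written, and I am having a hard time wrapping my head around it and working out whether what I am doing makes sense.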

Code:

from bs4 import BeautifulSoup
import urllib
import urllib2
import jabba_webkit as jw
import csv
import string
import re
import time

tlds = csv.reader(open("top_level_domains.csv", 'r'), delimiter=';')
sites = csv.writer(open("websites_to_scrape.csv", "w"), delimiter=',')

tld = "uz"
has_next = True
page = 0

def create_link(tld, page):
    if page == 0:
        link = "https://domaintyper.com/top-websites/most-popular-websites-with-" + tld + "-domain"
    else:
        link = "https://domaintyper.com/top-websites/most-popular-websites-with-" + tld + "-domain/page/" + repr(page)

    return link

def check_for_next(soup):
    disabled_nav = soup.find(class_="pagingDivDisabled")

    if disabled_nav:
        if "Next" in disabled_nav:
            return False
        else:
            return True
    else:
        return True


def make_soup(link):
    html = jw.get_page(link)
    soup = BeautifulSoup(html, "lxml")

    return soup

def all_the_pages(counter):
    while True: 
        link = create_link(tld, counter)
        soup = make_soup(link)
        if check_for_next(soup) == True:
            yield counter
        else:
            break
        counter += 1

def scrape_page(soup):
    table = soup.find('table', {'class': 'rankTable'})
    th = table.find('tbody')
    test = th.find_all("td")

    correct_cells = range(1,len(test),3)
    for cell in correct_cells:
        #print test[cell]
        url = repr(test[cell])
        content = re.sub("<[^>]*>", "", url)
        sites.writerow([tld]+[content])


def main():

    for page in all_the_pages(0):

        print page
        link = create_link(tld, page)
        print link
        soup = make_soup(link)
        scrape_page(soup)


main()

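My thinking about the code:
The scraper should fetch a page, determine whether there is another page, scrape the current page, then move on to the next page and restart the process. If there is no next page, it should stop. Does the way I am doing this make sense?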
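
For reference, the flow described above can be sketched as a generator that yields each page before checking whether to stop. As posted, all_the_pages breaks before yielding the last page (so that page is never scraped) and main() downloads every page a second time. This is only a minimal sketch: iter_pages, fetch and has_next are hypothetical names, with fetch standing in for make_soup(create_link(tld, page)) and has_next for check_for_next(soup).

```python
def iter_pages(fetch, has_next, start=0):
    """Yield (page_number, soup) for every page, including the last one.

    fetch and has_next are hypothetical callables supplied by the caller;
    they stand in for the question's make_soup/create_link and
    check_for_next helpers.
    """
    page = start
    while True:
        soup = fetch(page)       # each page is downloaded exactly once
        yield page, soup         # hand the page over before deciding to stop
        if not has_next(soup):   # "Next" is disabled: this was the last page
            return
        page += 1
```

With the helpers from the question, the loop in main() would then probably look like `for page, soup in iter_pages(lambda p: make_soup(create_link(tld, p)), check_for_next): scrape_page(soup)`.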

【Comments on the question】:

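You can use selenium to load the JS content, find the navigation button, and still use BS4 to parse the HTML.
Parsing works; JS is not the problem. The problem is that I don't know how many times to "click next". I tried to handle this by checking, each time I scrape a page, whether there is still a "Next" button. If there is, I want to scrape the next page; if not, I want to break.
Share a URL where I can see what the HTML looks like.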
domaintyper.com/top-websites/…

Tags: python web-scraping beautifulsoup generator

【Solution 1】:

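As I told you, you could use selenium to click the Next button programmatically, but since that is not an option for you, I can think of the following approach to get the number of pages with pure BS4: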

import requests
from bs4 import BeautifulSoup

def page_count():
    pages = 1
    url = "https://domaintyper.com/top-websites/most-popular-websites-with-uz-domain/page/{}"

    while True:
        html = requests.get(url.format(pages)).content
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(html, "html.parser")

        table = soup.find('table', {'class': 'rankTable'})
        # Stop at the first page with no table or only a header row.
        if table is None or len(table.find_all('tr')) <= 1:
            return pages
        pages += 1
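
The same stop condition (a page whose table has no data rows) can also drive a generator directly, so pages are not fetched once to count them and again to scrape them. The following is a rough sketch under assumptions, not part of the original answer: iter_rows_until_empty and fetch_rows are hypothetical names, and fetch_rows is assumed to wrap the requests and BeautifulSoup calls above and return the list of data rows for a page number (an empty list once the pages run out).

```python
def iter_rows_until_empty(fetch_rows, start=1):
    """Yield (page_number, rows) until a page comes back with no data rows.

    fetch_rows is a hypothetical callable provided by the caller; it is
    assumed to return a list of data rows for the given page number.
    """
    page = start
    while True:
        rows = fetch_rows(page)
        if not rows:             # first empty page: nothing left to scrape
            return
        yield page, rows
        page += 1
```

Because the generator stops by itself, the caller never needs to know the page count in advance; if the count is still wanted, it is the number of items the generator produced.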

【Discussion】: