【Question title】: Having trouble parsing through page numbers in Python
【Posted】: 2022-01-22 21:28:22
【Question】:

I'm trying to build my first web scraper, but I can't figure out how to stop my program from looking for the "next page" link.

#get URLs for all pages
def page_parse(main_url, url_list):
    page = requests.get(main_url);
    soup = BeautifulSoup(page.content, 'html.parser');
    #check if next page button inactive
    if soup.find('a.next.ajax-page', href=True) == None:
        print('debug');
        return url_list;
    next_page = soup.select_one('a.next.ajax-page', href=True)['href']
    next_page = (f'http://www.yellowpages.com{next_page}')
    url_list.append(next_page);
    print(str(url_list))
    page_parse(next_page, url_list);
    return url_list;

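I know what the error is; I just don't know how to check whether the "next page" button is active. I tried looking for the HTML difference between the "next page" button on the first and last pages (the first page uses a.next.ajax-page, while the last page uses div.next). Depending on the changes I make to the code, it either hits print('debug') or reaches the last page and raises a TypeError [see below]. I think the problem is that I can't check whether an element exists without calling it.

Error output: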

['http://www.yellowpages.com/omaha-ne/towing?page=2']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5', 'http://www.yellowpages.com/omaha-ne/towing?page=6']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5', 'http://www.yellowpages.com/omaha-ne/towing?page=6', 'http://www.yellowpages.com/omaha-ne/towing?page=7']
Traceback (most recent call last):
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 49, in <module>  
    url_list = page_parse(main_url, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  [Previous line repeated 3 more times]
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 15, in page_parse
    next_page = soup.select_one('a.next.ajax-page', href=True)['href']
TypeError: 'NoneType' object is not subscriptable

Sorry if this is confusing; this is my first time posting a question.

【Comments】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    The problem here is that you are trying to subscript a NoneType value. On the last page, soup.select_one('a.next.ajax-page', href=True) returns None, so there is nothing to access ['href'] on.
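
To make that concrete, here is a minimal sketch of checking the result for None before subscripting it. The helper name next_href and the two HTML snippets are made up for illustration; the markup is only a simplified guess based on the selectors mentioned in the question (a.next.ajax-page on pages with a next link, div.next on the last page).

```python
from bs4 import BeautifulSoup


def next_href(html):
    """Return the href of the active next-page link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("a.next.ajax-page")
    if link is None:
        # No active link on this page, so there is nothing to subscript.
        return None
    return link.get("href")


# Hypothetical, simplified markup (not copied from the real site)
middle_page = '<a class="next ajax-page" href="/omaha-ne/towing?page=3">Next</a>'
last_page = '<div class="next">Next</div>'

print(next_href(middle_page))
print(next_href(last_page))
```

The point is only the order of operations: look the element up once, test it, and read the attribute afterwards.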

    【Comments】:

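      【Solution 2】:

      What happens?

      Your selection soup.find('a.next.ajax-page', href=True) never finds the element you are searching for, because it mixes two syntaxes (find() and CSS selectors). find() treats its first argument as a tag name, not as a CSS selector, so this call always returns None and there is no element whose attribute value you could read.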
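
A small sketch of the difference, run against an inline HTML string rather than the live site. The markup is a hypothetical stand-in for an active next-page link, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors in the question
html = (
    '<div class="pagination">'
    '<a class="next ajax-page" href="/omaha-ne/towing?page=2">Next</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# find() looks for a tag literally named "a.next.ajax-page": no match.
by_find = soup.find("a.next.ajax-page", href=True)

# select_one() parses the same string as a CSS selector: <a> with both classes.
by_css = soup.select_one("a.next.ajax-page")

# The find() spelling passes the tag name and the class attribute separately.
by_find_fixed = soup.find("a", {"class": "next ajax-page"})

print(by_find)
print(by_css["href"])
print(by_find_fixed["href"])
```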

      How to fix?

      Change the line that checks for the next-page element from:

      if soup.find('a.next.ajax-page', href=True) == None:
      

      to either:

      if soup.find('a',{'class':'next ajax-page'}) is None:
      

      or:

      if soup.select_one('a.next.ajax-page') is None:
      

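      You should also be able to scrape all the basic information from the search results and store it in one step, instead of returning a list of search-page URLs: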

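      def page_parse(url):
          data = []
          while True:
              page = requests.get(url)
              # name the parser explicitly so bs4 does not have to guess one
              soup = BeautifulSoup(page.text, 'html.parser')
              # collect the title and link of every search result on this page
              for item in soup.select('div.result'):
                  data.append({
                      'title':item.h2.text,
                      'url':f"{baseUrl}{item.a['href']}"
                  })
      
              # follow the next-page link if there is one, otherwise stop
              if (url := soup.select_one('a.next.ajax-page')):
                  url = f"{baseUrl}{url['href']}"
              else:
                  return data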
      

      Example

      import requests
      from bs4 import BeautifulSoup
      
      baseUrl = 'http://www.yellowpages.com'
      
      def page_parse(url):
          data = []
          while True:
              page = requests.get(url)
              soup = BeautifulSoup(page.text, 'html.parser')
              for item in soup.select('div.result'):
                  data.append({
                      'title':item.h2.text,
                      'url':f"{baseUrl}{item.a['href']}"
                  })
      
              if (url := soup.select_one('a.next.ajax-page')):
                  url = f"{baseUrl}{url['href']}"
              else:
                  return data
      
      page_parse('http://www.yellowpages.com/omaha-ne/towing')
      

      Output

      [{'title': "1. Keith's BP",
        'url': 'http://www.yellowpages.com/omaha-ne/mip/keiths-bp-460502890?lid=1002059325385'},
       {'title': '2. Neff Towing Svc',
        'url': 'http://www.yellowpages.com/omaha-ne/mip/neff-towing-svc-21969600?lid=1000282974083#gallery'},
       {'title': '3. A & A Towing',
        'url': 'http://www.yellowpages.com/omaha-ne/mip/a-a-towing-505777665?lid=1002056319136'},
       {'title': '4. Cross Electronic Recycling',
        'url': 'http://www.yellowpages.com/omaha-ne/mip/cross-electronic-recycling-473693798?lid=1000236876513'},
       {'title': '5. 24 Hour Towing',
        'url': 'http://www.yellowpages.com/omaha-ne/mip/24-hour-towing-521607477?lid=1001918028003'},
       {'title': '6. A & A Towing Fast Friendly',
        'url': 'http://www.yellowpages.com/omaha-ne/mip/a-a-towing-fast-friendly-478453697?lid=1000090213043'},
       {'title': '7. Austin David Towing',
        'url': 'http://www.yellowpages.com/omaha-ne/mip/austin-david-towing-465037110?lid=1001788338357'},...]
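
If you want to try the loop without hitting the site, a variant along these lines may help. It is only a sketch under a few assumptions: the names collect_results, fetch, max_pages and fake_pages are made up here, the two fake pages are simplified guesses at the markup, and urllib.parse.urljoin is swapped in for the f-string concatenation used above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_results(start_url, fetch, max_pages=50):
    """Follow next-page links from start_url; fetch(url) must return HTML text."""
    data = []
    url = start_url
    for _ in range(max_pages):  # hard cap so a markup change cannot loop forever
        soup = BeautifulSoup(fetch(url), "html.parser")
        for item in soup.select("div.result"):
            link = item.select_one("a[href]")
            if item.h2 is None or link is None:
                continue  # skip entries without a title or a link
            data.append({
                "title": item.h2.get_text(strip=True),
                "url": urljoin(url, link["href"]),
            })
        next_link = soup.select_one("a.next.ajax-page[href]")
        if next_link is None:
            break  # no active next link, so this was the last page
        url = urljoin(url, next_link["href"])
    return data


# Two hypothetical pages standing in for the live site
fake_pages = {
    "http://www.yellowpages.com/omaha-ne/towing": (
        '<div class="result"><h2>1. A</h2><a href="/omaha-ne/mip/a-1">A</a></div>'
        '<a class="next ajax-page" href="/omaha-ne/towing?page=2">Next</a>'
    ),
    "http://www.yellowpages.com/omaha-ne/towing?page=2": (
        '<div class="result"><h2>2. B</h2><a href="/omaha-ne/mip/b-2">B</a></div>'
        '<div class="next">Next</div>'
    ),
}

results = collect_results(
    "http://www.yellowpages.com/omaha-ne/towing", fake_pages.__getitem__
)
print(results)
```

Against the real site you would pass something like lambda u: requests.get(u).text as fetch; keeping the download separate from the parsing is what makes the loop easy to check offline.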
      

      【Comments】:
