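【Posted】: 2022-01-22 21:28:22
【Problem description】:
I am trying to build my first web scraper, but I can't figure out how to make my program stop looking for the "next page" link once there are no more pages.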
#get URLs for all pages
import requests
from bs4 import BeautifulSoup

def page_parse(main_url, url_list):
    page = requests.get(main_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    #check if next page button inactive
    if soup.find('a.next.ajax-page', href=True) is None:
        print('debug')
        return url_list
    next_page = soup.select_one('a.next.ajax-page', href=True)['href']
    next_page = f'http://www.yellowpages.com{next_page}'
    url_list.append(next_page)
    print(str(url_list))
    page_parse(next_page, url_list)
    return url_list
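I know what the error is; I just don't know how to check whether the "next page" button is active. I tried looking for differences in the HTML between the "next page" button on the first page and on the last page (the first page uses a.next.ajax-page, while the last page uses div.next). Depending on the changes I make to the code, it either hits print('debug') right away, or it reaches the last page and raises a TypeError [see below]. I think the problem is that I can't check whether an element exists without accessing it.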
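One way to check for the element without subscripting it is to keep the result of `select_one()` in a variable and compare it to `None` before reading `['href']`. The sketch below is a minimal illustration, not a tested fix for the live site: the two HTML snippets are hypothetical stand-ins based only on the class names described above, and `next_href` is a made-up helper name. It also shows why the `find()` check can print 'debug' immediately: `find()` treats its first argument as a tag name, not a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the description above; the real pages may differ.
MIDDLE_PAGE = (
    '<div class="pagination">'
    '<a class="next ajax-page" href="/omaha-ne/towing?page=2">Next</a>'
    '</div>'
)
LAST_PAGE = '<div class="pagination"><div class="next">Next</div></div>'


def next_href(html):
    """Return the href of the active "next page" link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one() takes a CSS selector and returns None when nothing matches.
    link = soup.select_one("a.next.ajax-page[href]")
    if link is None:
        return None
    return link["href"]


# find() looks for a tag literally named "a.next.ajax-page", so it finds nothing
# even on a page that does contain the link.
soup = BeautifulSoup(MIDDLE_PAGE, "html.parser")
find_result = soup.find("a.next.ajax-page", href=True)
```

With this kind of check, the existence test and the attribute lookup use the same selector, so they cannot disagree with each other.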
Error output:
['http://www.yellowpages.com/omaha-ne/towing?page=2']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5', 'http://www.yellowpages.com/omaha-ne/towing?page=6']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5', 'http://www.yellowpages.com/omaha-ne/towing?page=6', 'http://www.yellowpages.com/omaha-ne/towing?page=7']
Traceback (most recent call last):
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 49, in <module>
    url_list = page_parse(main_url, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  [Previous line repeated 3 more times]
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 15, in page_parse
    next_page = soup.select_one('a.next.ajax-page', href=True)['href']
TypeError: 'NoneType' object is not subscriptable
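The traceback suggests that on the last page `select_one()` returned `None`, and subscripting `None` with `['href']` raised the TypeError. A possible way to stop cleanly is to loop instead of recursing and break as soon as no active link is found. This is only a sketch under assumptions: `collect_page_urls` and the `fetch_html` parameter are hypothetical names introduced here so the loop can be tried without network access, and the selector assumes the last page really has no `a.next.ajax-page` element, as described in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.yellowpages.com"


def collect_page_urls(main_url, fetch_html):
    """Follow "next page" links starting at main_url and return the URLs found.

    fetch_html is a hypothetical callable mapping a URL to an HTML string;
    with requests it could be: lambda url: requests.get(url).text
    """
    url_list = []
    seen = {main_url}  # guards against a page that links back to itself
    current = main_url
    while True:
        soup = BeautifulSoup(fetch_html(current), "html.parser")
        link = soup.select_one("a.next.ajax-page[href]")
        if link is None:
            # No active "next" link: this is the last page, so stop.
            break
        current = urljoin(BASE_URL, link["href"])
        if current in seen:
            break
        seen.add(current)
        url_list.append(current)
    return url_list
```

A loop also avoids growing the call stack by one frame per page, which the recursive version does.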
Sorry if this is confusing; this is my first time posting a question.
【Discussion】:
Tags: python html web-scraping beautifulsoup