Why doesn't my Python scraping script work for different URLs? Answers

【Question title】: Why is my Python scraping script not working for different URLs?
【Posted】: 2016-07-18 04:18:57
【Question】:

I wrote a scraping script in Python for two different URLs:

http://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA

http://www.yellowpages.com.au/search/listings?clue=concrete+contractors&locationClue=nsw+australia&lat=&lon=&selectedViewMode=list

For the first URL, I wrote the following script:

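import requests
from bs4 import BeautifulSoup

# Fetch the search results page (the variable holds the response, not the URL string)
url = requests.get("http://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA")

# Name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(url.content, "html.parser")
print(soup.prettify())

# Each listing sits in a div with class "info"; print its business name
g_data = soup.find_all("div", {"class": "info"})
for item in g_data:
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)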

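It printed the text of every business name. However, when I use a different script with the same structure for the second URL, it fetches the URL's content but does not retrieve the full page content the way it does for the first URL.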

Script for the second URL:

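import requests
from bs4 import BeautifulSoup

# Fetch the search results page (the variable holds the response, not the URL string)
url = requests.get("http://www.yellowpages.com.au/search/listings?clue=concrete+contractors&locationClue=nsw+australia&lat=&lon=&selectedViewMode=list")

# Name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(url.content, "html.parser")
print(soup.prettify())

# Look for listing containers with class "body left" and print each listing name
g_data = soup.find_all("div", {"class": "body left"})
for item in g_data:
    print(item.contents[0].find_all("a", {"class": "listing-name"})[0].text)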

My question is: why doesn't it work like the first script, and why doesn't it return the business names?

【Comments】:

Tags: python beautifulsoup

【Solution 1】:

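You should first look at the output of soup.prettify(). The www.yellowpages.com.au site may have firewall protection that makes sure only real people access its information.
If that first step is fine, you have the site's content and can debug the rest of the code. The attrs argument to find_all may be wrong; see "Searching by CSS class" in the Beautiful Soup documentation.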
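
As a rough illustration of the CSS class point: according to the Beautiful Soup documentation, a multi-class string such as "body left" matches only when the class attribute is exactly that string, so a CSS selector is the safer way to require several classes at once. The sketch below assumes the listings use div.body.left and a.listing-name as in the question; SAMPLE_HTML and listing_names are hypothetical names made up for this example, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may be structured differently.
SAMPLE_HTML = """
<div class="left body"><a class="listing-name" href="#">Acme Concrete</a></div>
<div class="body left"><a class="listing-name" href="#">Best Slabs</a></div>
"""


def listing_names(html):
    """Return the text of every a.listing-name inside a div that has both classes."""
    soup = BeautifulSoup(html, "html.parser")
    # The selector requires both classes on the div, in any order.
    links = soup.select("div.body.left a.listing-name")
    return [link.get_text(strip=True) for link in links]


print(listing_names(SAMPLE_HTML))
```

If this returns an empty list for the real page, the markup the script received probably does not contain those elements at all, which points back to checking the soup.prettify() output first.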

【Comments】:

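soup.prettify() works, but it does not pull the full HTML of the page source the way it does for the first URL. How can I get the HTML content as I do for the first page?
When you fetch the content of the second URL, the website detects unusual traffic activity. If you run your script and then open the second URL in a browser, you will be redirected to the HCI verification page, and you have to get past that verification. You can search Google for ways to bypass reCaptcha.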
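
One way to confirm that the script is receiving a verification page instead of the listings is to check the response before parsing it. This is only a minimal sketch and does not bypass any verification: looks_blocked and fetch_and_check are hypothetical helper names, the User-Agent value is an arbitrary placeholder, and the marker string "listing-name" is an assumption taken from the class name used in the question.

```python
def looks_blocked(status_code, html, expected_marker):
    """Heuristic check: True when the response is probably not the listings page."""
    if status_code != 200:
        return True
    # A real results page should contain the marker we intend to scrape.
    return expected_marker not in html


def fetch_and_check(url, expected_marker):
    """Fetch url and report whether the body looks like a block page."""
    import requests  # imported here so looks_blocked works without requests installed

    # Placeholder header value; some sites treat the default client string differently.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-script)"}
    response = requests.get(url, headers=headers, timeout=10)
    blocked = looks_blocked(response.status_code, response.text, expected_marker)
    return response.text, blocked
```

Calling fetch_and_check(second_url, "listing-name") and finding blocked set to True would suggest the site served something other than the search results, which matches the behavior described in the comment above.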