Webscraping Using BeautifulSoup: Retrieving source code of a website

【Question title】: Webscraping Using BeautifulSoup: Retrieving source code of a website
【Posted】: 2016-03-22 22:39:51
【Question description】:

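Good day! I'm currently building a web scraper for the Alibaba website. My problem is that the returned source code doesn't contain some of the parts I'm interested in. The data is there when I view the source in my browser, but I can't retrieve it with BeautifulSoup. Any suggestions?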

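from urllib.request import urlopen  # import missing from the post; urllib2 on Python 2

from bs4 import BeautifulSoup

def make_soup(url):
    # Return the parsed page, or None if the download fails.
    try:
        html = urlopen(url).read()
    except Exception:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)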

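I'm interested in the part highlighted in the image, which was taken with Chrome's developer tools. But when I write the output to a text file, some parts, including the highlighted one, are nowhere to be found. Any tips? TIA!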

【Comments】:

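They are probably writing part of the document dynamically with client-side JS, possibly in response to an AJAX request that you haven't made.
Are the two assignments below the code block part of the code block?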
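
One way to test the first comment's hypothesis is to check whether the element you want exists in the HTML exactly as downloaded, before any JavaScript runs. The following is a minimal sketch, not a definitive diagnosis: the helper name `selector_in_static_html`, the `div.item` selector, and both HTML snippets (including the `loadItems` call) are made up for illustration, and the stdlib-backed "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

def selector_in_static_html(html, selector):
    """Return True if the CSS selector matches anything in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector) is not None

# Hypothetical page whose items are already present in the server response.
static_page = '<div id="list"><div class="item">Peat moss</div></div>'

# Hypothetical page whose items would only be added later by a script.
js_page = '<div id="list"></div><script>loadItems("#list")</script>'

print(selector_in_static_html(static_page, "div.item"))
print(selector_in_static_html(js_page, "div.item"))
```

If the selector matches in the browser's developer tools but not in the downloaded HTML, the content is most likely added after the initial response, and a parser alone will not see it.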

Tags: python html beautifulsoup html-parsing

【Solution 1】:

You need to provide at least the User-Agent header.

Example using the requests package instead of urllib2:

import requests
from bs4 import BeautifulSoup

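# Browser-like User-Agent sent with every request.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/47.0.2526.80 Safari/537.36"
    )
}

def make_soup(url):
    # Return the parsed page, or None if the request fails.
    try:
        html = requests.get(url, headers=HEADERS).content
    except requests.RequestException:
        return None
    return BeautifulSoup(html, "lxml")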

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)

print(soup.select_one("a.next").get('href'))

Prints http://www.alibaba.com/catalogs/products/CID144/2.
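
If you would rather keep `urlopen` as in the question, the same header can be attached through a `Request` object. This is a rough sketch assuming Python 3 (`urllib.request`); the helper names `build_request` and `fetch_html` and the 10-second timeout are illustrative choices, and the User-Agent string is simply the one from the answer above.

```python
from urllib.request import Request, urlopen

# Assumption: any common browser User-Agent would do; this one is
# copied from the requests example above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/47.0.2526.80 Safari/537.36"
)

def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download the page body as bytes, or return None on a network error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None
```

The bytes returned by `fetch_html(url)` can then be passed to `BeautifulSoup(html, "lxml")` exactly as in `make_soup`.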

【Discussion】:

Hi! I get this error when I run your program. AttributeError: 'NoneType' object has no attribute 'get'
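
That error means `soup.select_one("a.next")` returned `None`: no `a.next` element was found in the HTML that was actually fetched. Below is a minimal sketch of a guard, assuming a hypothetical helper named `next_page_href` and using inline HTML snippets in place of a live request so that it runs offline.

```python
from bs4 import BeautifulSoup

def next_page_href(soup, selector="a.next"):
    """Return the href of the first element matching selector, or None."""
    if soup is None:  # make_soup() returns None when the download failed
        return None
    link = soup.select_one(selector)
    if link is None:  # the page has no matching element
        return None
    return link.get("href")

# Illustrative snippets, not real Alibaba markup.
with_link = BeautifulSoup(
    '<a class="next" href="/catalogs/products/CID144/2">Next</a>', "html.parser"
)
without_link = BeautifulSoup("<p>no pagination here</p>", "html.parser")

print(next_page_href(with_link))
print(next_page_href(without_link))
```

When the helper returns `None`, it is worth saving the downloaded HTML and checking whether the pagination link is present in it at all.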