Answers to: A strange issue when trying to parse HTML with BeautifulSoup

【Question title】: A strange issue when trying to parse HTML with BeautifulSoup
【Posted】: 2013-02-16 22:40:09
【Question description】:

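I'm trying to write some Python code to collect music chart data from official websites, but I ran into trouble when collecting the Billboard data. I chose BeautifulSoup to handle the HTML.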

My environment: Python 2.7, BeautifulSoup 3.2.0

First I parse the HTML:

>>> import BeautifulSoup, urllib2, re
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
>>> soup = BeautifulSoup.BeautifulSoup(html)

Then I try to collect the data I want, for example the artist name.

HTML:

<div class="listing chart_listing">

<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider">
  <header>
    <span class="chart_position position-down">11</span>
            <h1>Ho Hey</h1>
        <p class="chart_info">
      <a href="/artist/418560/lumineers">The Lumineers</a>            <br>
      The Lumineers          </p>

The artist name is The Lumineers.

>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\
... .find("p", {"class":"chart_info"}).a.string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'find'

NoneType! It seems it cannot find the data I want. Maybe my rule is wrong, so I tried to find some basic tags instead.

>>> print str(soup.find("div"))
None
>>> print str(soup.find("a"))
None
>>> print str(soup.find("title"))
<title>The Hot 100 : Page 2  | Billboard</title>
>>> print str(soup)
......entire HTML.....

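I'm confused: why can't it find basic tags such as div and a? They are definitely there. What's wrong with my code? I have no problems when I analyze other charts the same way.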

【Comments】:

Tags: python python-2.7 beautifulsoup urllib2

【Solution 1】:

This seems to be an issue with BeautifulSoup 3. If you prettify() the output:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup3(html)
print soup.prettify()

you can see this at the end of the output:

        <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
  </script>
 </head>
</html>

With two closing html tags in the output, it looks like BeautifulSoup 3 is confused by the JavaScript content in this data.
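A quick way to check the raw input for the same symptom is to count closing tags before parsing. This is only a minimal sketch: `count_closing_tags` is a hypothetical helper, and the `sample` string is an invented stand-in for the page (it assumes the extra closing tags sit inside a script string, which the Billboard page may or may not do).

```python
import re

def count_closing_tags(html, tag):
    """Count case-insensitive occurrences of </tag> in a raw HTML string."""
    pattern = re.compile(r'</' + re.escape(tag) + r'\s*>', re.IGNORECASE)
    return len(pattern.findall(html))

# Invented sample: a script string that itself contains closing tags.
sample = (
    '<html><head><script>var s = "</body></html>";</script></head>'
    '<body><div>chart</div></body></html>'
)

# More than one </html> in the raw text hints that markup is embedded in a script.
print(count_closing_tags(sample, 'html'))
print(count_closing_tags(sample, 'div'))
```

A count above one for `html` or `body` does not prove the parser will fail, but it points at where to look first.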

If you use:

from bs4 import BeautifulSoup as soup4
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup4(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

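you get 'The Lumineers' as output.

If you cannot switch to bs4, I suggest writing the html variable out to a file out.txt, then changing the script to read from in.txt, copying the output to the input, and cutting away chunks to find the part that breaks the parser.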

from BeautifulSoup import BeautifulSoup as soup3
import re

html = open('in.txt').read()
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

My first guess was to remove <head> ... </head>, and that worked wonders.

After that, you can solve it programmatically:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
head_start = htmlorg.index('<head')
head_end = htmlorg.rindex('</head>')
head_end = htmlorg.index('>', head_end)
html = htmlorg[:head_start] + htmlorg[head_end+1:]
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
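The head-stripping step can also be pulled into a small function that does not raise when a marker is missing. This is a sketch under assumptions, not part of the original answer: `strip_head` is a hypothetical name, the `sample` string is invented, and unlike the `index`/`rindex` version it matches case-insensitively and does not mistake `<header>` for `<head>`.

```python
import re

_HEAD_OPEN = re.compile(r'<head\b', re.IGNORECASE)      # matches <head> but not <header>
_HEAD_CLOSE = re.compile(r'</head\s*>', re.IGNORECASE)

def strip_head(html):
    """Remove everything from the first <head to the last </head>.

    Returns the input unchanged if either marker is missing.
    """
    opening = _HEAD_OPEN.search(html)
    closings = list(_HEAD_CLOSE.finditer(html))
    if opening is None or not closings:
        return html
    end = closings[-1].end()  # last </head>, like rindex in the code above
    if end <= opening.start():
        return html
    return html[:opening.start()] + html[end:]

# Invented sample with a stray </head> inside a script string.
sample = ('<html><head><script>x = "</head>";</script></head>'
          '<body><div>ok</div></body></html>')
print(strip_head(sample))
```

The result can then be handed to the parser in place of the raw page. Using the last `</head>` mirrors the `rindex` choice above, so a stray closing tag inside a script is removed along with the rest of the head.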

【Comments】: