Scraping for id="author" using Beautiful Soup [closed]: Answers

【Question title】: Scraping for id="author" using Beautiful Soup [closed]
【Posted】: 2018-05-26 21:33:00
【Question description】:

I am learning web scraping with Python and was given the following HTML file:

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

I opened the file and read it into the variable exampleSoup. I then wanted to scrape the author from it and was told to use

elems = exampleSoup.select('#author')

However, this returned an empty list. I then tried

elems = exampleSoup.select('span#author')

and got the output I wanted.

My question is: why did the first approach not work in this case?

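【Question discussion】:

Try .select("[id='author']") and tell me what it yields.
Yes, that works as expected. So does .select('#"author"'). I guess the quotes around "author" matter to the current version of either Beautiful Soup or Python.
@ToddBurus No, sir, the quotes around author don't matter.
This cannot be reproduced with current versions of BeautifulSoup and the various parser backends. We would need to see more details, such as how you created exampleSoup and what import bs4; print(bs4.__version__) outputs, plus the same __version__ attribute for html5lib or lxml.etree (if you are using either of those parsers rather than the one bundled with Python). However, this is more likely to be a BeautifulSoup version issue.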
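The version details requested in the last comment could be collected with something like the following. This is only a minimal sketch: the helper name report_versions and its "not installed" / "unknown" placeholder strings are made up for illustration, and it assumes each module exposes a __version__ attribute as the comment describes.

```python
import importlib
import sys


def report_versions(module_names=("bs4", "lxml.etree", "html5lib")):
    """Return a dict mapping each name to its version string.

    Hypothetical helper: modules that cannot be imported are reported
    as "not installed" instead of raising an error.
    """
    versions = {"python": sys.version.split()[0]}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        # Fall back to "unknown" if the module has no __version__ attribute.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in report_versions().items():
    print(f"{name}: {version}")
```

Pasting this output into the question, together with the line that creates exampleSoup, would let others try to reproduce the empty list with the same versions.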

Tags: python beautifulsoup css-selectors

【Solution 1】:

I think the Python version is causing this problem.

I am using: Python 3.6.2 and Beautiful Soup 4.6.0

Here is my approach

from bs4 import BeautifulSoup

content = '<html><head><title>The Website Title</title></head><body><p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p><p class="slogan">Learn Python the easy way!</p><p>By <span id="author">Al Sweigart</span></p></body></html>'
soup = BeautifulSoup(content, 'html.parser')

result1 = soup.select("[id='author']")
print(result1)  # output: [<span id="author">Al Sweigart</span>]

result2 = soup.select('#author')
print(result2)  # output: [<span id="author">Al Sweigart</span>]

result3 = soup.select('span#author')
print(result3)  # output: [<span id="author">Al Sweigart</span>]

result4 = soup.span  # this is how the documentation does it
print(result4)  # output: <span id="author">Al Sweigart</span>

【Discussion】:

That's possible. I am using Python 3.6.5.

【Solution 2】:

from bs4 import BeautifulSoup

htmlFile = """<html>
<head>
<title>The Website Title</title>
</head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body>
</html>"""

soup = BeautifulSoup(htmlFile, 'html.parser')
print(soup.select("#author"))

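I got the desired output: [<span id="author">Al Sweigart</span>]. Perhaps you are using an older version of the module.

【Discussion】:

I just installed BeautifulSoup 4.6.0. Oddly, I ran this program as you described and got the desired output. But then I ran elems = soup.select("#author") followed by the print command, and it gave me an empty list again!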
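One way to narrow down a discrepancy like this is to run a CSS-selector lookup and a plain attribute lookup against the same soup object and compare the results. The snippet below is only a sketch and not something proposed in the thread. It assumes bs4 is installed, uses a shortened copy of the author line from the question, and the names by_selector and by_attribute are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the author paragraph from the question's HTML file.
html = '<p>By <span id="author">Al Sweigart</span></p>'
soup = BeautifulSoup(html, "html.parser")

by_selector = soup.select("#author")       # lookup through the CSS selector engine
by_attribute = soup.find_all(id="author")  # lookup by attribute, no CSS selector involved

print(by_selector)
print(by_attribute)
```

If find_all(id="author") finds the span while select("#author") does not, the selector handling in the installed version is the likely suspect. If both come back empty, the soup object being queried probably does not contain the expected markup.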