beautifulsoup - extracting link, text, and title within child div (answers)

【Question title】: beautifulsoup - extracting link, text, and title within child div
【Posted】: 2018-04-16 16:23:07
【Question】:

The layout is as follows:

<div class="App">
    <div class="content">
        <div class="title">Application Name #1</div>
        <div class="image" style="background-image: url(https://img_url)">
        </div>
        <a href="http://app_url" class="signed button">install app</a>
    </div>
</div>

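I am trying to get the TITLE and then the APP_URL. Ideally, when I print the result as HTML, I would like the TITLE to be a hyperlink to the APP_URL.

My code is below, but it does not produce the desired result. I believe I need to add another command inside the loop to get the title. The only question is: how do I make sure I grab the TITLE and the APP_URL together so that they stay paired? There are at least 15 apps with the class <div class="App">, and of course I want all 15 results.

Important: for the href link, I need the one from the class named "signed button".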

soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
    a = div.findAll('a')[1]
    print a.text.strip(), '=>', a.attrs['href']

【Comments】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

Maybe something like this would work?

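from bs4 import BeautifulSoup

# `example` is the HTML string shown in the question
soup = BeautifulSoup(example, 'html.parser')
for div in soup.find_all('div', {'class': 'App'}):
    a = div.find_all('a')[0]  # first <a> inside this App block
    print(div.find_all('div', {'class': 'title'})[0].text, '=>', a.attrs['href'])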

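【Comments】:

It is important that the href link comes from the class named "signed button". Also, there is a "Load more" tab further down the site, so it won't fetch everything unless the end user clicks "Load more". How can I get past this?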
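
One possible way to honor the "signed button" requirement from the comment above is a CSS selector that matches only anchors carrying both classes. The following is a minimal sketch, not the answerer's code: the `sample` markup (including the extra decoy link and its `http://other_url` address) is hypothetical, and the built-in `html.parser` is assumed so that no extra parser needs to be installed. It does not address the "Load more" part of the comment.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first <a> is a decoy, the second carries both classes.
sample = """
<div class="App">
    <div class="content">
        <div class="title">Application Name #1</div>
        <a href="http://other_url" class="button">details</a>
        <a href="http://app_url" class="signed button">install app</a>
    </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

pairs = []
for app in soup.select("div.App"):
    title = app.select_one("div.title")
    # "a.signed.button" matches only anchors that have BOTH classes.
    link = app.select_one("a.signed.button")
    if title is None or link is None:
        continue  # skip App blocks that lack either piece
    pairs.append((title.get_text(strip=True), link["href"]))

for name, url in pairs:
    print(name, "=>", url)
```

Because the title and the link are both looked up inside the same `div.App` element, each pair stays together regardless of how many App blocks the page contains.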

【Solution 2】:

Using CSS selectors:

from bs4 import BeautifulSoup

html = """
<div class="App">
    <div class="content">
        <div class="title">Application Name #1</div>
        <div class="image" style="background-image: url(https://img_url)">
        </div>
        <a href="http://app_url" class="signed button">install app</a>
    </div>
</div>"""

soup = BeautifulSoup(html, 'html5lib')

for div in soup.select('div.App'):
    title = div.select_one('div.title')
    link = div.select_one('a')

    print("Click here: <a href='{}'>{}</a>".format(link["href"], title.text))

This yields:

Click here: <a href='http://app_url'>Application Name #1</a>
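
Since the question asks for every App block and for the title to become a hyperlink in HTML output, the loop above could be wrapped in a small helper that returns one anchor per block. This is only a sketch under a few assumptions: Python 3 (for `html.escape`), the built-in `html.parser`, and a hypothetical helper name `render_links` with made-up sample URLs. Escaping is added so that titles containing characters such as `&` do not break the generated markup.

```python
from html import escape

from bs4 import BeautifulSoup


def render_links(markup):
    """Return one '<a href=...>title</a>' string per div.App block (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for app in soup.select("div.App"):
        title = app.select_one("div.title")
        link = app.select_one("a.signed.button")
        # Skip blocks that are missing the title, the link, or its href.
        if title is None or link is None or not link.has_attr("href"):
            continue
        lines.append('<a href="{}">{}</a>'.format(
            escape(link["href"], quote=True),
            escape(title.get_text(strip=True)),
        ))
    return lines


# Hypothetical two-app sample; the second title contains an ampersand.
sample = """
<div class="App"><div class="content">
    <div class="title">Application Name #1</div>
    <a href="http://app_url_1" class="signed button">install app</a>
</div></div>
<div class="App"><div class="content">
    <div class="title">Tools &amp; Games</div>
    <a href="http://app_url_2" class="signed button">install app</a>
</div></div>
"""

links = render_links(sample)
print("\n".join(links))
```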

【Comments】:

I get this error: soup = BeautifulSoup(my_url, 'html5lib') File "/Library/Python/2.7/site-packages/beautifulsoup4-4.6.0-py2.7.egg/bs4/__init__.py", line 165, in __init__ bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library? However, when I run pip install html5lib, I get this message: Requirement already satisfied:
Can we chat privately?
OK, I changed it to what you said, but it only produced half of the results. Weird. Oh, shoot. There is a "Load more" tab further down the site, so it won't fetch everything unless the end user clicks "Load more". How can I get past this?
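
Regarding the `FeatureNotFound` error in the comments above: one common cause, though not confirmed for this case, is that `pip` installed html5lib into a different Python installation than the one running the script. A possible workaround, sketched below with a hypothetical helper name `make_soup`, is to fall back to the `html.parser` backend that ships with Python when html5lib cannot be found. The two parsers can build slightly different trees for malformed markup, so results may differ on messy pages.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with html5lib when available, else with the built-in parser (hypothetical helper)."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is not importable from this interpreter; use the stdlib parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<div class="App"><div class="title">Application Name #1</div></div>')
title_text = soup.select_one("div.App div.title").get_text(strip=True)
print(title_text)
```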