Python BeautifulSoup: getting the href of an a tag (answers)

【Question title】: Python Beautifulsoup, get href tag, in a tag
【Posted】: 2020-12-06 11:41:14
【Question】:

I'm having trouble getting the href attribute. Here is my situation; this is the HTML:

<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">

 </a>
 <a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">

 </a>
</div>

This is my code:

    def get_category_item_list(category):
        base_url = 'https://www.website.com/'
        res = session.get(base_url+category)
        res = BeautifulSoup(res.content, 'html.parser')
        all_title = res.findAll('a', attrs={'class':'frame-item'})
        data_titles = []
        for title in all_title:
            product_link = title.get('a')['href']
            data_titles.append(product_link)
        return data_titles

What I want to get are the href links, like this:

produk-a.html
produk-b.html

When I run it, I don't get the href links; instead it raises this error:

TypeError: 'NoneType' object is not subscriptable

【Comments】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

I believe your problem is in this line:

product_link = title.get('a')['href']

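On a tag, get() looks up an HTML attribute, so title.get('a') searches for an attribute named "a", finds none, and returns None; subscripting None with ['href'] then raises the TypeError. You are already iterating over a list of "a" elements, so you probably just need: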

product_link = title['href']
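
To make the fix concrete, here is a minimal sketch of the corrected loop. It parses an inline copy of the question's HTML instead of fetching a page, and the function name extract_links is hypothetical (it is not from the question).

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question, so the sketch runs without a network call.
HTML = """<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">
 </a>
 <a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">
 </a>
</div>"""


def extract_links(html):
    """Return the href of every <a class="frame-item"> in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for anchor in soup.find_all('a', attrs={'class': 'frame-item'}):
        # anchor.get('a') would be None here: get() reads attributes, not child tags.
        links.append(anchor['href'])
    return links


print(extract_links(HTML))
# ['./produk-a.html', './produk-b.html']
```

The hrefs come back exactly as written in the markup, including the leading "./".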

【Comments】:

【Solution 2】:

To get your exact output:

You are already iterating over the anchor tags.
You need to split on "/" and take the last element.

from bs4 import BeautifulSoup


html = """<div class="list-product with-sidebar">
 <a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">

 </a>
 <a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">

 </a>
</div>"""

res = BeautifulSoup(html, 'html.parser')

for a in res.findAll('a', attrs={'class':'frame-item'}):
    print(a["href"].split("/")[-1])

Output:

produk-a.html
produk-b.html
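
Splitting on "/" works for the hrefs shown here, but it would keep a query string or fragment if one were present. As a hedged alternative, the sketch below uses only the standard library to take the last path segment and to build an absolute URL. The helper names file_name and absolute_url are hypothetical, and BASE_URL is the placeholder address from the question.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

BASE_URL = 'https://www.website.com/'  # placeholder address taken from the question


def file_name(href):
    """Return the last path segment of an href, ignoring any query or fragment."""
    return posixpath.basename(urlsplit(href).path)


def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the base URL."""
    return urljoin(base, href)


print(file_name('./produk-a.html'))        # produk-a.html
print(file_name('./produk-a.html?ref=1'))  # produk-a.html
print(absolute_url('./produk-a.html'))     # https://www.website.com/produk-a.html
```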

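【Comments】:

【Solution 3】:

You did not share the website with us, so one possible problem is that the site blocks user agents that look like bots (such as the default user agent of requests). Debugging may help here: you can print the page content with resp.content or resp.text.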
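
One way to check this is sketched below, assuming the requests library is what sits behind the session object in the question. The User-Agent string and the helper names make_session and debug_fetch are hypothetical; whether the site actually filters on user agent is not known.

```python
import requests

# Hypothetical browser-like User-Agent; substitute any realistic value.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/120.0 Safari/537.36')


def make_session(user_agent=USER_AGENT):
    """Create a session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def debug_fetch(session, url, preview=500):
    """Fetch a page and print its status code and the start of its body."""
    resp = session.get(url, timeout=10)
    print(resp.status_code)
    print(resp.text[:preview])
    return resp
```

Calling debug_fetch(make_session(), base_url + category) and checking whether the printed HTML contains the frame-item anchors shows whether the parsing or the download is at fault.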

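I created an HTML file named index.html, read it, and scraped its contents. I changed the code slightly, and it appears to work.

soup.find returns a <class 'bs4.element.Tag'>, so you can read its attributes with subscript syntax, for example tag['href'].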

from bs4 import BeautifulSoup

with open('index.html') as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
data_titles = []
for a in soup.find('div', class_='list-product with-sidebar').find_all('a'):
    # Take the last path segment so deeper paths such as ./dir/file.html also work.
    data_titles.append(a['href'].split('/')[-1])
print(data_titles)
# ['produk-a.html', 'produk-b.html']

index.html

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Document</title>
    </head>
    <body>
        <div class="list-product with-sidebar">
            <a
                class="frame-item"
                href="./produk-a.html"
                target="_blank"
                title="Produk A"
            >
            </a>
            <a
                class="frame-item"
                href="./produk-b.html"
                target="_blank"
                title="Produk B"
            >
            </a>
        </div>
    </body>
</html>
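
As a small illustration of the attribute access described in this answer, the sketch below compares subscript access with get() on a single Tag. The one-line HTML string is a trimmed copy of an anchor from index.html, and data-id is a hypothetical attribute that the tag does not have.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="frame-item" href="./produk-a.html" title="Produk A"></a>',
    'html.parser',
)
tag = soup.find('a')

href = tag['href']            # subscript access: raises KeyError if the attribute is absent
missing = tag.get('data-id')  # get() returns None for an absent attribute
classes = tag['class']        # class is multi-valued, so it comes back as a list

print(href)     # ./produk-a.html
print(missing)  # None
print(classes)  # ['frame-item']
```

Using get() is the safer choice when some anchors on a page might lack an href.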

【Comments】: