Retrieving a URL from within <content:encoded> using BeautifulSoup

[Question title]: Retrieving a URL from within <content:encoded> using BeautifulSoup
[Posted]: 2022-02-06 18:35:56
[Question description]:

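I'm struggling to retrieve the link to an image from an RSS feed. I'm basically trying to get the URL out of the 'src=' attribute, but nothing I've tried seems to extract it.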

<content:encoded>&lt;h4&gt;Using sklearn’s GridSearchCV on random forest model&lt;/h4&gt;&lt;figure&gt;&lt;img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M-LcJEuYvBjUFh1DhSOicA.jpeg" /&gt;&lt;figcaption&gt;Image by Annie Spratt via Unsplash&lt;/figcaption&gt;&lt;/figure&gt;&lt;p&gt;Finding the optimal tuning parameters for a machine learning problem can often be very difficult. We may encounter &lt;strong&gt;overfitting,&lt;/strong&gt; which means our machine learning model trains too specifically on our training dataset and causes higher levels of error when applied to our test/holdout datasets. Or, we may run into &lt;strong&gt;underfitting,&lt;/strong&gt; which means our model doesn’t train specifically enough to our training dataset. </content:encoded>

Below is the code I've been trying so far.

from bs4 import BeautifulSoup
import requests

resp = requests.get("https://towardsdatascience.com/feed")
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')
content_item = {}
content_item['title'] = items[0].title.text
content_item['link'] = items[0].link.text
content_item['Twitter'] = '@TDataScience'
content_item['Media'] = items[0].encoded['src']

As always, any help you can provide would be greatly appreciated.

Thanks in advance.

[Question comments]:

Tags: python html web-scraping beautifulsoup

[Solution 1]:

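The first problem is that some items don't have a <content:encoded> tag at all, which is why you get a NoneType error when you try to access its contents. Even if every item had the tag, you still couldn't read the URL directly, because the markup inside it is XML-escaped (as the tag name suggests): it is stored as text, not as nested tags, so there is no <img> element or 'src' attribute to look up yet. You therefore need to decode it with html.unescape() (or any other decoder that suits your needs) and parse the result before doing anything else: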

import requests
import html
from bs4 import BeautifulSoup

resp = requests.get("https://towardsdatascience.com/feed")
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')

content_item = {}
for each_item in items[:5]: # using first 5 elements just to test
    content_item['title'] = each_item.title.text
    content_item['link'] = each_item.link.text
    content_item['Twitter'] = '@TDataScience'
    
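    encoded = each_item.find('content:encoded')
    if encoded is not None:
        # decode and form the new soup
        decoded_html = BeautifulSoup(html.unescape(encoded.text), 'lxml')
        first_img = decoded_html.find('img')
        # a body may contain no <img> at all, so guard before reading 'src'
        content_item['Media'] = first_img.get('src') if first_img else None
    else:
        content_item['Media'] = None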

    print(content_item)

The output looks like this:

{'title': '3 Steps to Getting a Job in Data with Zero Experience', 'link': 'https://towardsdatascience.com/3-steps-to-getting-a-job-in-data-with-zero-experience-ccaad96d6477?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': None}
{'title': 'AI for painting: Unraveling Neural Style Transfer', 'link': 'https://towardsdatascience.com/ai-for-painting-unraveling-neural-style-transfer-5ac08a20a580?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/0*DQt1CKiJSDMrzaWA'}
{'title': 'A Novel Way to Use Batch Normalization', 'link': 'https://towardsdatascience.com/a-novel-way-to-use-batch-normalization-837176d53525?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/0*rRQ5moh4bTSCY1zR'}
{'title': 'How to Build A Pooled OLS Regression Model For Panel Data Sets', 'link': 'https://towardsdatascience.com/how-to-build-a-pooled-ols-regression-model-for-panel-data-sets-a78358f9c2a?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/1*nv5gPBul4YsKGctA7b4OZg.png'}
{'title': 'Understanding the native R pipe |>', 'link': 'https://towardsdatascience.com/understanding-the-native-r-pipe-98dea6d8b61b?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/1*pnjfZFqrY1opjEVNOSjcSg.png'}
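
The two failure modes described in this answer (an item with no <content:encoded> tag, and markup stored as escaped text) can be reproduced offline. The following is only a minimal sketch, not the answerer's code: it swaps BeautifulSoup for Python's standard library (xml.etree.ElementTree and html.parser) so that it runs without a network connection, and SAMPLE_FEED, FirstImageSrc, first_image and the example.com URL are hypothetical names and data made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical two-item feed, trimmed down for illustration (not the real feed).
SAMPLE_FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Post without body</title>
    </item>
    <item>
      <title>Post with body</title>
      <content:encoded>&lt;figure&gt;&lt;img alt="" src="https://example.com/a.jpeg" /&gt;&lt;/figure&gt;</content:encoded>
    </item>
  </channel>
</rss>"""

# ElementTree addresses namespaced tags as "{namespace-uri}localname".
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"


class FirstImageSrc(HTMLParser):
    """Remember the src attribute of the first <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image(item):
    """Return the first image URL in an item's body, or None if there is none."""
    encoded = item.find(CONTENT_TAG)
    if encoded is None or not encoded.text:
        return None  # failure mode 1: the item has no <content:encoded> tag
    # failure mode 2: the body is plain text here, so it needs its own HTML parse
    parser = FirstImageSrc()
    parser.feed(encoded.text)
    return parser.src


root = ET.fromstring(SAMPLE_FEED)
items = root.findall("./channel/item")
body = items[1].find(CONTENT_TAG)
children = len(body)   # number of child elements under <content:encoded>
media = [first_image(item) for item in items]
print(children, media)
```

Here children is 0 because the escaped markup is a single text node rather than nested elements, which is why looking up 'src' on the tag itself cannot work.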

Note that a <content:encoded> tag can contain multiple <img> tags; I just took the first one as an example.
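
If every image is wanted rather than just the first, find_all can collect them. This is a rough sketch under stated assumptions: decoded_body is a hypothetical, already-decoded body string (not taken from the feed), all_image_urls is a made-up helper name, and the built-in 'html.parser' backend is used instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical decoded body of one <content:encoded> element (illustrative data).
decoded_body = (
    '<figure><img alt="" src="https://example.com/first.jpeg" /></figure>'
    '<p>Some text</p>'
    '<figure><img alt="" src="https://example.com/second.png" /></figure>'
    '<img alt="image tag without a src attribute" />'
)


def all_image_urls(html_text):
    """Return the src of every <img> that actually has a src attribute."""
    soup = BeautifulSoup(html_text, "html.parser")
    # src=True keeps only tags where the attribute is present
    return [img["src"] for img in soup.find_all("img", src=True)]


urls = all_image_urls(decoded_body)
print(urls)
```

In the loop from the answer, this would mean storing a list under 'Media' instead of a single string, which may or may not suit how the dictionary is consumed later.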

[Comments]: