[Question title]: Retrieving url from within <content:encoded> using BeautifulSoup
[Posted]: 2022-02-06 18:35:56
[Question description]:

I am struggling to retrieve the link to an image from an RSS feed. I am basically trying to get the url out of 'src=', but nothing I have tried manages to extract it.

<content:encoded>&lt;h4&gt;Using sklearn’s GridSearchCV on random forest model&lt;/h4&gt;&lt;figure&gt;&lt;img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M-LcJEuYvBjUFh1DhSOicA.jpeg" /&gt;&lt;figcaption&gt;Image by Annie Spratt via Unsplash&lt;/figcaption&gt;&lt;/figure&gt;&lt;p&gt;Finding the optimal tuning parameters for a machine learning problem can often be very difficult. We may encounter &lt;strong&gt;overfitting,&lt;/strong&gt; which means our machine learning model trains too specifically on our training dataset and causes higher levels of error when applied to our test/holdout datasets. Or, we may run into &lt;strong&gt;underfitting,&lt;/strong&gt; which means our model doesn’t train specifically enough to our training dataset. </content:encoded>

Below is the code I have been trying so far.

from bs4 import BeautifulSoup
import requests

resp = requests.get("https://towardsdatascience.com/feed")
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')
content_item = {}
content_item['title'] = items[0].title.text
content_item['link'] = items[0].link.text
content_item['Twitter'] = '@TDataScience'
content_item['Media'] = items[0].encoded['src']

As always, any help you can offer would be much appreciated.

Thanks in advance.

[Question discussion]:

    Tags: python html web-scraping beautifulsoup


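    [Solution 1]:

    The first problem is that some items have no <content:encoded> tag, so looking it up gives None, and any further access on that result raises a NoneType error. Even if every item had the tag, you still could not read the url directly, because the tag's content is escaped markup (as the name suggests): the <img> is plain text inside it, not a child tag. Before doing anything else you therefore need to decode it with html.unescape() (or any other decoder that fits your needs) and parse the result as HTML: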

    import requests
    import html
    from bs4 import BeautifulSoup
    
    resp = requests.get("https://towardsdatascience.com/feed")
    soup = BeautifulSoup(resp.content, features='xml')
    items = soup.findAll('item')
    
    content_item = {}
    for each_item in items[:5]: # using first 5 elements just to test
        content_item['title'] = each_item.title.text
        content_item['link'] = each_item.link.text
        content_item['Twitter'] = '@TDataScience'
        
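        encoded = each_item.find('content:encoded')
        if encoded is not None:
            # the payload is escaped HTML: unescape it, then parse it as a new soup
            decoded_html = BeautifulSoup(html.unescape(encoded.text), 'lxml')
            first_img = decoded_html.find('img')
            # some posts contain no image at all, so guard before reading src
            content_item['Media'] = first_img.get('src') if first_img else None
        else:
            content_item['Media'] = None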
    
        print(content_item)
    

    The output looks like this:

    {'title': '3 Steps to Getting a Job in Data with Zero Experience', 'link': 'https://towardsdatascience.com/3-steps-to-getting-a-job-in-data-with-zero-experience-ccaad96d6477?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': None}
    {'title': 'AI for painting: Unraveling Neural Style Transfer', 'link': 'https://towardsdatascience.com/ai-for-painting-unraveling-neural-style-transfer-5ac08a20a580?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/0*DQt1CKiJSDMrzaWA'}
    {'title': 'A Novel Way to Use Batch Normalization', 'link': 'https://towardsdatascience.com/a-novel-way-to-use-batch-normalization-837176d53525?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/0*rRQ5moh4bTSCY1zR'}
    {'title': 'How to Build A Pooled OLS Regression Model For Panel Data Sets', 'link': 'https://towardsdatascience.com/how-to-build-a-pooled-ols-regression-model-for-panel-data-sets-a78358f9c2a?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/1*nv5gPBul4YsKGctA7b4OZg.png'}
    {'title': 'Understanding the native R pipe |>', 'link': 'https://towardsdatascience.com/understanding-the-native-r-pipe-98dea6d8b61b?source=rss----7f60cf5620c9---4', 'Twitter': '@TDataScience', 'Media': 'https://cdn-images-1.medium.com/max/1024/1*pnjfZFqrY1opjEVNOSjcSg.png'}
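
The effect of the decoding step can be checked offline, without fetching the feed. The following is only a minimal sketch using the standard library: `ImgSrcCollector` and `img_sources` are hypothetical helper names, and the `example.com` payload is made up to mimic the shape of the feed content. It shows that a parser sees no `<img>` tag in the escaped text, and does see it once `html.unescape()` has been applied.

```python
import html
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every <img> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources(markup):
    """Parse markup and return the src values of the <img> tags found in it."""
    collector = ImgSrcCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


# Made-up sample shaped like the feed payload: the tags are escaped text.
escaped = '&lt;figure&gt;&lt;img alt="" src="https://example.com/a.jpeg" /&gt;&lt;/figure&gt;'

print(img_sources(escaped))                 # escaped text contains no real tags
print(img_sources(html.unescape(escaped)))  # after unescaping, the <img> is a tag
```

The same idea applies with BeautifulSoup as the parser: the lookup only succeeds on the unescaped string.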
    

    Note that a <content:encoded> tag can contain several <img> tags; I only take the first one as an example.
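
If every image is wanted rather than only the first, one possible approach is `find_all` on the decoded soup. This is a sketch under assumptions: the `decoded` string is a made-up, already unescaped payload with placeholder `example.com` URLs, and the built-in `html.parser` backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up decoded payload with several images (URLs are placeholders).
decoded = (
    '<figure><img alt="" src="https://example.com/first.png" /></figure>'
    '<p>Some text</p>'
    '<img src="https://example.com/second.png" />'
    '<img alt="image tag without a src attribute" />'
)

soup = BeautifulSoup(decoded, "html.parser")

# find_all returns every matching tag; src=True keeps only tags that carry a src attribute
all_sources = [img["src"] for img in soup.find_all("img", src=True)]
print(all_sources)
```

Filtering on `src=True` avoids a KeyError for image tags that have no `src` attribute.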

    [Discussion]:
