【Question Title】: How to filter out unwanted tags within tags
【Posted】: 2016-05-10 15:23:04
【Question Description】:

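I am trying to extract the bulleted list from this page: http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/

Specifically, the bullets highlighted in yellow in the screenshot below.

First, I use Beautiful Soup to select all <ul> tags that have no attributes: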

import requests
from bs4 import BeautifulSoup
text = BeautifulSoup(requests.get('http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/', timeout=7.00).text, 'html.parser')
bullets = text.find_all(lambda tag: tag.name == 'ul' and not tag.attrs)

Here are the two <ul> tags that are returned:

<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>

<ul><li class="share-item share-fb" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="facebook" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Facebook"></li><li class="share-item share-tw" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="twitter" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Twitter"></li><li class="share-item share-gp" data-lang="en-US" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="googlePlus" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Google+"></li><li class="share-item share-pn" data-media="http://bodetree.com/wp-content/uploads/2015/04/pain-points.png" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="pinterest" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Pinterest"></li></ul>

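I only want to extract the <ul> tags that appear in the body of the page, so I want to filter out the second <ul> tag, which contains junk. The <ul> tags that do not appear in the page body seem to have <li> tags with attributes, so we can filter on that. Basically, all I want are tag structures of the form <ul><li>string</li></ul>. So in this case, the only <ul> I want returned is: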

<ul> 
<li>You are experiencing a decrease in sales and customers</li> 
<li>If your brand design does not reflect what you deliver</li> 
<li>If you want to attract a new target audience</li> 
<li>Management change</li> 
<li>19 Questions to Ask Yourself Before You Start Rebranding</li>
</ul> 

Is there a way to do this with find_all()?

【Question Discussion】:

    Tags: python tags beautifulsoup


    【Solution 1】:

    Search for the ul elements inside the article body, i.e. the div with class="entry-content":

    import requests
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(requests.get('http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/', timeout=7.00).text, 'html.parser')
    
    bullets = soup.select("div.entry-content ul li")
    print([bullet.get_text() for bullet in bullets])
    

    Prints:

    [
        'You are experiencing a decrease in sales and customers', 
        'If your brand design does not reflect what you deliver', 
        'If you want to attract a new target audience', 
        'Management change', 
        '19 Questions to Ask Yourself Before You Start Rebranding'
    ]
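
The question asked specifically about find_all(). As a rough sketch of that route, find_all() accepts a function as its filter, so the rule described in the question (an attribute-free ul whose li children are also attribute-free) can be written directly. The inline HTML below is a made-up stand-in for the downloaded page, and is_plain_list is a hypothetical helper name, not something from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """
<div class="entry-content">
  <ul>
    <li>Management change</li>
    <li><a href="http://example.com/">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
  </ul>
</div>
<ul><li class="share-item share-fb" title="Facebook"></li></ul>
"""


def is_plain_list(tag):
    """Match a <ul> without attributes whose direct <li> children also have none."""
    if tag.name != "ul" or tag.attrs:
        return False
    items = tag.find_all("li", recursive=False)
    return bool(items) and all(not li.attrs for li in items)


soup = BeautifulSoup(html, "html.parser")
plain_lists = soup.find_all(is_plain_list)

# Collect the text of each bullet; nested <a> tags are flattened by get_text().
texts = [
    li.get_text(strip=True)
    for ul in plain_lists
    for li in ul.find_all("li", recursive=False)
]
print(texts)
```

This depends on the junk lists always carrying attributes on their li tags, which is an observation about this one page. Scoping the search to div.entry-content, as in the answer, is likely to be more robust.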
    

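    【Comments】:

    • I would also like to include the paragraph that precedes each list, to borrow its context, and I want it all to be a single string. So I do bullets = text.select("div.entry-content ul"), then uls_with_ps = [(ul.findPrevious('p'), ul) for ul in bullets], and print [j.get_text() for j in uls_with_ps], but I get an error saying "'tuple' object has no attribute 'get_text'". Is there an easy way to do this?
    • @MikaSchiller To get the preceding paragraph, locate the ul element and call find_previous_sibling("p") on it. To make it all a single string, use join(), for example: " ".join([bullet.get_text() for bullet in bullets])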
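
The error in the first comment arises because each element of uls_with_ps is a (p, ul) tuple, and get_text() exists on tags, not on tuples. A minimal sketch of the suggestion in the reply follows. The inline HTML and the paragraph wording are invented for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the <p> text is invented for this example.
html = """
<div class="entry-content">
  <p>Consider rebranding when:</p>
  <ul>
    <li>Management change</li>
    <li>If you want to attract a new target audience</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

combined = []
for ul in soup.select("div.entry-content ul"):
    parts = []
    # The nearest preceding <p> sibling, or None if the list has no intro paragraph.
    intro = ul.find_previous_sibling("p")
    if intro is not None:
        parts.append(intro.get_text(strip=True))
    # Call get_text() on each tag individually, never on a tuple of tags.
    parts.extend(li.get_text(strip=True) for li in ul.find_all("li"))
    combined.append(" ".join(parts))

print(combined)
```

Each entry of combined is one string holding the intro paragraph followed by its bullets. Note that find_previous_sibling("p") only looks at siblings of the ul, so a list wrapped in an extra container would need a different lookup.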