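【Question Title】:Beautiful Soup Child Tags Left Over After Extract
【Posted】:2013-05-15 00:09:04
【Question】:

When I use a list comprehension to extract the tags I don't want, some of the tags that should have been removed are still there.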

import requests, pprint
from bs4 import BeautifulSoup as bs

blacklist = ['a', 'title', 'p', 'input', 'u', 'body', 'html',
         'textarea', 'nobr', 'b', 'span', 'td', 'tr', 
         'br', 'table', 'form', 'img', 'head', 'meta', 
         'script', 'style', 'center',]

soup = bs(requests.get('http://www.google.com').text)

soup = [s.extract() for s in soup() if s.name not in blacklist]

# when printing the tag names, the only tag shown is div.
# pprint.pprint( [s.name for s in soup] )

# inside the divs are tags that we don't want.
pprint.pprint(soup)

Output

[<div id="mngb"></div>,
 <div id="gbar"><nobr><b class="gb1">Search</b> <a class="gb1" href="http://www.google.com/imghp?hl=en&amp;tab=wi">Images</a> <a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a> <a class="gb1" href="http://www.youtube.com/?tab=w1">YouTube</a> <a class="gb1" href="http://news.google.com/nwshp?hl=en&amp;tab=wn">News</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a> <a class="gb1" href="http://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a></nobr></div>,
 <div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a> | <a class="gb4" href="/preferences?hl=en">Settings</a> | <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;continue=http://www.google.com/" id="gb_70" target="_top">Sign in</a></nobr></div>,
 <div class="gbh" style="left:0"></div>,
 <div class="gbh" style="right:0"></div>,
 <div id="lga"><img alt="Google" height="95" id="hplogo" onload="window.lol&amp;&amp;lol()" src="/intl/en_ALL/images/srpr/logo1w.png" style="padding:28px 0 14px" width="275"/><br/><br/></div>,
 <div class="ds" style="height:32px;margin:4px 0"><input autocomplete="off" class="lst" maxlength="2048" name="q" size="57" style="color:#000;margin:0;padding:5px 8px 0 6px;vertical-align:top" title="Google Search" value=""/></div>,
 <div id="gac_scont"></div>,
 <div style="font-size:83%;min-height:3.5em"><br/></div>,
 <div style="font-size:10pt"></div>,
 <div id="fll" style="margin:19px auto;text-align:center"><a href="/intl/en/ads/">Advertising Programs</a><a href="/services/">Business Solutions</a><a href="https://plus.google.com/116899029375914044550" rel="publisher">+Google</a><a href="/intl/en/about.html">About Google</a></div>,
 <div id="xjsd"></div>,
 <div id="xjsi"><script>if(google.y)google.y.first=[];(function(){function b(a){window.setTimeout(function(){var c=document.createElement("script");c.src=a;document.getElementById("xjsd").appendChild(c)},0)}google.dljp=function(a){google.xjsi||(google.xjsu=a,b(a))};google.dlj=b;})();
if(!google.xjs){google.dstr=[];google.rein=[];window._=window._||{};window._._DumpException=function(e){throw e};if(google.timers&amp;&amp;google.timers.load.t){google.timers.load.t.xjsls=new Date().getTime();}google.dljp('/xjs/_/js/k\x3dPxufaYa-26A.en_US./m\x3dsb_he,pcc/rt\x3dj/d\x3d1/sv\x3d1/rs\x3dAItRSTNuFuVo3tYsbamkH3IQObWPur6JEA');google.xjs=1;}google.pmc={"sb":{"agen":true,"cgen":true,"client":"heirloom-hp","dh":true,"ds":"","eqch":true,"fl":true,"host":"google.com","jsonp":true,"msgs":{"lcky":"I\u0026#39;m Feeling Lucky","lml":"Learn more","oskt":"Input tools","psrc":"This search was removed from your \u003Ca href=\"/history\"\u003EWeb History\u003C/a\u003E","psrl":"Remove","sbit":"Search by image","srch":"Google Search"},"ovr":{"l":1,"ms":1},"pq":"","qcpw":false,"scd":10,"sce":5,"stok":"btuwXqiMkjlVCutQ1U6PC2HrVdE"},"hp":{},"pcc":{}};google.y.first.push(function(){if(google.med){google.med('init');google.initHistory();google.med('history');}google.History&amp;&amp;google.History.initialize('/');google.hs&amp;&amp;google.hs.init&amp;&amp;google.hs.init()});if(google.j&amp;&amp;google.j.en&amp;&amp;google.j.xi){window.setTimeout(google.j.xi,0);}</script></div>]

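How do I remove the unwanted child tags of the tags I do want? More specifically, I need an approach that works in all cases; this code is just a simple example.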

【Comments】:

    Tags: python tags beautifulsoup


    【Solution 1】:

    Try this:

    blacklist = ['a', 'title', 'p', 'input', 'u', 'body', 'html','textarea', 'nobr', 'b', 'span', 'td', 'tr', 'br', 'table', 'form', 'img', 'head', 'meta', 'script', 'style', 'center']
    soup = [tag for tag in soup.findAll(True) if tag.name not in blacklist]
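
A small sketch, not part of the original answer, of why filtering by name alone leaves the unwanted children in place. It assumes a hand-written HTML fragment instead of the live Google page and the built-in "html.parser"; the variable name `kept` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment standing in for the real page (assumption).
html = '<div id="gbar"><nobr><b>Search</b> <a href="#">Images</a></nobr></div>'
soup = BeautifulSoup(html, "html.parser")

blacklist = ["a", "b", "nobr"]

# The comprehension only decides which tags end up in the list.
# Every selected tag still carries its whole subtree with it.
kept = [tag for tag in soup.find_all(True) if tag.name not in blacklist]

print([tag.name for tag in kept])  # names of the selected tags
print(kept[0])                     # the div, printed with its children
```

Only the div is selected, but printing it still shows the nobr, b and a tags nested inside, which matches the output in the question.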
    

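    【Comments】:

    • What does it do differently? If I print soup, it doesn't contain any of the tags you want removed.
    • Also note that you are missing a comma after 'html' in your tag list.
    • Inside the divs, if you look, there are script/img/a/nobr/etc. tags; for my test I want every tag except div removed.
    【Solution 2】: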

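    As I understand your desired result, I think you should remove the tags by processing the HTML document as text (literally), rather than by iterating over tags and BS objects.

    If the objects are treated as a tree, I mean hierarchically, how would you handle the following case?

    You want to remove every 'a' tag but not 'div', and the document has a tree path like this: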

    <a>
       <div>
           <a>
               <div>
                    text
               </div>
           </a>
       </div>
    </a>
    

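    If you remove the topmost 'a' node, you also remove all of its descendant nodes. If you process the document as text, you should probably remove every literal "<a>" and "</a>" string (in this example). So you should use regular expressions to handle it.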
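
A minimal sketch of the tree behaviour described above, using the same nested structure. It is not from the original answer. It also shows `Tag.unwrap()`, a bs4 method that drops a tag but keeps its children, as one possible alternative to the regex approach the answer suggests. The built-in "html.parser" is assumed because it keeps the nesting as written; other parsers may restructure nested 'a' tags.

```python
from bs4 import BeautifulSoup

html = "<a><div><a><div>text</div></a></div></a>"

# extract() detaches the tag together with everything nested inside it,
# so removing the topmost 'a' leaves nothing behind.
soup = BeautifulSoup(html, "html.parser")
soup.find("a").extract()
after_extract = str(soup)

# unwrap() removes only the tag itself and leaves its children in place.
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("a"):
    tag.unwrap()
after_unwrap = str(soup)

print(repr(after_extract))
print(after_unwrap)
```

Whether unwrapping or text-level removal fits better depends on whether the text inside the removed tags should survive.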

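    【Comments】:

    • You can parse HTML with Python string methods, without regex. However, Python ships a built-in regex module ('re'), which is described in its official documentation (docs.python.org/2/library/re.html). Regex and parsing are beyond the scope of your question and this answer. So using BeautifulSoup does not achieve the desired result.
    【Solution 3】: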

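    After playing with it for a while, I was able to extract the child tags. You need to find the children first and then extract them.

    The code is as follows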

    import requests, pprint
    from bs4 import BeautifulSoup as bs
    
    blacklist = ['a', 'title', 'p', 'input', 'u', 'body', 'html',
             'textarea', 'nobr', 'b', 'span', 'td', 'tr', 
             'br', 'table', 'form', 'img', 'head', 'meta', 
             'script', 'style', 'center',]
    
    soup = bs(requests.get('http://www.google.com').text)
    
    # collect every tag whose name is not blacklisted
    tags = [tag for tag in soup.find_all(True) if tag.name not in blacklist]
    
    # show each kept tag together with the names of its descendants
    tag_tree = [(tag.name, [t.name for t in tag.findChildren()]) for tag in tags ]
    for tree in tag_tree:
        print(tree)
    
    
    # remove children tags that are blacklisted
    for tag in tags:
        for child in tag.findChildren():
            if child.name in blacklist:
                child.extract()
    
    pprint.pprint(tags)
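
For trying the same idea without a network request, here is a rough sketch that wraps the two steps in a function and runs them on a hand-written fragment. The helper name `strip_blacklisted`, the sample HTML and the shortened blacklist are all assumptions made for illustration; the built-in "html.parser" is assumed as well.

```python
from bs4 import BeautifulSoup


def strip_blacklisted(html, blacklist):
    """Hypothetical helper: keep non-blacklisted tags, drop blacklisted descendants."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: collect the tags we want to keep.
    kept = [tag for tag in soup.find_all(True) if tag.name not in blacklist]

    # Step 2: detach every blacklisted descendant of a kept tag.
    # find_all() returns a list up front, so extracting while looping is safe.
    for tag in kept:
        for child in tag.find_all(True):
            if child.name in blacklist:
                child.extract()
    return kept


# Hand-written fragment standing in for the real page (assumption).
html = (
    '<div id="lga"><img alt="Google" src="logo.png"/><br/></div>'
    '<div id="fll"><a href="/ads/">Advertising</a><a href="/about.html">About</a></div>'
)

result = strip_blacklisted(html, ["a", "img", "br"])
print(result)
```

Note that `extract()` takes the link text along with each 'a' tag, so both divs come back empty in this example.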
    
