【Question title】: How to remove content in nested tags with BeautifulSoup?
【Posted】: 2014-03-12 12:23:51
【Question】:

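How do I remove the content inside nested tags with BeautifulSoup? These posts show the reverse, retrieving the content of nested tags: "How to get contents of nested tag using BeautifulSoup" and "BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?"

I tried .text, but it only strips the tags and keeps their content: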

>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something  blah blah something else'

Desired output:

Something something something else

【Comments】:

  • So... in this example you want to remove the content of bar?
  • Should there be an "else" in the second line of code?

Tags: python html nested beautifulsoup


【Solution 1】:

You can check whether each child is a bs4.element.NavigableString and keep only those:

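from bs4 import BeautifulSoup as bs
import bs4

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the direct text children; nested tags (and their text) are skipped
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

# Name the parser explicitly; print() is a function in Python 3
print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))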

Output:

Something something  something  else
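The same filtering can probably be written without a helper generator. The sketch below assumes a reasonably recent bs4 release, where `find_all` accepts `string=True` together with `recursive=False`, so only the direct text children of `<foo>` are returned. The names `foo` and `direct_text` are illustrative and not part of the answer above.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
foo = BeautifulSoup(html, "html.parser").find("foo")

# string=True matches text nodes; recursive=False restricts the search
# to direct children, so text inside <bar> and <bar2> is never visited
direct_text = "".join(foo.find_all(string=True, recursive=False))

# Collapse the double spaces left behind where the nested tags used to be
print(" ".join(direct_text.split()))
```

Like the generator version, this leaves the parsed tree untouched; it only changes which text is collected.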

【Comments】:

    【Solution 2】:

    For example:

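    body = bs(html, 'html.parser')  # name the parser explicitly
    for tag in body.find_all('bar'):
        tag.replace_with('')  # swap each <bar> tag, content included, for an empty string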
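As a possible variation on the loop above, `decompose()` removes a tag and everything inside it from the tree, after which `get_text()` returns only what is left. This is a minimal sketch that assumes the question's intended input (with the trailing "else"); `raw_text` and `clean_text` are illustrative names.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <bar> tag together with its content
for tag in soup.find_all("bar"):
    tag.decompose()

raw_text = soup.foo.get_text()
# Removing the tag leaves two adjacent spaces; normalise the whitespace
clean_text = " ".join(raw_text.split())
print(clean_text)
```

Unlike filtering the children, this modifies the parsed document, so it fits best when the nested tags are not needed afterwards.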
    

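    【Comments】:

      【Solution 3】:

      Here is my simple approach: soup.body.clear() or soup.tag.clear()

      Say you want to clear the content of <table></table> and add a new pandas DataFrame; later you can use this clear method to easily update the table in your web page's HTML file, instead of using Flask/Django: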

          import pandas as pd
          import bs4
      

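      I wanted to convert a 1.2-million-row .csv to a DataFrame, then to an HTML table, and then add it to the HTML of my web page. Later I wanted to update the data easily whenever the csv is updated, simply by switching a variable.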

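          bizcsv = pd.read_csv("business.csv")  # read_csv lives in the pandas namespace
          dframe = pd.DataFrame(bizcsv)
          dfhtml = dframe.to_html()  # convert DataFrame to table, HTML format (note the call)
          # str.strip() removes a set of characters, not a prefix/suffix, so slice
          # between the opening <table ...> tag and the closing </table> instead
          dfhtml_update = dfhtml[dfhtml.index(">") + 1:dfhtml.rindex("</table>")]
          """use dfhtml_update later to update your table without the <table> tags,
          the <table> is easy for BS to select & clear!"""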
      
          #A small function to unescape (&lt; to <) the tags back into HTML format
          def unescape(s):
              s = s.replace("&lt;", "<")
              s = s.replace("&gt;", ">")
              # this has to be last:
              s = s.replace("&amp;", "&")
              return s
      
          with open("page.html") as page:  #return to here when updating
              txt = page.read()
              soup = bs4.BeautifulSoup(txt, features="lxml")
              soup.body.append(dfhtml) #adds table to <body>
              with open("page.html", "w") as outf:
                  outf.write(unescape(str(soup))) #writes to page.html
      
          """lets say you want to make seamless table updates to your 
          webpage instead of using flask or django x_x; return to with open function"""
          soup.table.clear()  #clears everything in <table></table>
          soup.table.append(dfhtml_update)
          with open("page.html", "w") as outf:
              outf.write(unescape(str(soup))) 
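The unescape step above is needed because appending a plain string to the tree makes BeautifulSoup escape it as text. One way to avoid that, sketched here under assumptions, is to parse the new table HTML first and swap it in as a tag. `page_html` and `new_table_html` are hypothetical stand-ins for the contents of page.html and the string returned by `DataFrame.to_html()`; neither is taken from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for page.html and for DataFrame.to_html() output
page_html = "<html><body><h1>Report</h1><table><tr><td>old</td></tr></table></body></html>"
new_table_html = '<table border="1" class="dataframe"><tr><td>new</td></tr></table>'

soup = BeautifulSoup(page_html, "html.parser")

# Parse the replacement so it is inserted as real tags, not as escaped text
new_table = BeautifulSoup(new_table_html, "html.parser").table

# Replace the existing <table> (and all of its content) in one step
soup.table.replace_with(new_table)

result = str(soup)
print(result)
```

Because the replacement is already a tag, no `&lt;`/`&gt;` sequences appear in the output, and any legitimate entities elsewhere on the page are left alone.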
      

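      I'm new to this, but after a lot of searching I just combined a bunch of basics from the documentation... It's a bit bloated, but so is handling billions of data cells. This works for me.

      【Comments】: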
