【Question title】: beautifulsoup .get_text() is not specific enough for my HTML parsing
【Posted】: 2015-10-06 09:07:24
【Question】:

Given the HTML below, I want to output only the h1's own text, not "Details about", which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

What I want:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

This is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

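This is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

Note: I don't want to simply truncate the string, because I'd like this code to be reusable. Ideally, the code would strip out any text enclosed in a span.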

【Comments】:

    Tags: python html regex beautifulsoup


    【Solution 1】:

    One solution is to re-parse each child and check whether it contains HTML:

    from bs4 import BeautifulSoup
    
    html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
    soup = BeautifulSoup(html, 'html.parser')
    
    for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
        for content in line.contents:
            # Re-parse the child; find() returns a tag only if the child contains markup
            if BeautifulSoup(str(content), "html.parser").find() is not None:
                continue
    
            print(content)
    

    Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:

    import bs4
    
    html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    
    for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
        for content in line.contents:
            # Skip child tags (such as the span); only plain text nodes are printed
            if isinstance(content, bs4.element.Tag):
                continue
    
            print(content)
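
Since the question asks for something reusable, the isinstance check above could be wrapped in a small helper. The following is only a sketch: the function name `direct_text` and the `HTML` constant are made up for illustration, and it assumes that only the text sitting directly inside the h1 is wanted (text inside any nested tag, not just span, is skipped).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup taken from the question
HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  &nbsp;</span>'
    'New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black'
    '</h1>'
)


def direct_text(tag):
    """Hypothetical helper: join only the strings that are direct children of `tag`."""
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


soup = BeautifulSoup(HTML, "html.parser")
titles = [direct_text(h1) for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]
print(titles)
```

Checking for NavigableString rather than "not a Tag" is an equivalent test here; either way the soup itself is left unmodified.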
    

    【Comments】:

      【Solution 2】:

      You can use extract() to remove all span tags:

      for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
          # Remove every span from the heading; note that this modifies the soup
          for s in line('span'):
              s.extract()
          print(line.get_text())
      # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
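
One thing to keep in mind is that extract() changes the parsed tree, so the span is gone for any later code that reads the same soup. If that matters, a possible variation is to remove the tags from a copy instead. This is a sketch only: `text_without` and its `names` parameter are hypothetical names, and it assumes a Beautiful Soup version in which `copy.copy()` on a tag returns a detached copy of that tag and its children.

```python
import copy

from bs4 import BeautifulSoup

# Sample markup taken from the question
HTML = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  &nbsp;</span>'
    'New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black'
    '</h1>'
)


def text_without(tag, names=("span",)):
    """Hypothetical helper: text of `tag` with the named child tags removed.

    Works on a copy, so the original tree keeps its span elements.
    """
    clone = copy.copy(tag)
    for unwanted in clone.find_all(list(names)):
        unwanted.decompose()
    return clone.get_text().strip()


soup = BeautifulSoup(HTML, "html.parser")
h1 = soup.find("h1", attrs={"itemprop": "name"})
title = text_without(h1)
print(title)
```

Passing a different tuple, for example `names=("span", "small")`, would drop text from those tags as well.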
      

      【Comments】:
