【Question title】: How to make BeautifulSoup "understand" the plus HTML entity
【Posted】: 2019-09-03 03:25:32
【Question description】:

Suppose we have an HTML file like this:

test.html

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>

If I pass this HTML to BeautifulSoup, it escapes the & sign of the plus entity, and the output HTML looks like this:

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &amp;plus 4 = 6<br>
2 &lt; 4 = True
</div>

For example, with this Python 3 code:

from bs4 import BeautifulSoup

with open('test.html', 'rb') as file:
    soup = BeautifulSoup(file, 'html.parser')

print(soup)

How can I avoid this behavior?

【Comments on the question】:

    Tags: python html python-3.x beautifulsoup html-parsing


    【Solution 1】:

    Read the notes on the different parser libraries: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    Switching to the html5lib parser should solve your problem:

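    from bs4 import BeautifulSoup

    # The html5lib parser is a separate package (pip install html5lib).
    s = '''
    <div>
    <i>Some text here.</i>
    Some text here also.<br>
    2 &plus; 4 = 6<br>
    2 &lt; 4 = True
    </div>'''
    
    # Parse with html5lib instead of html.parser
    soup = BeautifulSoup(s, 'html5lib')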
    

    You will get:

    >>> soup
    
    <html><head></head><body><div>
    <i>Some text here.</i>
    Some text here also.<br/>
    2 + 4 = 6<br/>
    2 &lt; 4 = True
    </div></body></html>
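
If installing html5lib is not an option, one possible workaround is to expand the affected entities before the markup reaches BeautifulSoup. The sketch below rests on an assumption that is not confirmed in this thread: that the `html.parser` builder in the installed bs4 version only recognizes the older HTML 4 entity names, which do not include `plus`. It uses only the standard library's `html.entities` tables; the helper name `expand_html5_only_entities` is hypothetical and chosen for illustration.

```python
import re
from html.entities import html5, name2codepoint

# Matches named character references such as &plus; (semicolon required).
_NAMED_REF = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

# Characters that must stay escaped so the markup keeps its meaning.
_UNSAFE = set('<>&"\'')


def expand_html5_only_entities(markup):
    """Expand named entities found in the HTML5 table but not the HTML 4 one.

    Hypothetical helper: entities from the HTML 4 table (e.g. &lt;) are left
    untouched, as is anything that would expand to a markup-significant
    character.
    """
    def _replace(match):
        name = match.group(1)
        if name in name2codepoint:          # known HTML 4 entity: keep as-is
            return match.group(0)
        char = html5.get(name + ';')        # HTML5 table keys end with ';'
        if char is None or any(c in _UNSAFE for c in char):
            return match.group(0)           # unknown or unsafe: keep as-is
        return char
    return _NAMED_REF.sub(_replace, markup)


# 'plus' is absent from the HTML 4 table but present in the HTML5 one.
print('plus' in name2codepoint, 'plus;' in html5)

cleaned = expand_html5_only_entities('2 &plus; 4 = 6<br>2 &lt; 4 = True')
print(cleaned)
```

The cleaned string could then be passed to `BeautifulSoup(cleaned, 'html.parser')` as in the question. If html5lib is available, switching parsers as shown above is the simpler route.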
    

    【Discussion】:
