【Question title】: BeautifulSoup - stripping out non-breaking spaces from HTML
【Posted】: 2021-01-21 14:08:25
【Question description】:

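I am trying to scrape the Risk Factors sections of many 10-K filings, for example https://www.sec.gov/Archives/edgar/data/1321502/000143774910004615/andain_10k-123106.htm

One problem is that I am trying to match some strings exactly (e.g. "RISK FACTORS"), but sometimes there are several non-breaking spaces between the words.

I was hoping I could simply strip them out, since they are of no use to me, so I tried: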

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1321502/000143774910004615/andain_10k-123106.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text.replace("\xa0"," "), 'html.parser')

Then I searched the soup (in the usual way) to test whether it had worked:

soup.find_all(string="ITEM 1A.\xa0\xa0RISK FACTORS")

But the output still contains the non-breaking spaces, which it should not:

Out[42]: ['ITEM 1A.\xa0\xa0RISK FACTORS']

What am I doing wrong?

【Question discussion】:

    Tags: python html beautifulsoup
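
A possible explanation for the behaviour in the question, offered as a guess since the raw source of the filing was not inspected here: if the page writes its non-breaking spaces as `&nbsp;` entities, the raw text contains no `\xa0` character at all, so `page.text.replace("\xa0", " ")` has nothing to replace. The `\xa0` only appears once the entities are decoded, which the HTML parser does. The sketch below reproduces that with the standard library alone; the `raw` string is a made-up stand-in for the real markup.

```python
import html

# Hypothetical stand-in for the raw page source. That the real filing
# uses &nbsp; entities is an assumption, not something verified here.
raw = "<font>ITEM 1A.&nbsp;&nbsp;RISK FACTORS</font>"

# The raw markup holds the entity "&nbsp;", not the character "\xa0",
# so replacing "\xa0" before parsing leaves the string unchanged.
pre_cleaned = raw.replace("\xa0", " ")
print(pre_cleaned == raw)

# Decoding the entities (an HTML parser does the same) produces "\xa0".
decoded = html.unescape(raw)
print("\xa0" in decoded)

# Replacing after decoding is what actually removes them.
post_cleaned = decoded.replace("\xa0", " ")
print(repr(post_cleaned))
```

If this is indeed the cause, the replacement has to happen on the parsed text rather than on `page.text`.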


    【Solution 1】:

    Try this:

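    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.sec.gov/Archives/edgar/data/1321502/000143774910004615/andain_10k-123106.htm'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Take the text of every <font> tag that starts with "ITEM" and replace
    # the non-breaking spaces in the extracted text, i.e. after parsing.
    cleaned_up = [
        i.get_text(strip=True).replace("\xa0", " ")
        for i in soup.find_all("font") if i.get_text().startswith("ITEM")
    ]
    # Index 1 is the second matching heading found on this page.
    print(cleaned_up[1])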
    
    

    Output:

    ITEM 1A.  RISK FACTORS
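
The list comprehension above cleans the extracted strings but leaves the soup itself unchanged, so `soup.find_all(string=...)` with ordinary spaces would still find nothing. Two alternatives are sketched below on a small inline document (the `html_doc` string is invented for illustration and assumes the page uses `&nbsp;` entities): matching with a whitespace-tolerant regular expression, and rewriting the text nodes in place so that an exact match works afterwards.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Invented sample markup; the real filing is assumed to look similar.
html_doc = "<p><font>ITEM 1A.&nbsp;&nbsp;RISK FACTORS</font></p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Option 1: leave the tree alone and match loosely. In Python 3 str
# patterns, \s also matches the non-breaking space "\xa0".
pattern = re.compile(r"ITEM\s+1A\.\s+RISK\s+FACTORS")
matches = soup.find_all(string=pattern)
print(len(matches))

# Option 2: replace "\xa0" inside the parsed tree. find_all returns a
# list, so replacing nodes while looping over it is safe. Comments and
# other special string types are skipped by the exact type check.
for node in soup.find_all(string=True):
    if type(node) is NavigableString and "\xa0" in node:
        node.replace_with(NavigableString(node.replace("\xa0", " ")))

# An exact match with two ordinary spaces now succeeds.
exact = soup.find_all(string="ITEM 1A.  RISK FACTORS")
print(exact)
```

Option 1 is the lighter choice when only a few headings need to be located; Option 2 suits cases where the cleaned tree is searched many times.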
    
