【Question title】: Remove blanks from beautifulsoup get_text
【Posted】: 2018-02-02 21:27:48
【Question】:

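I'm cobbling something together to get clean text from a website using beautifulsoup's get_text. In the past I found it often came back with things I didn't need, so I started trying to make the output as clean as possible. My problem is that the returned content contains a lot of blank values. My code is below: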

import requests
from bs4 import BeautifulSoup

def GetPageText():
    for page in GetTeamLinks():  # GetTeamLinks() is not shown in the question; it returns the page URLs
        headers = {'User-Agent': 'Mozilla/5.0'}  # some websites look for these sorts of headers to make sure you're not a bot
        response = requests.get(page, verify=False, headers=headers)  # go to each of the websites in the domain list
        soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML and call the result "soup"
        for tag in soup(["script", "style", "a", "nav", "footer"]):  # find every script, style, a, nav and footer tag
            tag.extract()  # rip it out
        full_text = str(soup.get_text().splitlines()).strip()  # set the variable 'full_text' as the text we get back
    return(full_text)

What comes back looks like this (this is a sample scraped from https://www.nutmeg.com/about/executive-team):

['', '', '', '', '', '', '', '', '', 'Executive team | Nutmeg - Nutmeg', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '      
Executive team', '', '  ', '', '', '', 'The Nutmeg executive team', '', '', 
'', '', '', '', '', '', '', '', 'Martin Stead ', 'Chief executive officer', 
'', 'Martin joined Nutmeg in 2015. He has a range of experience running and 
jointly-running...........]

I want to get rid of the

 '', '', '', '',

values.

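I tried treating full_text as a list, then iterating over it and removing every value shorter than 2 characters. However, that doesn't seem to work in my for statement, because it doesn't recognise full_text.

Any help would be much appreciated. I have searched but couldn't find an answer. If there is something similar on here already, please point me in its direction.

Many thanks

Rob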

【Comments】:

  • Have you tried .get_text(" ", strip=True)?
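
To illustrate what the comment is suggesting, here is a minimal sketch. The HTML string is a made-up stand-in for the real page, so treat the markup as an assumption; the only thing being shown is how the separator argument and strip=True change what get_text returns.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (an assumption for illustration).
html = (
    "<div>\n\n"
    "<h1>  Executive team </h1>\n"
    "<p>Martin Stead </p>\n\n"
    "<p>Chief executive officer</p>\n"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each piece of text and skips the ones that end up empty;
# the first argument is the separator placed between the remaining pieces.
text = soup.get_text(" ", strip=True)

# Using a newline as the separator gives one entry per piece after splitlines().
lines = soup.get_text("\n", strip=True).splitlines()

print(text)
print(lines)
```

With a newline separator the result can still be split into a list, but without the blank entries.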

Tags: python-3.x beautifulsoup


【Solution 1】:

I hope I've understood your question. You can get rid of the empty values with a list comprehension:

my_list = ['', '', '', 'Executive team | Nutmeg - Nutmeg']

new_list = [i for i in my_list if i != '']

print(new_list)
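
One thing a comparison against '' will not catch is an entry that contains only spaces, such as the '  ' in your sample output. A possible variant, sketched here on a hand-written list shaped like your output (raw_lines and clean_lines are just illustrative names), strips each entry and keeps it only if something is left. Note that it has to run on the list returned by splitlines(), before that list is wrapped in str().

```python
# Hand-written sample in the shape of the question's output,
# including a whitespace-only entry and entries with padding.
raw_lines = [
    '', '', 'Executive team | Nutmeg - Nutmeg', '',
    '      Executive team', '', '  ', '',
    'Martin Stead ', 'Chief executive officer', '',
]

# line.strip() is '' (falsy) for empty and whitespace-only entries,
# so those are dropped; the kept entries lose their padding as well.
clean_lines = [line.strip() for line in raw_lines if line.strip()]

print(clean_lines)
```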

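I don't know what you want to do with the data afterwards, but it seems easier to scrape specifically for the data you need, so that you know exactly what you have.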
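
As a rough sketch of what scraping specifically could look like: the tag names, the team-member and role classes, and the second person below are all hypothetical placeholders, since I haven't checked the real page's markup. You would need to inspect the page and adjust the lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's tags and class names may differ,
# and "Example Person" is a placeholder entry, not real data.
html = """
<div class="team-member">
  <h3>Martin Stead </h3>
  <p class="role">Chief executive officer</p>
</div>
<div class="team-member">
  <h3>Example Person</h3>
  <p class="role">Example role</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

team = []
for member in soup.find_all("div", class_="team-member"):
    # Pull out only the pieces we care about and strip the padding.
    name = member.find("h3").get_text(strip=True)
    role = member.find("p", class_="role").get_text(strip=True)
    team.append((name, role))

print(team)
```

Because only the targeted elements are read, there are no blank lines to clean up afterwards.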
