Remove blanks from beautifulsoup get_text

【Question title】: Remove blanks from beautifulsoup get_text
【Posted】: 2018-02-02 21:27:48
【Question】:

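I'm cobbling something together that uses beautifulsoup get_text to pull clean text from websites. In the past I found it often came back with things I didn't need, so I've been trying to make it as clean as possible. My problem is that the returned content contains a lot of blank values. My code is below: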

def GetPageText():
    for page in GetTeamLinks():
        headers = {'User-Agent': 'Mozilla/5.0'} # some websites look for these sorts of headers to make sure you're not a bot
        response = requests.get(page, verify=False, headers=headers) ##go to each of the websites in the domain list
        soup = BeautifulSoup(response.text, "html.parser") # sets "soup" as their variable name
        for script in soup(["script", "style","a","nav", "footer"]): #find everything in the script or style tags
            script.extract()    # rip it out
        full_text = str(soup.get_text().splitlines()).strip() #set the variable 'full_text' as the text we get back
    return(full_text)

What comes back looks like this (an example scraped from https://www.nutmeg.com/about/executive-team):

['', '', '', '', '', '', '', '', '', 'Executive team | Nutmeg - Nutmeg', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '      
Executive team', '', '  ', '', '', '', 'The Nutmeg executive team', '', '', 
'', '', '', '', '', '', '', '', 'Martin Stead ', 'Chief executive officer', 
'', 'Martin joined Nutmeg in 2015. He has a range of experience running and 
jointly-running...........]

I want to get rid of the

 '', '', '', '',

values.

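I tried treating full_text as a list, then looping through it and removing every value shorter than 2 characters. However, that doesn't seem to work in my for statement, because it doesn't recognise full_text.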

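Any help would be greatly appreciated. I have searched but couldn't find an answer. If there is something similar on here already, please point me in its direction.

Many thanks

Rob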

【Comments】:

Have you tried .get_text(" ", strip=True)?
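
A rough sketch of what the comment above suggests: passing a separator and strip=True to get_text strips each text fragment and skips the ones that end up empty, and soup.stripped_strings yields the same fragments one by one if a list is more useful than a single string. The HTML here is a made-up stand-in for the real page, not its actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only loosely modelled on the page in the question.
html = """
<html><head><title>Executive team | Nutmeg - Nutmeg</title>
<script>var x = 1;</script></head>
<body>
  <nav><a href="/">Home</a></nav>
  <h1>   Executive team</h1>

  <p>Martin Stead </p>
  <p>Chief executive officer</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Same cleanup step as in the question: drop tags whose text is not wanted.
for tag in soup(["script", "style", "a", "nav", "footer"]):
    tag.extract()

# One string: fragments are stripped, empty ones skipped, joined with a space.
text = soup.get_text(" ", strip=True)

# One list: the same stripped, non-empty fragments as separate items.
lines = list(soup.stripped_strings)

print(text)
print(lines)
```

With either form there is no need to call splitlines() and filter afterwards, since whitespace-only fragments never make it into the result.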

Tags: python-3.x beautifulsoup

【Solution 1】:

I hope I have understood your question. You can get rid of the empty values with a list comprehension:

my_list = ['', '', '', 'Executive team | Nutmeg - Nutmeg']

new_list = [i for i in my_list if i != '']

print(new_list)
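
One caveat with comparing against '' only: the sample output in the question also contains whitespace-only entries such as '  ' and padded entries such as '      Executive team', and those would survive the filter. Also, in the question's code full_text is wrapped in str(...), which turns the list into one long string; that would explain why looping over it as a list did not behave as expected. The sketch below assumes the str() wrapper is dropped, and raw_text is a made-up sample standing in for soup.get_text().

```python
# Hypothetical stand-in for the string returned by soup.get_text().
raw_text = (
    "\n\n\nExecutive team | Nutmeg - Nutmeg\n\n"
    "      Executive team\n  \n\n"
    "Martin Stead \nChief executive officer\n"
)

# splitlines() already returns a list; wrapping it in str() would undo that.
lines = raw_text.splitlines()

# Strip each line and keep it only if something is left after stripping.
cleaned = [line.strip() for line in lines if line.strip()]

print(cleaned)
```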

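I don't know what you want to do with the data afterwards, but it seems easier to scrape specifically for the data you need, so that you know what you have.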
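
As an illustration of that last point, here is a minimal sketch that asks only for the tags expected to hold the wanted text instead of cleaning up the whole page afterwards. The markup and the choice of h2, h3 and p are assumptions made for the example; the real page would need to be inspected to find the right tags or classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not known here.
html = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <h2>The Nutmeg executive team</h2>
  <div class="person">
    <h3>Martin Stead </h3>
    <p>Chief executive officer</p>
  </div>
  <footer>Contact us</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Ask only for the tags assumed to carry the content of interest.
wanted = soup.find_all(["h2", "h3", "p"])

# get_text with strip=True trims each result, so no blank entries appear.
texts = [tag.get_text(" ", strip=True) for tag in wanted]

print(texts)
```

Because nav and footer are never selected, there is nothing to extract and nothing blank to filter out.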
