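[Posted]: 2018-02-02 21:27:48
[Question]:
I'm cobbling something together, trying to use BeautifulSoup's get_text to get clean text from a website. In the past I've found it often comes back with things I don't need, so I've been trying to make the result as clean as possible. My problem is that the returned content contains a lot of empty values. My code is as follows: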
import requests
from bs4 import BeautifulSoup

def GetPageText():
    for page in GetTeamLinks():  # GetTeamLinks() is defined elsewhere and yields the page URLs
        headers = {'User-Agent': 'Mozilla/5.0'}  # some websites look for these sorts of headers to make sure you're not a bot
        response = requests.get(page, verify=False, headers=headers)  # go to each of the websites in the domain list
        soup = BeautifulSoup(response.text, "html.parser")  # parse the page into "soup"
        for script in soup(["script", "style", "a", "nav", "footer"]):  # find every script, style, a, nav and footer tag
            script.extract()  # rip it out
        full_text = str(soup.get_text().splitlines()).strip()  # set 'full_text' to the text we get back
        return full_text
What comes back looks like this (an example scraped from https://www.nutmeg.com/about/executive-team):
['', '', '', '', '', '', '', '', '', 'Executive team | Nutmeg - Nutmeg',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '
Executive team', '', ' ', '', '', '', 'The Nutmeg executive team', '', '',
'', '', '', '', '', '', '', '', 'Martin Stead ', 'Chief executive officer',
'', 'Martin joined Nutmeg in 2015. He has a range of experience running and
jointly-running...........]
I want to get rid of the
'', '', '', '',
values.
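I tried treating full_text as a list, then looping over it and removing every value shorter than 2 characters. However, that doesn't seem to work in my for statement, as it doesn't recognise full_text.
Any help would be greatly appreciated. I have searched but couldn't find an answer. If there is something similar on here, please point me in its direction.
Many thanks
Rob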
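Judging only from the code shown, one likely reason the loop misbehaves is that full_text is wrapped in str(...), so it is a single string rather than a list (and it only exists inside GetPageText). A minimal sketch of the list-filtering idea, assuming the list is kept as a list; raw_lines is made-up sample data standing in for soup.get_text().splitlines(), and clean_lines is a hypothetical name:

```python
# Made-up sample standing in for soup.get_text().splitlines()
raw_lines = [
    '', '', 'Executive team | Nutmeg - Nutmeg', '', ' ',
    'The Nutmeg executive team', '', 'Martin Stead ',
    'Chief executive officer', '',
]

# Keep the list as a list (no str() around it), strip each entry,
# and drop the entries that are empty once stripped.
clean_lines = [line.strip() for line in raw_lines if line.strip()]

print(clean_lines)
```

Filtering on line.strip() rather than on a length threshold keeps short but real entries, while still discarding entries that contain only whitespace.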
[Comments]:
-
Have you tried .get_text(" ", strip=True)?
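A minimal sketch of what this comment suggests, assuming BeautifulSoup 4 is installed. SAMPLE_HTML and the helper name clean_text are made up for illustration and are not from the original post. With a separator and strip=True, get_text strips each text fragment and skips the ones that end up empty, so no blank entries are produced:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><head><title>Executive team</title><script>var x = 1;</script></head>
<body>
  <nav><a href="/">Home</a></nav>
  <h1>  The Nutmeg executive team </h1>

  <p>Martin Stead </p>
  <p>Chief executive officer</p>
  <footer>Footer text</footer>
</body></html>
"""

def clean_text(html):
    """Return the visible text of html as one space-separated string."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the same tags as in the question's code.
    for tag in soup(["script", "style", "a", "nav", "footer"]):
        tag.extract()
    # The separator joins the fragments; strip=True trims each fragment
    # and drops the ones that are empty after trimming.
    return soup.get_text(" ", strip=True)

text = clean_text(SAMPLE_HTML)
print(text)
```

If one entry per line is preferred over a single string, passing "\n" as the separator and calling splitlines() on the result is one option.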