You can iterate over the contents of the div (soup.div.contents) and combine the pieces with str.join:
import bs4
html = '''<div>Some TEXT with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)
Output:
'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
Edit: ignoring br tags:
html = '''<div>Some TEXT <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents
                 if getattr(i, 'name', None) != 'br')
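The one-liner above assumes every non-text child is an a tag with an href attribute; any other tag, or a link without href, would raise a KeyError. A minimal sketch of the same flat join written as a loop, assuming you want such tags reduced to their plain text (the names flatten_links and sample, and the example.com URL, are made up for illustration):

```python
import bs4

def flatten_links(html):
    """Join the direct children of the first <div>, appending each link's href."""
    div = bs4.BeautifulSoup(html, 'html.parser').div
    parts = []
    for node in div.contents:
        if isinstance(node, bs4.element.NavigableString):
            parts.append(str(node))              # plain text node
        elif node.name == 'br':
            continue                             # skip line breaks
        elif node.name == 'a' and node.get('href') is not None:
            parts.append(f'{node.get_text()} ({node["href"]})')
        else:
            parts.append(node.get_text())        # other tags, or <a> without href

    return ''.join(parts)

sample = '<div>See <a href="https://example.com">the docs</a><br> or <b>ask</b> <a>here</a>.</div>'
print(flatten_links(sample))
# See the docs (https://example.com) or ask here.
```

Using node.get('href') instead of node["href"] is what lets the link without an href fall through to the plain-text branch.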
Edit 2: a recursive solution:
def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        yield s
    elif s.name == 'a':
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        for i in getattr(s, 'contents', []):
            yield from form_text(i)
html = '''<div>Some TEXT <i>other text in </i> <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output (note the runs of extra spaces):
Some TEXT  other text in     with  some LINK (https// - actual Link)  and some continuing TEXT with   following  some LINK (https//- next Link)  inside.
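Because of the br tags, whitespace-only text nodes, and the ' ' separator, the result can contain runs of extra whitespace. To collapse them, you can use re.sub: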
import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
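The recursive walk and the whitespace cleanup can be wrapped into a single helper. This is only a sketch under a few assumptions: the names iter_text and html_to_text are hypothetical, links without an href are treated like any other tag, and leading and trailing whitespace is stripped from the result.

```python
import re
import bs4

def iter_text(node):
    """Yield text pieces from node, rewriting <a href=...> tags as 'text (href)'."""
    if isinstance(node, bs4.element.NavigableString):
        yield str(node)
    elif node.name == 'a' and node.get('href') is not None:
        yield f'{node.get_text(strip=True)} ({node["href"]})'
    else:
        # Any other tag (including <br>, which has no children): recurse.
        for child in node.contents:
            yield from iter_text(child)

def html_to_text(html):
    """Parse html, flatten it with iter_text, and collapse whitespace runs."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    joined = ' '.join(iter_text(soup))
    return re.sub(r'\s+', ' ', joined).strip()

print(html_to_text('<div> Some <i>nested <a href="https://example.com">LINK</a></i><br> text. </div>'))
# Some nested LINK (https://example.com) text.
```

One trade-off of joining with ' ': punctuation that directly follows a tag ends up separated from it by a space, for example 'LINK (https://example.com) .'. Joining with '' avoids that, but then adjacent elements with no whitespace between them run together.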