python beautifulsoup: replace links with url in string

【Question title】: python beautifulsoup: replace links with url in string
【Posted】: 2019-09-03 07:47:35
【Question】:

In a string containing HTML, I have several links that I want to replace with their plain href values:

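from bs4 import BeautifulSoup

# HTML string whose links should be replaced by their href values
html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all()
for tag in tags:
    if tag.has_attr('href'):
        # try to swap the serialized tag for its href in the original string
        html = html.replace(str(tag), tag['href'])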

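Unfortunately, this causes a couple of problems:

The tags in html use single quotes ('), but BeautifulSoup's str(tag) produces a tag with double quotes (<a href="www.google.com">foo</a>), so replace() finds no match.
<br> is serialized as <br/>, so again replace() finds no match.

So Python's replace() method does not seem to give reliable results here.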
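
The mismatch described above can be reproduced in a few lines. This is only a minimal sketch using a shortened version of the string from the question; the variable names first_link and line_break are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question (single-quoted attribute, bare <br>)
html = "<a href='www.google.com'>foo</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")

first_link = str(soup.find("a"))   # serialized with double quotes
line_break = str(soup.find("br"))  # serialized as a self-closing tag

print(first_link)
print(line_break)
print(first_link in html)  # False: the serialized tag no longer appears in the source string
print(line_break in html)  # False for the same reason
```

Because the serialized form differs from the source text, any approach based on str.replace() has to work on the parsed tree instead of the raw string.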

Is there a way to replace a tag with a string using BeautifulSoup's own methods?

Edit:

Additional info: the value of str(tag) = <a href="www.google.com">foo</a>

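【Comments】:

' and " are interchangeable (as long as you close the string with the same character you opened it with), and both delimit strings.
The problem is the quotes inside the string.
For me, BS only prints 'href', not "href".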

Tags: python beautifulsoup

【Solution 1】:

Relevant part of the documentation: Modifying the tree

html="""
<html><head></head>
<body>
<a href="www.google.com">foo</a> some text 
<a href="www.bing.com">bar</a> some <br> text
</body></html>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for a_tag in soup.find_all('a'):
    a_tag.string = a_tag.get('href')
print(soup)

Output

<html><head></head>
<body>
<a href="www.google.com">www.google.com</a> some text 
<a href="www.bing.com">www.bing.com</a> some <br/> text
</body></html>
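
The code above changes only the link text and leaves the <a> elements in place. If the whole tag should be replaced by its URL, as the question asks, one possible variation is to call replace_with() on each link. This is a sketch under that assumption, not part of the original answer; it reuses the string from the question and stores the output in a made-up variable named result.

```python
from bs4 import BeautifulSoup

# The HTML string from the question
html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")

# href=True restricts the search to <a> tags that actually carry an href attribute
for a_tag in soup.find_all("a", href=True):
    # Swap the whole element for a plain text node holding the URL
    a_tag.replace_with(a_tag["href"])

result = str(soup)
print(result)
```

Note that the remaining markup is still re-serialized by BeautifulSoup, so <br> comes back as <br/>.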

【Discussion】:

This is a good approach and it solved my problem. In the end I only needed to use soup.text to get the desired result.
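
For completeness, the combination mentioned in this comment might look like the following. It is a sketch that assumes the goal is plain text with all remaining markup dropped; get_text() is the method form of soup.text, and plain_text is a made-up variable name.

```python
from bs4 import BeautifulSoup

# The HTML string from the question
html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")

# Step 1 (from the answer): overwrite each link's text with its href
for a_tag in soup.find_all("a"):
    a_tag.string = a_tag.get("href")

# Step 2 (from the comment): keep only the text, which drops <a> and <br> alike
plain_text = soup.get_text()
print(plain_text)
```

Since <br> contributes no text, the whitespace on either side of it ends up adjacent in the result.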