How do I work with output data from BeautifulSoup? Specifically, how do I split the URL and anchor?

【Question title】: How do I work with output data from BeautifulSoup? Specifically, how do I split the URL and anchor?
【Posted】: 2019-12-25 12:06:01
【Question description】:

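I have this test code that extracts backlinks from web pages. However, I haven't found a good way to extract the URL and the anchor text specifically, along with the tag's other attributes.

Let me explain in more detail. Suppose I have three web pages to check: site.com/a/, site.com/b/ and site.com/c/. For each page, my code produces the following output: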

1. [<a data-wpel-link="external" href="https://example.com/" rel="nofollow" target="_blank">anchor-example-1</a>]

2. [<a href="\'https://example.com/\'" rel="\'nofollow\'" target="\'_blank\'">anchor-example-2</a>]

3. [<a href="https://example.com/" rel="nofollow">anchor-example-3</a>]

What is the best way to split this up so that I get output like the following, taking example #1?

Linked URL: https://example.com/
Anchor: anchor-example-1
Rel: nofollow

Also, as example #2 shows, some sites seem to add junk (?) to their markup:

href="\'https://example.com/\'"

How do I get rid of things like \' and other characters that can sometimes corrupt the output data?

from bs4 import BeautifulSoup
import requests
import re

with open('input.txt') as input_data:
    for line in input_data:
        check_url = line.rstrip('\n')
        data = requests.get(check_url, headers={'User-Agent': 'Mozilla/5.0'})
        data.encoding = 'ISO-8859-1'
        soup = BeautifulSoup(str(data.content), 'html.parser')
        backlink = soup.find_all('a', attrs={'href': re.compile('example.com')})
        print('Backlink: ', backlink, '\n')

Thanks in advance, and happy holidays!

【Question comments】:

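Those characters in the URLs aren't actually junk. Look up escape characters to understand why this happens. In any case, there are plenty of examples of what you're trying to do. Look some of them up and experiment. Post your attempt and where you got stuck, and someone will help you.
Identify those characters and remove them with str.replace().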
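The escape characters mentioned in the comments most likely come from `str(data.content)` in the question's code: calling `str()` on a `bytes` object returns its printable representation (`b'...'`), in which inner single quotes are written as `\'` when the data contains both quote styles. The sketch below reproduces this without a network request. The `raw` bytes are a made-up stand-in for `data.content`, not output from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for data.content: raw bytes containing both quote styles
raw = b"<a href='https://example.com/' rel='nofollow'>anchor-example-2</a><p class=\"x\">text</p>"

# str() on bytes returns the repr, b'...', with inner single quotes escaped as \'
broken = BeautifulSoup(str(raw), "html.parser").find("a")

# Passing the bytes themselves lets the parser see the real markup
fixed = BeautifulSoup(raw, "html.parser").find("a")

print(broken["href"])  # still carries the backslash-quote wrapper
print(fixed["href"])
```

In the question's loop, the equivalent change would be to pass `data.content` (or `data.text`) to `BeautifulSoup` directly instead of wrapping it in `str()`.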

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

Try something like this:

from bs4 import BeautifulSoup

html = """<a data-wpel-link="external" href="https://example.com/" rel="nofollow" target="_blank">anchor-example-1</a>
          <a href="\'https://example.com/\'" rel="\'nofollow\'" target="\'_blank\'">anchor-example-2</a>
          <a href="https://example.com/" rel="nofollow">anchor-example-3</a>
       """
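# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

for n in soup.find_all('a'):
    print('Linked : ' + n.get('href'))
    # rel is a multi-valued attribute, so bs4 returns it as a list
    print('Rel : ' + ' '.join(n.get('rel')))
    print('Anchor : ' + n.text)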

Result:

Linked : https://example.com/
Rel : nofollow
Anchor : anchor-example-1
Linked : 'https://example.com/'
Rel : 'nofollow'
Anchor : anchor-example-2
Linked : https://example.com/
Rel : nofollow
Anchor : anchor-example-3

【Discussion】: