How do I exclude certain beautifulsoup results that I don't want? Answers

【Question title】: How do I exclude certain beautifulsoup results that I don't want?
【Posted】: 2021-02-12 06:25:33
【Question description】:

I'm having trouble excluding some of the results that my Beautiful Soup program returns. Here is my code:

from bs4 import BeautifulSoup
import requests

URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

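I don't want to get results that start with "#", for example: #cite_ref-18

I tried filtering them out with a for loop, but got the following error message: KeyError: 0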


Tags: python beautifulsoup hyperlink python-requests screen-scraping

【Solution 1】:

You can use the str.startswith() method to skip any href that begins with "#":

from bs4 import BeautifulSoup
import requests

URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

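for tag in soup.find_all('a'):
    # .get() returns None when an <a> tag has no href attribute
    link = tag.get('href')
    # str() keeps startswith() from failing on None; skip in-page '#' anchors
    if not str(link).startswith('#'):
        print(link)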
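
One side effect of the str(link) conversion is that <a> tags without any href attribute are printed as None. A minimal sketch of a variant that drops those as well is shown here; filter_hrefs is a hypothetical helper name and the sample values are made up for illustration, so it runs without fetching the Wikipedia page:

```python
from typing import Iterable, List, Optional


def filter_hrefs(hrefs: Iterable[Optional[str]]) -> List[str]:
    """Hypothetical helper: keep hrefs that exist and do not start with '#'."""
    kept = []
    for href in hrefs:
        # A missing href arrives as None (what tag.get('href') returns); skip it.
        if href is None:
            continue
        # Skip in-page fragment links such as '#cite_ref-18'.
        if href.startswith('#'):
            continue
        kept.append(href)
    return kept


# Made-up sample values standing in for tag.get('href') results.
sample = ['#cite_ref-18', None, '/wiki/Main_Page', 'https://example.org/']
result = filter_hrefs(sample)
print(result)
```

With the soup object from this answer, the helper could be called as filter_hrefs(tag.get('href') for tag in soup.find_all('a')).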


【Solution 2】:

You can use the CSS selector a[href]:not([href^="#"]). It selects every <a> tag that has an href attribute, except those whose href value starts with the # character:

import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.select('a[href]:not([href^="#"])'):
    print(link['href'])
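
To see what the selector keeps and drops without making a network request, it can be tried on a small inline document. This is only a sketch: the HTML snippet and the variable names sample_html and hrefs are invented for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Made-up HTML covering the three cases: fragment link, normal links, no href.
sample_html = """
<a href="#cite_ref-18">footnote</a>
<a href="/wiki/Main_Page">main page</a>
<a name="top">anchor without href</a>
<a href="https://example.org/#frag">external link</a>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# a[href] requires the attribute to exist, so link['href'] cannot raise KeyError;
# :not([href^="#"]) drops hrefs that begin with '#'.
hrefs = [link['href'] for link in soup.select('a[href]:not([href^="#"])')]
print(hrefs)
```

A "#" that appears later in the URL, as in the last link, does not exclude it, because ^= only matches at the start of the attribute value.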
