【Question title】: How do I exclude certain BeautifulSoup results that I don't want?
【Posted】: 2021-02-12 06:25:33
【Question description】:

I'm having trouble excluding some of the results that my Beautiful Soup program returns. Here is my code:

from bs4 import BeautifulSoup
import requests

URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

I don't want results that start with "#", for example: #cite_ref-18

I tried filtering them out with a for loop, but got the following error message: KeyError: 0

【Question comments】:

    Tags: python beautifulsoup hyperlink python-requests screen-scraping


    【Solution 1】:

    You can use the str.startswith() method:

    from bs4 import BeautifulSoup
    import requests
    
    URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
    page = requests.get(URL)
    
    soup = BeautifulSoup(page.content, 'html.parser')
    
    for tag in soup.find_all('a'):
        link = tag.get('href')
        if not str(link).startswith('#'):
            print(link)
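
One caveat with the loop above: tag.get('href') returns None for <a> tags that have no href attribute, and str(None) is the string 'None', so those anchors are printed as None. If that is unwanted, a stricter filter might look like the sketch below. The helper name keep_href and the sample list are made up for illustration and are not part of the original answer.

```python
from typing import Optional


def keep_href(href: Optional[str]) -> bool:
    """Return True for hrefs that exist and are not in-page fragments."""
    # tag.get('href') yields None when the attribute is missing
    if href is None:
        return False
    # Fragment links such as '#cite_ref-18' start with '#'
    return not href.startswith('#')


# Hypothetical sample of values that tag.get('href') might return
sample = ['#cite_ref-18', '/wiki/Main_Page', None, 'https://www.mediawiki.org/']
kept = [h for h in sample if keep_href(h)]
print(kept)
```

In the scraping loop, this would replace the condition with `if keep_href(tag.get('href')):`.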
    

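    【Comments】:

      【Solution 2】:

      You can use the CSS selector a[href]:not([href^="#"]). This selects every <a> tag that has an href attribute, except those whose href value starts with the # character: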

      import requests
      from bs4 import BeautifulSoup
      
      URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
      page = requests.get(URL)
      
      soup = BeautifulSoup(page.content, 'html.parser')
      
      for link in soup.select('a[href]:not([href^="#"])'):
          print(link['href'])
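
To see what the selector matches without making a network request, here is a small sketch run against an inline HTML snippet. The snippet and its link targets are invented for illustration; only the selector comes from the answer above.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one fragment link, one normal link, one anchor without href
html = '''
<a href="#cite_ref-18">footnote</a>
<a href="/wiki/Main_Page">main page</a>
<a name="top">no href</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# a[href] requires the attribute; :not([href^="#"]) drops fragment links
hrefs = [a['href'] for a in soup.select('a[href]:not([href^="#"])')]
print(hrefs)
```

Because a[href] already excludes anchors that have no href attribute, indexing with a['href'] does not raise KeyError for the matched tags.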
      

      【Comments】:
