【Title】: Scraping all hrefs in concrete div types with Beautiful Soup
【Posted】: 2022-01-06 11:12:17
【Question】:

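I'm trying to build an application that lets people get a list of GitHub repositories related to a search keyword they provide. On the search results page, each repository sits in a div with a particular class: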

<div class="f4 text-normal">
      </div>

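How can I make Beautiful Soup iterate over all divs of this class on the page, and then over all the <a> tags inside them, to collect the hrefs?

So far I only know how to get the hrefs from every <a> on the page: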

import requests, sys, webbrowser, bs4

#variables
linkList = []


#handle input
print('Your GitHub repository search query:')
userInput = input()

#get the results from the URL

results = requests.get('https://github.com/search?q=' + userInput + '&type=repositories'
                       + ' '.join(sys.argv[1:]))
results.raise_for_status()

soup = bs4.BeautifulSoup(results.text, 'html.parser')

#find all the viable URLs

data = soup.find_all('a')

for aHref in data:
    if "href" in str(aHref):
        linkList.append(aHref)
        
        
print(linkList)  

  

【Comments】:

    Tags: python beautifulsoup


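    【Solution 1】:

    Note: Your selection is not very specific, so it will also pick up links you don't expect.

    Select the elements more specifically and get their href attribute with a list comprehension. A tag's attributes can be accessed by treating the tag like a dictionary --> aHref['href']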

    ['https://github.com' + a['href'] for a in soup.select('.repo-list-item .f4 a[href]')]
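
To see what the selector and the dictionary-style access do without hitting the network, here is a minimal sketch that runs against a small hand-written HTML string. The markup is a simplified assumption, not a copy of GitHub's real page, which may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real GitHub page may differ.
html = """
<ul>
  <li class="repo-list-item">
    <div class="f4 text-normal"><a href="/TheAlgorithms/Python">TheAlgorithms/Python</a></div>
    <a href="/topics/python">python</a>
  </li>
  <li class="repo-list-item">
    <div class="f4 text-normal"><a>no link here</a></div>
  </li>
</ul>
<a href="/login">Sign in</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: only <a> tags that carry an href attribute and sit
# inside an .f4 element, which itself sits inside a .repo-list-item element.
anchors = soup.select(".repo-list-item .f4 a[href]")

# A Tag behaves like a dictionary for its attributes.
hrefs = [a["href"] for a in anchors]
print(hrefs)

# For a tag without the attribute, a["href"] would raise KeyError,
# while .get() returns None instead.
bare = [a for a in soup.find_all("a") if not a.has_attr("href")][0]
print(bare.get("href"))
```

The topic link and the sign-in link are skipped because they are not inside an `.f4` element, and the anchor without an href is skipped by the `[href]` part of the selector.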
    

    Example

    import requests, sys, webbrowser, bs4
    
    print('Your GitHub repository search query:')
    userInput = input()
    
    results = requests.get('https://github.com/search?q=' + userInput + '&type=repositories'
                           + ' '.join(sys.argv[1:]))
    results.raise_for_status()
    
    soup = bs4.BeautifulSoup(results.text, 'html.parser')
    
    # each href already starts with '/', so the host is added without a trailing slash
    linkList = ['https://github.com' + a['href'] for a in soup.select('.repo-list-item .f4 a[href]')]
    

    Output

    ['https://github.com/TheAlgorithms/Python',
     'https://github.com/geekcomputers/Python',
     'https://github.com/walter201230/Python',
     'https://github.com/injetlee/Python',
     'https://github.com/kubernetes-client/python',
     'https://github.com/Show-Me-the-Code/python',
     'https://github.com/xxg1413/python',
     'https://github.com/jakevdp/PythonDataScienceHandbook',
     'https://github.com/joeyajames/Python',
     'https://github.com/docker-library/python']
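
The href values on the results page are site-relative paths that start with a slash, so they have to be combined with the host to get full URLs. One way to do that without counting slashes by hand is `urllib.parse.urljoin` from the standard library. This is only a sketch: `BASE_URL` is a name introduced here for illustration, and the sample paths are taken from the output above.

```python
from urllib.parse import urljoin

# Name introduced for this sketch; not part of the original answer.
BASE_URL = "https://github.com/"

# Sample site-relative paths, as they appear in the href attributes.
hrefs = ["/TheAlgorithms/Python", "/geekcomputers/Python"]

# urljoin resolves each path against the base URL, so the result has
# exactly one slash between the host and the path.
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```

In the scraper, the same call could replace the string concatenation inside the list comprehension.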
    

    【Comments】:

    • Thank you very much!
    【Solution 2】:

    This should get you very close.

    import requests, sys, webbrowser, bs4
    
    #variables
    linkList = []
    
    
    #handle input
    print('Your GitHub repository search query:')
    userInput = input()
    
    #get the results from the URL
    
    results = requests.get('https://github.com/search?q=' + userInput + '&type=repositories'
                           + ' '.join(sys.argv[1:]))
    results.raise_for_status()
    
    soup = bs4.BeautifulSoup(results.text, 'html.parser')
    
    divs = soup.find_all('div', {'class': 'f4 text-normal'})
    for div in divs:
        a_tags = div.find_all('a')
        for a_tag in a_tags:
            try:
                linkList.append(a_tag['href'])
            except KeyError:
                # skip <a> tags that have no href attribute
                continue
    
    # Test
    for link in linkList:
        print(link)
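
One detail worth knowing about this approach: when the class value passed to `find_all` contains a space, Beautiful Soup compares it against the exact string of the class attribute, so the order of the classes matters. A CSS selector such as `div.f4.text-normal` matches the classes in any order. The sketch below illustrates the difference on a hand-written HTML string, which is an assumption made for the demonstration and not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in both orders, plus a div
# that has only one of them.
html = """
<div class="f4 text-normal"><a href="/a/one">one</a></div>
<div class="text-normal f4"><a href="/b/two">two</a></div>
<div class="f4"><a href="/c/three">three</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the first div qualifies.
exact = soup.find_all("div", {"class": "f4 text-normal"})

# CSS class selectors: any div carrying both classes, in any order.
both = soup.select("div.f4.text-normal")

print(len(exact), len(both))

# Adding a[href] to the selector also removes the need for try/except,
# because anchors without an href are never selected.
hrefs = [a["href"] for a in soup.select("div.f4.text-normal a[href]")]
print(hrefs)
```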
        
    

    【Comments】:

    • Thanks @Fnatical. However, the other reply got me 100% of what I wanted, using a different approach.