Scraping all hrefs in concrete div types with Beautiful Soup

【Question title】: Scraping all hrefs in concrete div types with Beautiful Soup
【Posted】: 2022-01-06 11:12:17
【Question description】:

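I'm trying to create an application that lets people get a list of GitHub repositories related to a search keyword they provide. On the results page of a search query, each repository sits in a div with a specific class, namely: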

<div class="f4 text-normal">
      </div>

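How can I make Beautiful Soup iterate over all divs with this class on the page, and then over all the <a> tags inside them, to collect the hrefs?

At the moment I only know how to get all hrefs from every <a> on the page: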

import requests, sys, webbrowser, bs4

#variables
linkList = []


#handle input
print('Your GitHub repository search query:')
userInput = input()

#get the results from the URL

results = requests.get('https://github.com/search?q=' + userInput + '&type=repositories'
                       + ' '.join(sys.argv[1:]))
results.raise_for_status()

soup = bs4.BeautifulSoup(results.text, 'html.parser')

#find all the viable URLs

data = soup.find_all('a')

for aHref in data:
    if "href" in str(aHref):
        linkList.append(aHref)
        
        
print(linkList)

【Question comments】:

Tags: python beautifulsoup

【Solution 1】:

Note: Your selection is not very specific, so it will also find other links that you do not want.
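To illustrate the point, here is a minimal offline sketch. The markup is a hypothetical, simplified stand-in for a search results page (it is not GitHub's real HTML); only the class names `f4 text-normal` come from the question. It compares collecting every `<a>` on the page with collecting only those inside the repository divs.

```python
import bs4

# Hypothetical, simplified markup standing in for a search results page.
SAMPLE_HTML = """
<nav><a href="/login">Sign in</a></nav>
<ul>
  <li>
    <div class="f4 text-normal"><a href="/TheAlgorithms/Python">TheAlgorithms/Python</a></div>
    <a href="/topics/python">python</a>
  </li>
</ul>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Every link on the page, including navigation and topic links.
all_links = [a['href'] for a in soup.find_all('a', href=True)]

# Only links nested inside a div that carries both classes.
repo_links = [a['href'] for a in soup.select('div.f4.text-normal a[href]')]

print(all_links)   # ['/login', '/TheAlgorithms/Python', '/topics/python']
print(repo_links)  # ['/TheAlgorithms/Python']
```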

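Select your elements more specifically and get their href attribute via a list comprehension. You can access a tag's attributes by treating the tag like a dictionary --> aHref['href']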

['https://github.com' + a['href'] for a in soup.select('.repo-list-item .f4 a[href]')]

Example

import requests, bs4

print('Your GitHub repository search query:')
userInput = input()

# pass the query as params so requests URL-encodes it
results = requests.get('https://github.com/search',
                       params={'q': userInput, 'type': 'repositories'})
results.raise_for_status()

soup = bs4.BeautifulSoup(results.text, 'html.parser')

# each href already starts with '/', so the base URL has no trailing slash
linkList = ['https://github.com' + a['href'] for a in soup.select('.repo-list-item .f4 a[href]')]

Output

['https://github.com/TheAlgorithms/Python',
 'https://github.com/geekcomputers/Python',
 'https://github.com/walter201230/Python',
 'https://github.com/injetlee/Python',
 'https://github.com/kubernetes-client/python',
 'https://github.com/Show-Me-the-Code/python',
 'https://github.com/xxg1413/python',
 'https://github.com/jakevdp/PythonDataScienceHandbook',
 'https://github.com/joeyajames/Python',
 'https://github.com/docker-library/python']
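The hrefs on the results page are relative paths that begin with '/', so joining them to a base URL that also ends in '/' by plain string concatenation yields a doubled slash. As a small sketch of one alternative, `urllib.parse.urljoin` from the standard library resolves the path against the base. The HTML below is a hypothetical sample written for this example, and `BASE_URL` and `link_list` are illustrative names.

```python
from urllib.parse import urljoin

import bs4

BASE_URL = 'https://github.com/'

# Hypothetical sample markup reusing the class names from the selector above.
SAMPLE_HTML = """
<ul>
  <li class="repo-list-item">
    <div class="f4 text-normal"><a href="/TheAlgorithms/Python">TheAlgorithms/Python</a></div>
  </li>
  <li class="repo-list-item">
    <div class="f4 text-normal"><a href="/geekcomputers/Python">geekcomputers/Python</a></div>
  </li>
</ul>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# urljoin resolves the leading-slash path against the base without doubling the slash.
link_list = [urljoin(BASE_URL, a['href'])
             for a in soup.select('.repo-list-item .f4 a[href]')]

print(link_list)
# ['https://github.com/TheAlgorithms/Python', 'https://github.com/geekcomputers/Python']
```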

【Comments】:

Thank you very much!

【Solution 2】:

This should get you very close.

import requests, bs4

#variables
linkList = []


#handle input
print('Your GitHub repository search query:')
userInput = input()

#get the results from the URL (requests URL-encodes the params)

results = requests.get('https://github.com/search',
                       params={'q': userInput, 'type': 'repositories'})
results.raise_for_status()

soup = bs4.BeautifulSoup(results.text, 'html.parser')

#walk each repository div, then each <a> inside it that has an href
divs = soup.find_all('div', {'class': 'f4 text-normal'})
for div in divs:
    a_tags = div.find_all('a', href=True)
    for a_tag in a_tags:
        linkList.append(a_tag['href'])

# Test
for link in linkList:
    print(link)
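One detail worth knowing about this approach: passing the full string 'f4 text-normal' as the class filter matches the class attribute as written, so a div whose classes appear in a different order is skipped, whereas a CSS selector matches regardless of order. The sketch below shows the difference on hypothetical sample markup; the repository paths and variable names are made up for illustration.

```python
import bs4

# Hypothetical markup: the second div lists the same classes in reverse order,
# and the first div also contains an <a> without an href.
SAMPLE_HTML = """
<div class="f4 text-normal"><a href="/owner-a/repo-a">a</a><a name="anchor">no href</a></div>
<div class="text-normal f4"><a href="/owner-b/repo-b">b</a></div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Nested find_all, as in the answer: the class filter is the exact string.
exact_links = []
for div in soup.find_all('div', {'class': 'f4 text-normal'}):
    for a_tag in div.find_all('a', href=True):  # href=True skips <a> tags without href
        exact_links.append(a_tag['href'])

# A CSS selector only requires that both classes are present, in any order.
css_links = [a['href'] for a in soup.select('div.f4.text-normal a[href]')]

print(exact_links)  # ['/owner-a/repo-a']
print(css_links)    # ['/owner-a/repo-a', '/owner-b/repo-b']
```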

【Comments】:

Thanks, @Fnatical. However, the next reply got me 100% of what I wanted using a different method.