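【Question title】:How to use Selenium in Python to scrape contributor names from a GitHub page URL
【Posted】:2020-06-17 20:05:08
【Question】:
I am trying to scrape the contributors of a specific file in the bitcoin project on GitHub (https://github.com/bitcoin/bitcoin/blob/master/.gitignore) using Selenium. The contributors are listed behind the "44 contributors" label on that page. My code only returns the contributor names if I manually go to the page and click "44 contributors"; otherwise the list of contributors comes back empty. Please help me make the code work without having to open the page and click "44 contributors" by hand. The original post showed screenshots of the page before and after clicking; my code is below: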
from selenium import webdriver
driver = webdriver.Chrome(r'C:\Users\saran\chromedriver_win32\chromedriver.exe')
driver.get('https://github.com/bitcoin/bitcoin/blob/master/.gitignore')
contributors=driver.find_elements_by_css_selector('div.Box-body.d-flex.flex-items-center.flex-auto.f6.border-bottom-0.flex-wrap >\
details#blob_contributors_box >\
details-dialog >\
ul >li.Box-row > a.link-gray-dark.no-underline')
contri_names=[]
for n in contributors:
    contri_names.append(n.get_attribute('innerText'))
【Comments】:
Tags:
python
selenium
web-scraping
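【Solution 1】:
I extracted the names from the contributors' profile links. The key step is to click the contributors box first so that the list is opened before its entries are read.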
# Import libraries (only webdriver is used by the code below)
from selenium import webdriver
#opening a chrome instance
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r"C:/selenium/chromedriver.exe")
#getting to the link
driver.get('https://github.com/bitcoin/bitcoin/blob/master/.gitignore')
#opening the names of the contributors
driver.find_element_by_xpath('//*[@id="blob_contributors_box"]').click()
#getting the elements
names=driver.find_elements_by_xpath('//*[@id="blob_contributors_box"]/details-dialog/ul/li/a')
# Collect the profile links of the contributors
ids = [name.get_attribute('href') for name in names]
# Extract the username from each profile link
ppl_names = []
for profile_url in ids:
    ppl_name = profile_url.replace('https://github.com/', '')
    ppl_names.append(ppl_name)
# Print the names
print(ppl_names)
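
The code above clicks the contributors box and reads the links straight away, which can race with the page filling in the list. It also uses the Selenium 3 style API; later Selenium 4 releases removed the `find_element_by_*` helpers and the `executable_path` argument. The following is only a sketch of the same approach with explicit waits, assuming Selenium 4.6 or newer and assuming the page still uses the `blob_contributors_box` element described in this answer (GitHub's markup may have changed since 2020). The function names `username_from_profile_url` and `scrape_contributors` are made up for this illustration.

```python
from urllib.parse import urlparse


def username_from_profile_url(url):
    """Return the first path segment of a github.com URL, or None."""
    parsed = urlparse(url or "")
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    segments = [s for s in parsed.path.split("/") if s]
    return segments[0] if segments else None


def scrape_contributors(page_url, timeout=10):
    """Open the contributors box on a GitHub blob page and return usernames.

    Assumption: the page contains an element with id "blob_contributors_box"
    whose dialog lists one profile link per contributor.
    """
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # Selenium 4.6+ can locate the driver itself
    try:
        driver.get(page_url)
        wait = WebDriverWait(driver, timeout)
        # Wait until the box can be clicked, then open it.
        box = wait.until(
            EC.element_to_be_clickable((By.ID, "blob_contributors_box"))
        )
        box.click()
        # Wait until at least one contributor link is present in the dialog.
        links = wait.until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "#blob_contributors_box details-dialog ul li a")
            )
        )
        hrefs = [link.get_attribute("href") for link in links]
    finally:
        driver.quit()

    # Turn profile links into usernames, skipping duplicates and non-GitHub links.
    names = []
    for href in hrefs:
        name = username_from_profile_url(href)
        if name and name not in names:
            names.append(name)
    return names
```

Calling `scrape_contributors('https://github.com/bitcoin/bitcoin/blob/master/.gitignore')` would then return the list of usernames, provided the selectors still match the live page. Parsing the link with `urlparse` instead of `str.replace` is a small design choice: it tolerates trailing slashes and ignores links that do not point at github.com.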