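【Posted】:2021-12-27 22:09:07
【Question】:
I've spent the past week working on this part of the code, and I now understand its logic much better. For context: I'm trying to scrape founder information (name, gender, school) for every electric-vehicle company on Crunchbase. I've worked out that I should build separate dictionaries, because some of the information sits in different sections of the page. Here is the code: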
#imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import TimeoutException
import pandas as pd
import time
#driver path
PATH = "C:/Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
#access crunchbase ui
driver.get("https://www.crunchbase.com/search/organizations/field/organization.companies/categories/electric-vehicle")
driver.maximize_window()
time.sleep(5)
print(driver.title)
time.sleep(3)
#await element location
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, ('//a[@aria-label="Next"][@aria-disabled="false"][@type="button"]'))))
#next page
page = driver.find_element_by_xpath('/html/body/chrome/div/mat-sidenav-container/mat-sidenav-content/div/search/page-layout/div/div/form/div[2]/results/div/div/div[1]/div/results-info/h3/a[2]')
company_list = [] ###create dictionary
counter = 0
for _ in range(2):
    if counter == 1:
        break
    counter += 1
    if page.is_displayed():
        time.sleep(25)
        #webscrape through iterations/rows
        all_rows = driver.find_elements_by_css_selector("grid-row")
        for row in all_rows:
            companyname = row.find_element_by_xpath('.//*[@class="identifier-label"]')
            companyname.click()
            time.sleep(10)
            ###founder info
            founders = driver.find_element_by_css_selector("body > chrome > div > mat-sidenav-container > mat-sidenav-content > div > ng-component > entity-v2 > page-layout > div > div > div > page-centered-layout:nth-child(3) > div > div > div.main-content > row-card:nth-child(2) > profile-section > section-card > mat-card > div.section-content-wrapper > div > fields-card:nth-child(1) > ul > li:nth-child(4) > field-formatter > identifier-multi-formatter > span > a")
            ActionChains(driver).move_to_element(founders).perform()
            founders.click()
            f1 = {
                'founder name': driver.find_element_by_xpath('.//*[@class="profile-name"]').text.strip(),
                'founder gender': driver.find_element_by_css_selector('body > chrome > div > mat-sidenav-container > mat-sidenav-content > div > ng-component > entity-v2 > page-layout > div > div > div > page-centered-layout.ng-star-inserted > div > div > div.main-content > row-card:nth-child(1) > profile-section > section-card > mat-card > div.section-content-wrapper > div > fields-card:nth-child(3) > ul > li:nth-child(3) > field-formatter > span').text.strip(),
            }
            fschool = driver.find_element_by_css_selector('body > chrome > div > mat-sidenav-container > mat-sidenav-content > div > ng-component > entity-v2 > page-layout > div > div > div > page-centered-layout.ng-star-inserted > div > div > div.main-content > row-card:nth-child(7) > profile-section > section-card > mat-card > div.section-content-wrapper > div > image-list-card > ul > li > div > field-formatter:nth-child(5) > span')
            ActionChains(driver).move_to_element(fschool).perform()
            f2 = {
                'school': driver.find_element_by_css_selector('body > chrome > div > mat-sidenav-container > mat-sidenav-content > div > ng-component > entity-v2 > page-layout > div > div > div > page-centered-layout.ng-star-inserted > div > div > div.main-content > row-card:nth-child(7) > profile-section > section-card > mat-card > div.section-content-wrapper > div > image-list-card > ul > li > div > a').text.strip(),
                'degree type': driver.find_element_by_css_selector('body > chrome > div > mat-sidenav-container > mat-sidenav-content > div > ng-component > entity-v2 > page-layout > div > div > div > page-centered-layout.ng-star-inserted > div > div > div.main-content > row-card:nth-child(7) > profile-section > section-card > mat-card > div.section-content-wrapper > div > image-list-card > ul > li > div > field-formatter:nth-child(2) > span').text.strip(),
                'degree': driver.find_element_by_css_selector('body > chrome > div > mat-sidenav-container > mat-sidenav-content > div > ng-component > entity-v2 > page-layout > div > div > div > page-centered-layout.ng-star-inserted > div > div > div.main-content > row-card:nth-child(7) > profile-section > section-card > mat-card > div.section-content-wrapper > div > image-list-card > ul > li > div > field-formatter:nth-child(3) > span').text.strip()
            }
            f3 = {**f1, **f2}
            print(f3)
            company_list.append(f1)
        print("next")
        page.click()
#create dataframe
df = pd.DataFrame(company_list)
print(df)
#create excel writer object
writer=pd.ExcelWriter('crunchbasedemo.xlsx')
#export to excel
df.to_excel(writer)
writer.save()
print("It's alive!")
For some reason f3 (the merged f1 and f2 dictionaries) never prints; whenever execution reaches the print statement I get this error:
StaleElementReferenceException: stale element reference: element is not attached to the page document
Any ideas?
Edited code:
hrefs=[x.get_attribute('href') for x in driver.find_elements_by_xpath('//a[@class="component--field-formatter field-type-identifier link-accent ng-star-inserted"]')]
names=[x.get_attribute('title') for x in driver.find_elements_by_xpath('//a[@class="component--field-formatter field-type-identifier link-accent ng-star-inserted"]')]
print(names)
print(hrefs)
company_list=[]
for href in hrefs:
    driver.get(href)
    try:
        founders=[x.get_attribute('href') for x in driver.find_elements_by_xpath("//li[@class='ng-star-inserted' and contains(.,'Founders')]//a[@class='link-accent ng-star-inserted']")]
        founder_names = [x.get_attribute('title') for x in driver.find_elements_by_xpath("//li[@class='ng-star-inserted' and contains(.,'Founders')]//a[@class='link-accent ng-star-inserted']")]
        print(founder_names)
        for founder in founders:
            driver.get(founder)
            try:
                fschool = driver.find_elements_by_xpath("(//li[@class='ng-star-inserted']//a[@class='link-accent'])[5]")
                ActionChains(driver).move_to_element(fschool).perform()
                print(fschool)
            except:
                pass
    except:
        pass
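【Discussion】:
-
Where is f3 (the merged f1 and f2 dictionaries)? The dictionary should be
company_list = {}
-
@DebanjanB Scroll down in the code.
-
Each click may have some effect on the DOM. That would make your "all_rows" references stale.
-
Any suggestions? @pcalkins
-
I'd suggest getting the hrefs first, then calling driver.get on those hrefs.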
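The last suggestion can be sketched roughly as follows. The idea is that an href is a plain string, so it stays usable after navigation, whereas a stored element reference does not. The helper names collect_hrefs and visit_each are hypothetical and not part of Selenium; the only driver methods assumed are find_elements, get, and get_attribute on the returned elements. The locator strategy is passed as the string "xpath", which is the value of Selenium's By.XPATH.

```python
from typing import Any, Callable, List

XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH


def collect_hrefs(driver: Any, xpath: str) -> List[str]:
    """Copy every matching link's href into a list of plain strings.

    This runs before any navigation, so no element reference has to
    survive a page change.
    """
    elements = driver.find_elements(XPATH, xpath)
    hrefs = [element.get_attribute("href") for element in elements]
    # Skip anchors that carry no href attribute.
    return [href for href in hrefs if href]


def visit_each(driver: Any, hrefs: List[str],
               extract: Callable[[Any], dict]) -> List[dict]:
    """Open each URL in turn and run `extract` on the freshly loaded page.

    `extract` is a hypothetical callback that locates its own elements
    after the page has loaded and returns a dict of scraped fields.
    """
    results = []
    for href in hrefs:
        driver.get(href)
        results.append(extract(driver))
    return results
```

With a real driver this might be called as collect_hrefs(driver, '//a[@class="component--field-formatter field-type-identifier link-accent ng-star-inserted"]'), reusing the XPath from the edited code, and the resulting list then passed to visit_each together with a function that reads the founder fields. Because every lookup happens after driver.get, nothing located on the results grid is touched again once the page has changed.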
Tags: python html pandas selenium selenium-webdriver