Webscraping the data of a graph using Python: answers

【Question title】: Webscraping the data of a graph using python
【Posted】: 2020-11-09 09:05:40
【Question description】:

I want to scrape the data of a graph that can be found on this webpage. To do so, I am using Selenium in Python (PyCharm). So far, this is my code:

from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
Researcher=driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""") .click()
Graph=driver.find_elements_by_id("gsc_md_hist_b")
print(Graph.text)

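The code works fine until it has to get the information from the graph (the years and the citations per year); the response is that there is no text to scrape. Could you give me some ideas on how to scrape the information I need?

Thank you very much in advance, Iván

【Question comments】:

You can also look directly for the <span> elements with class gsc_g_t for the years, while the citation counts are in <span class="gsc_g_al"> </span>.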
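
As a rough illustration of the class-based lookup suggested in this comment, the sketch below pulls the text of the gsc_g_t and gsc_g_al spans out of an HTML string using Python's standard html.parser module. The two class names come from the comment; ChartSpanParser, SAMPLE_HTML, the wrapping tags and the numbers in the sample are made up for illustration and are not the real page markup.

```python
from html.parser import HTMLParser


class ChartSpanParser(HTMLParser):
    """Collects the text of <span class="gsc_g_t"> and <span class="gsc_g_al">."""

    def __init__(self):
        super().__init__()
        self.years = []      # text of the year labels
        self.counts = []     # text of the citation-count labels
        self._target = None  # list that the current span's text goes into

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "gsc_g_t" in classes:
            self._target = self.years
        elif "gsc_g_al" in classes:
            self._target = self.counts

    def handle_data(self, data):
        # Only keep non-blank text found inside one of the two span types.
        if self._target is not None and data.strip():
            self._target.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._target = None


# Hypothetical sample markup: the structure and values are assumptions.
SAMPLE_HTML = """
<div class="gsc_md_hist_b">
  <span class="gsc_g_t">2018</span>
  <span class="gsc_g_t">2019</span>
  <a class="gsc_g_a"><span class="gsc_g_al">5</span></a>
  <a class="gsc_g_a"><span class="gsc_g_al">9</span></a>
</div>
"""

parser = ChartSpanParser()
parser.feed(SAMPLE_HTML)
print(parser.years)   # ['2018', '2019']
print(parser.counts)  # ['5', '9']
```

With Selenium, the same parser could presumably be fed driver.page_source instead of SAMPLE_HTML, assuming the chart markup is present in the page source at that point.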

Tags: python selenium selenium-webdriver xpath webdriverwait

【Solution 1】:

To extract the years, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:

Using XPATH:

driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='gsc_rsb_cit']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='gsc_md_hist_c']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']//span[@class='gsc_g_t']")))])

Console output:

['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

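【Discussion】:

Thank you so much, @DebanjanB! I did manage to scrape the years, but I am having trouble scraping the information on the bars (the citation counts). Do you have any suggestions for achieving that? Again, thank you very much, Iván
@Iván This answer was constructed based on your code trials. Yes, I have a solution for the citation counts as well, but I am afraid you will have to raise a new ticket along with your code trials. Hint: you have to mouse hover.
@Iván If any answer has solved your question, please accept it by clicking the hollow check mark beside the answer, just below the vote-down arrow, so that the check mark turns green.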

【Solution 2】:

You can try using an XPath with the class attribute and get the text of all the spans as a list. Please check the untested code below:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Path to geckodriver; it is not passed to the driver below, so geckodriver must be discoverable (e.g. on PATH).
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
# Click the citation chart (the find_element_by_* helpers were removed in Selenium 4, hence By.XPATH).
driver.find_element(By.XPATH, """//*[@id="gsc_rsb_cit"]/div/div[3]/div""").click()
#Graph=driver.find_elements_by_id("gsc_md_hist_b")
#Graph=driver.find_elements_by_xpath('//div[@class=".gsc_md_hist_b"]//span[@class=".gsc_g_t"]')
# Year labels of the chart.
Graph = driver.find_elements(By.XPATH, "//span[@class='gsc_g_t']")

for spanText in Graph:
    print(spanText.text)

# Citation-count labels of the bars.
BarValue = driver.find_elements(By.XPATH, "//span[@class='gsc_g_al']")
for barValueText in BarValue:
    print(barValueText.text)

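【Discussion】:

Thank you very much, Ashish Karn! Do you know how I could scrape the information on the bars (the citation counts)? I am having a hard time scraping that information. Thank you very much, Iván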
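
One possible approach to the citation counts, sketched under assumptions and not tested against the live page: Selenium's .text only returns text that is currently visible, which would explain empty strings if the bar labels stay hidden until they are hovered. Reading the textContent attribute returns the text regardless of visibility. The code uses Selenium 4 locator syntax; scrape_citation_counts and to_int_list are hypothetical helper names, and the assumption that the gsc_g_al spans hold the counts comes from the comment on the question.

```python
def to_int_list(raw_texts):
    """Convert label strings such as ' 12 ' to ints, skipping empty or missing labels."""
    return [int(t.strip()) for t in raw_texts if t and t.strip()]


def scrape_citation_counts(url, timeout=20):
    """Return (years, counts) read from the chart spans of a profile page.

    Assumes the gsc_g_t / gsc_g_al spans are present in the DOM once the page loads.
    """
    # Selenium is imported lazily so that to_int_list can be used without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait for presence in the DOM, not visibility: the labels may be hidden.
        year_elems = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "span.gsc_g_t"))
        )
        bar_elems = driver.find_elements(By.CSS_SELECTOR, "span.gsc_g_al")
        # textContent is returned even when the element is not displayed.
        year_texts = [e.get_attribute("textContent") for e in year_elems]
        count_texts = [e.get_attribute("textContent") for e in bar_elems]
    finally:
        driver.quit()
    return to_int_list(year_texts), to_int_list(count_texts)
```

A call might look like years, counts = scrape_citation_counts("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en"). The two lists may differ in length if some year has no bar in the chart, so it is safer to check the lengths before pairing them up.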