【Question title】: Cannot scrape certain DIV tags using python selenium
【Posted】: 2019-12-05 18:17:13
【Question】:

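I am trying to extract articles (title/link) on a given topic (for example, machine learning) from this website: https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance

The div tags I need to access are nested under several other div tags.

This is what I have tried so far. I get empty lists. Any help is appreciated.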

import time
from selenium import webdriver

# Get all the paper URLs in the search results
def paper_crawler():
    driver = webdriver.Firefox('path')
    driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry')
    result_counts = driver.find_elements_by_xpath('//*[@class="result-count"]')
    print(result_counts)
    for item in result_counts:
        count = item.text
        print(count)
    #search_result_urls = driver.find_elements_by_xpath('.//div[contains(@class,"result-page")]/article/header/div/a')
    search_result_urls = driver.find_elements_by_xpath('//*[@class="result-page"]/article/header/div/a')
    print(search_result_urls)
    for item in search_result_urls:
        paper_url = item.get_attribute('href')
        print(paper_url)
    search_result_titles = driver.find_elements_by_xpath('//*[@class="result-page"]/article/header/div/a/span')
    for item in search_result_titles:
        paper_title = item.text
        print(paper_title)
    time.sleep(2)

if __name__ == '__main__':
    paper_crawler()

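【Comments】:

  • The post title says you are looking for div tags, but the code is looking for a and span tags. Which is it?
  • Use the API and make your life easier!!! semanticscholar.org/api/1/search

Tags: python selenium web-scraping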


【Solution 1】:

Better to use the API and make your life easier. Parse whatever you want from the response.

import requests


data = {
    "queryString": "machine learning",
    "page": 1,
    "pageSize": 10,
    "sort": "relevance",
    "authors": [],
    "coAuthors": [],
    "venues": [],
    "yearFilter": None,
    "requireViewablePdf": False,
    "publicationTypes": [],
    "externalContentTypes": []
}
r = requests.post(
    'https://www.semanticscholar.org/api/1/search', json=data).json()

print(r)
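
If the request body above is reused for several queries or pages, it may help to build it with a small function. The following is only a sketch: `build_search_payload` and `fetch_page` are hypothetical helper names, and the field names are copied from the request above rather than from any official documentation. The endpoint is the one observed in this answer, so it may change or stop accepting this body without notice.

```python
def build_search_payload(query, page=1, page_size=10, sort="relevance"):
    """Return the JSON body used in the answer above for one page of results.

    Hypothetical helper: the keys mirror the request shown above.
    """
    return {
        "queryString": query,
        "page": page,
        "pageSize": page_size,
        "sort": sort,
        "authors": [],
        "coAuthors": [],
        "venues": [],
        "yearFilter": None,
        "requireViewablePdf": False,
        "publicationTypes": [],
        "externalContentTypes": [],
    }


def fetch_page(query, page=1):
    """POST one page of the search and return the decoded JSON.

    Requires the third-party `requests` package; imported here so the
    payload builder can be used without it.
    """
    import requests

    response = requests.post(
        "https://www.semanticscholar.org/api/1/search",
        json=build_search_payload(query, page=page),
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Calling `fetch_page("machine learning", page=2)` would then request the second page with the same filters. The layout of the returned JSON is not described in this thread, so inspect it before relying on any particular key.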

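【Comments】:

  • Flawless. But the OP only provided the link https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry, so where did you find https://www.semanticscholar.org/api/1/search?
  • @DebanjanB That question shouldn't be coming from a 66k rep user :P i.ibb.co/sJpZn3d/Capture.png
  • @DebanjanB I'm not the author of this answer, but I'd imagine it was found by inspecting the network activity while searching on the site.
  • @DebanjanB Looks like it, although I'm a Safari user ;)
  • @DebanjanB Choose Safari > Preferences, click Advanced, then select "Show Develop menu in menu bar". Then open the Network tab.
【Solution 2】: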

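When you start searching for elements, the page has loaded but has not been fully rendered yet.

Adding "time.sleep(5)" right after

driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry') should help as a quick workaround.

For a better and more robust solution, you should wait up to a few seconds until result_counts contains more than 0 elements, or until the page turns out to be an error page (https://www.semanticscholar.org/search?q=learning333&sort=relevance&fos=chemistry).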
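
The more robust wait described above could be sketched as a small polling loop. `wait_for_results` and its parameters are hypothetical names used for illustration; the two callables stand in for whatever checks you use for "results are present" and "this is the error page". Selenium's own `WebDriverWait(driver, timeout).until(...)` does the same kind of polling and is usually the better choice in real code.

```python
import time


def wait_for_results(has_results, is_error_page, timeout=10.0, poll=0.5):
    """Poll until results appear or the error page is detected.

    Returns "results" or "error" as soon as the matching check is truthy,
    and "timeout" if neither happens within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if has_results():
            return "results"
        if is_error_page():
            return "error"
        if time.monotonic() >= deadline:
            return "timeout"
        # Sleep briefly so the browser is not queried in a tight loop.
        time.sleep(poll)
```

With the code from the question, `has_results` could be `lambda: len(driver.find_elements_by_xpath('//*[@class="result-count"]')) > 0`. The check for the error page depends on that page's markup, which is not shown in this thread.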

【Comments】:

    【Solution 3】:

    To extract the Title and HREF attributes of the articles, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategies:

    • Code block:

      from selenium import webdriver
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      
      options = webdriver.ChromeOptions() 
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
      driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry')
      my_titles = [my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-selenium-selector='title-link']>span")))]
      my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-selenium-selector='title-link']")))]
      for i,j in zip(my_titles, my_hrefs):
          print("{} link is {}".format(i, j))
      driver.quit()
      
    • Console output:

      UCI Repository of Machine Learning Databases link is https://www.semanticscholar.org/paper/UCI-Repository-of-Machine-Learning-Databases-Blake/e068be31ded63600aea068eacd12931efd2a1029
      Energy landscapes for machine learning. link is https://www.semanticscholar.org/paper/Energy-landscapes-for-machine-learning.-Ballard-Das/735d4099d3be0d919ddedb054043e6763205e0f7
      Finding Nature′s Missing Ternary Oxide Compounds Using Machine Learning and Density Functional Theory. link is https://www.semanticscholar.org/paper/Finding-Nature%E2%80%B2s-Missing-Ternary-Oxide-Compounds-Hautier-Fischer/e3ab9e1162fc8f63d215dfdb21801ef5e1fde7b5
      Distributed secure quantum machine learning link is https://www.semanticscholar.org/paper/Distributed-secure-quantum-machine-learning-Sheng-Zhou/ef944614bfc82b1dedfea19ff249a97ceea5ad90
      Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. link is https://www.semanticscholar.org/paper/Neural-Symbolic-Machine-Learning-for-Retrosynthesis-Segler-Waller/71cc9eefb17d7c4d1062162523b5fdad7ca66a2a
      Transferable Machine-Learning Model of the Electron Density link is https://www.semanticscholar.org/paper/Transferable-Machine-Learning-Model-of-the-Electron-Grisafi-Fabrizio/f809258b65a00a06f9584e76620e6c6395cf81eb
      Crystal structure representations for machine learning models of formation energies link is https://www.semanticscholar.org/paper/Crystal-structure-representations-for-machine-of-Faber-Lindmaa/1bdca98dc8c730ee92d5b19d2973a5bf461a500a
      Machine learning for quantum mechanics in a nutshell link is https://www.semanticscholar.org/paper/Machine-learning-for-quantum-mechanics-in-a-Rupp/29b9ff8f4a26acc90e6182e1e749f15f688bc7cf
      Machine-Learning-Augmented Chemisorption Model for CO2 Electroreduction Catalyst Screening. link is https://www.semanticscholar.org/paper/Machine-Learning-Augmented-Chemisorption-Model-for-Ma-Li/d6f30032c8fac43a8eabf2b67d2e84db6d3d0409
      Adaptive machine learning framework to accelerate ab initio molecular dynamics link is https://www.semanticscholar.org/paper/Adaptive-machine-learning-framework-to-accelerate-Botu-Ramprasad/c9934d684fcc0b8ac6ed25b34d96e726cf2d7b99
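
The two list comprehensions above wait twice and then pair titles with links by position. One possible alternative, sketched below, is to locate the `a` elements once and read both values from each element. `collect_title_links` is a hypothetical helper, not part of Selenium; it assumes each element exposes `.text` and `.get_attribute("href")` the way Selenium WebElements do, and that the anchor's visible text is the title (the title `span` is its child in the selector above).

```python
def collect_title_links(anchors):
    """Return (title, href) pairs from anchor-like elements.

    Elements with empty text or without an href are skipped.
    """
    pairs = []
    for anchor in anchors:
        title = (anchor.text or "").strip()
        href = anchor.get_attribute("href")
        if title and href:
            pairs.append((title, href))
    return pairs
```

It could be fed with the result of a single wait, for example `collect_title_links(WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-selenium-selector='title-link']"))))`, so that each title stays attached to the link it came from.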
      

    【Comments】:
