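[Posted]: 2019-12-05 18:17:13
[Question]:
I am trying to extract articles (titles/links) on a given topic (e.g. machine learning) from this website: https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance
The div tags I need to access are nested under several other div tags.
Here is what I have tried so far. I get empty lists. Any help is appreciated.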
import time
from selenium import webdriver


# Get all the paper URLs in the search results
def paper_crawler():
    driver = webdriver.Firefox('path')
    driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry')

    # Number of results reported by the page
    result_counts = driver.find_elements_by_xpath('//*[@class="result-count"]')
    print(result_counts)
    for item in result_counts:
        count = item.text
        print(count)

    # Link of each paper in the result list
    #search_result_urls = driver.find_elements_by_xpath('.//div[contains(@class,"result-page")]/article/header/div/a')
    search_result_urls = driver.find_elements_by_xpath('//*[@class="result-page"]/article/header/div/a')
    print(search_result_urls)
    for item in search_result_urls:
        paper_url = item.get_attribute('href')
        print(paper_url)

    # Title of each paper in the result list
    search_result_titles = driver.find_elements_by_xpath('//*[@class="result-page"]/article/header/div/a/span')
    for item in search_result_titles:
        paper_title = item.text
        print(paper_title)
    time.sleep(2)


if __name__ == '__main__':
    paper_crawler()
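
One possible reason for the empty lists is that the search results are likely rendered by JavaScript after the initial page load, so the XPath queries run before the elements exist. The following is a minimal sketch of an explicit wait, not a confirmed fix. Assumptions: the XPath from the code above still matches the live page, geckodriver is discoverable on PATH, and the names build_search_url, crawl_papers, and the 15-second timeout are hypothetical and chosen only for illustration.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.semanticscholar.org/search"
# XPath copied from the question; whether it still matches the page is an assumption.
LINK_XPATH = '//*[@class="result-page"]/article/header/div/a'


def build_search_url(topic, sort="relevance"):
    """Build the search URL from the question for an arbitrary topic."""
    # quote_via=quote encodes spaces as %20, matching the URL in the question.
    query = urlencode({"q": topic, "sort": sort}, quote_via=quote)
    return f"{BASE_URL}?{query}"


def crawl_papers(topic, timeout=15):
    """Return (title, url) pairs once the result links are present in the DOM."""
    # Imported lazily so build_search_url can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumes geckodriver is on PATH
    try:
        driver.get(build_search_url(topic))
        # Wait until at least one result link exists instead of querying
        # immediately; raises TimeoutException if none appears in time.
        links = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, LINK_XPATH))
        )
        return [(link.text, link.get_attribute("href")) for link in links]
    finally:
        driver.quit()
```

Calling crawl_papers("machine learning") would then return a list of (title, link) pairs, or raise a TimeoutException if no matching element appears, which would suggest that the XPath itself needs adjusting.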
[Comments]:
- The post title says you are looking for div tags, but the code is looking for a and span tags. Which is it?
- Use the API and make your life easier!!! semanticscholar.org/api/1/search
Tags: python selenium web-scraping