【Question title】: BeautifulSoup, Google Scholar, Authors names, affiliations and citations too
【Posted】: 2015-03-12 12:45:00
【Question】:

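I want to get the names of all authors from Google Scholar. My base URL is http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security, so basically I am looking for authors who have written anything about security.

I wrote a Python script using BeautifulSoup, but (I don't know why) it prints an empty list because it does not find any of the given elements (even though I can see the <div class="gsc_1usr_text"> elements when I view the page source).

Here is my code: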

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
mydivs = soup.findAll("div", { "class" : "gsc_1usr_text" })
print mydivs

The output is []. print "LEN = " + str(len(mydivs)) shows 0.

I am using Python 2.7.3 on Linux Mint 13.

【Comments】:

  • @AvinashRaj: Interesting! Could you show me your output? I only get an empty list and I don't know why :(

Tags: python beautifulsoup google-scholar


【Answer 1】:

Your code works for me.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
mydivs = soup.findAll("div", { "class" : "gsc_1usr_text" })
print mydivs

Output:

[<div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=n-Oret4AAAAJ&amp;hl=pl&amp;oe=Latin2">Adrian Perrig</a></h3><div class="gsc_1usr_aff">Professor of Computer Science at ETH Zürich, Adjunct Professor of ECE and EPP at CMU</div><div class="gsc_1usr_eml">Zweryfikowany adres z inf.ethz.ch</div><div class="gsc_1usr_emlb">@inf.ethz.ch</div><div class="gsc_1usr_cby">Cytowane przez 40938</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:networking">Networking</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:operating_systems">Operating Systems</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:computer_security">Computer Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:internet_security">Internet Security</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=HvwPRJ0AAAAJ&amp;hl=pl&amp;oe=Latin2">Vern Paxson</a></h3><div class="gsc_1usr_aff">Professor, EECS, University of California, Berkeley</div><div class="gsc_1usr_eml">Zweryfikowany adres z berkeley.edu</div><div class="gsc_1usr_emlb">@berkeley.edu</div><div class="gsc_1usr_cby">Cytowane przez 39914</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:networking">Networking</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:measurement">Measurement</a> </div></div>, <div 
class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=2pW1g5IAAAAJ&amp;hl=pl&amp;oe=Latin2">Mihir Bellare</a></h3><div class="gsc_1usr_aff">Professor, Department of Computer Science and Engineering, UCSD</div><div class="gsc_1usr_eml">Zweryfikowany adres z eng.ucsd.edu</div><div class="gsc_1usr_emlb">@eng.ucsd.edu</div><div class="gsc_1usr_cby">Cytowane przez 35459</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:cryptography">Cryptography</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:complexity_theory">Complexity Theory</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=FCsdj0YAAAAJ&amp;hl=pl&amp;oe=Latin2">Wenyuan Xu</a></h3><div class="gsc_1usr_aff">Assistant Profess of Department of Computer Science and Engineering, University of South  …</div><div class="gsc_1usr_cby">Cytowane przez 32521</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:wireless_networks">Wireless Networks</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:jamming_defenses">jamming defenses</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:dependable_systems">dependable systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=vWTI60AAAAAJ&amp;hl=pl&amp;oe=Latin2">Martin Abadi</a></h3><div class="gsc_1usr_aff">Principal Scientist, Google, and Professor Emeritus, UC Santa Cruz</div><div 
class="gsc_1usr_eml">Zweryfikowany adres z cs.ucsc.edu</div><div class="gsc_1usr_emlb">@cs.ucsc.edu</div><div class="gsc_1usr_cby">Cytowane przez 29938</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:programming_languages_and_systems">programming languages and systems</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:specification_and_verification">specification and verification</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=lOZ1vHIAAAAJ&amp;hl=pl&amp;oe=Latin2">Sushil Jajodia</a></h3><div class="gsc_1usr_aff">University Professor, BDM International Professor, and Director, Center for Secure  …</div><div class="gsc_1usr_eml">Zweryfikowany adres z gmu.edu</div><div class="gsc_1usr_emlb">@gmu.edu</div><div class="gsc_1usr_cby">Cytowane przez 29705</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:privacy">privacy</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:database">database</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:databases">databases</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:distributed_systems">distributed systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=Z_enRVYAAAAJ&amp;hl=pl&amp;oe=Latin2">Xiaolan Zhang</a></h3><div class="gsc_1usr_aff">IBM</div><div class="gsc_1usr_eml">Zweryfikowany 
adres z us.ibm.com</div><div class="gsc_1usr_emlb">@us.ibm.com</div><div class="gsc_1usr_cby">Cytowane przez 27321</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:virtualization">Virtualization</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:systems">Systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=W7YBLlEAAAAJ&amp;hl=pl&amp;oe=Latin2">Jean-Pierre Hubaux</a></h3><div class="gsc_1usr_aff">Professor, EPFL</div><div class="gsc_1usr_eml">Zweryfikowany adres z epfl.ch</div><div class="gsc_1usr_emlb">@epfl.ch</div><div class="gsc_1usr_cby">Cytowane przez 24738</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:privacy">Privacy</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:networking">Networking</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=WgyDcoUAAAAJ&amp;hl=pl&amp;oe=Latin2">Ross Anderson</a></h3><div class="gsc_1usr_aff">University of Cambridge</div><div class="gsc_1usr_eml">Zweryfikowany adres z cl.cam.ac.uk</div><div class="gsc_1usr_emlb">@cl.cam.ac.uk</div><div class="gsc_1usr_cby">Cytowane przez 24445</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:cryptology">cryptology</a> <a class="gsc_co_int" 
href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:dependability">dependability</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:technology_policy">technology policy</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=lsKlsJ8AAAAJ&amp;hl=pl&amp;oe=Latin2">Heejo Lee</a></h3><div class="gsc_1usr_aff">Professor of Computer Science, Korea University</div><div class="gsc_1usr_eml">Zweryfikowany adres z korea.ac.kr</div><div class="gsc_1usr_emlb">@korea.ac.kr</div><div class="gsc_1usr_cby">Cytowane przez 23596</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:network">network</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">security</a> </div></div>]
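
The output above already contains the name, affiliation and citation count for each author. A minimal sketch of pulling those three fields out of such markup follows. It is written for Python 3 and deliberately swaps in the standard-library html.parser instead of BeautifulSoup so that it runs without extra installs. The names AuthorParser and parse_authors are hypothetical, and the class names are the ones visible in the 2015 output, which Google may have changed since.

```python
import re
from html.parser import HTMLParser


class AuthorParser(HTMLParser):
    """Collect name, affiliation and citation text from author cards."""

    # CSS class seen in the output above -> key in the result dict
    FIELDS = {
        "gsc_1usr_name": "name",
        "gsc_1usr_aff": "affiliation",
        "gsc_1usr_cby": "cited_by",
    }

    def __init__(self):
        super().__init__()
        self.authors = []
        self._field = None  # key currently being filled, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if css_class == "gsc_1usr_text":
            self.authors.append({})  # a new author card starts here
        elif css_class in self.FIELDS and self.authors:
            self._field = self.FIELDS[css_class]
            self.authors[-1][self._field] = ""

    def handle_endtag(self, tag):
        # The fields of interest are closed by </h3> or </div>.
        if tag in ("div", "h3"):
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.authors[-1][self._field] += data


def parse_authors(html):
    """Return a list of dicts with name, affiliation and cited_by (int or None)."""
    parser = AuthorParser()
    parser.feed(html)
    parser.close()
    for author in parser.authors:
        # "Cytowane przez 40938" -> 40938; the label wording depends on hl=
        match = re.search(r"\d+", author.get("cited_by", ""))
        author["cited_by"] = int(match.group()) if match else None
    return parser.authors
```

Calling parse_authors(content) on the downloaded page should give one dict per author card. With BeautifulSoup the same class names could be passed to find_all instead.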

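【Comments】:

  • How is that possible? You got all the authors I need! Which OS and Python version are you using? And what should I do to get the same result? (I mean any result at all; right now it only prints an empty list...)
  • OS: Ubuntu 14.04, Python 2.7, since print mydivs does not work in 3+.
  • So what should I do? Even for e in mydivs : print e does not work; it prints no divs. I have Python 2.7.3.
  • Did you import the urllib2 module?
  • I really don't know what causes this. I installed Mint 17 in a VM and it seems to work there too, but I would still like to solve this on Mint 13.
【Answer 2】: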

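You may have sent too many requests, or Google may have detected your script as an automated one.

The first thing you can try is adding a proxy to your request: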

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
  'http': os.getenv('HTTP_PROXY')  # Or just type your proxy here without os.getenv()
}
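
A proxies dictionary only takes effect once it is passed to the request. Here is a minimal sketch of assembling the keyword arguments, assuming the requests library is what sends the request. The helper name build_request_kwargs, the User-Agent string and the 10-second timeout are illustrative assumptions, not part of the original answer.

```python
import os


def build_request_kwargs(proxy_url=None, user_agent="Mozilla/5.0"):
    """Return keyword arguments that can be passed to requests.get()."""
    # Fall back to the HTTP_PROXY environment variable when no proxy is given.
    proxy_url = proxy_url or os.getenv("HTTP_PROXY")
    kwargs = {
        "headers": {"User-Agent": user_agent},  # assumed header value
        "timeout": 10,                          # assumed timeout in seconds
    }
    if proxy_url:
        # Route both plain and TLS traffic through the same proxy.
        kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
    return kwargs


# Usage (needs the requests package and network access):
# import requests
# response = requests.get(url, **build_request_kwargs())
```

Leaving the proxies key out entirely when no proxy is configured avoids passing a None value for it.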

Alternatively, you can skip the proxy and render the whole HTML page with requests-html or selenium, but you may still get a CAPTCHA.

Code that makes it work (I tested it locally):

# If you get an empty array, you got a CAPTCHA from Google.
# Print the response to see what causes it.
# Note: code below doesn't do pagination. https://requests-html.kennethreitz.org/#pagination

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security'
response = session.get(url)
# https://requests-html.kennethreitz.org/#requests_html.HTML.render
response.html.render(sleep=1)

for author_name in response.html.find('.gs_ai_name'):
    name = author_name.text
    print(name)

Output:

Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
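
The comments in the code above suggest printing the response when the result comes back empty. Below is a rough sketch of automating that check on the raw HTML. The function name diagnose is hypothetical, and the marker strings "captcha" and "unusual traffic" are heuristic assumptions rather than anything Google documents, so treat the result as a hint only.

```python
def diagnose(html, expected_class="gs_ai_name"):
    """Classify a response body as 'ok', 'blocked' or 'unknown'."""
    if expected_class in html:
        return "ok"        # the elements we want to scrape are present
    lowered = html.lower()
    if "captcha" in lowered or "unusual traffic" in lowered:
        return "blocked"   # looks like an anti-bot page (heuristic)
    return "unknown"       # empty for some other reason, inspect manually
```

For the requests-html code above this could be called as diagnose(response.html.html) before looping over the results.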

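Alternatively, you can use the Google Scholar Profiles API from SerpApi. It is a paid API with a trial of 5,000 searches. A completely free trial is currently in development.

The main difference is that you don't have to think about solving CAPTCHAs, put up with slow scraping caused by page rendering, or strain your PC with multiple browser instances as you would with, for example, selenium.

Code to integrate: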

from serpapi import GoogleSearch

params = {
  "engine": "google_scholar_profiles",
  "hl": "en",
  "mauthors": "label:security",
  "api_key": "YOUR_API_KEY"
}

search = GoogleSearch(params)
results = search.get_dict()

for author_name in results['profiles']:
    name = author_name['name']
    print(name)

Output:

Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia

Partial JSON output:

"profiles": [
  {
    "name": "Johnson Thomas",
    "link": "https://scholar.google.com/citations?hl=en&user=eKLr0EgAAAAJ",
    "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=en",
    "author_id": "eKLr0EgAAAAJ",
    "affiliations": "Professor of Computer Science, Oklahoma State University",
    "email": "Verified email at cs.okstate.edu",
    "cited_by": 150263,
    "interests": [
      {
        "title": "Security",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:security"
      }
    ]
  }
]
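
Since the question also asks for affiliations and citations, here is a small sketch of reading those fields from a dictionary shaped like the partial JSON above. The function name summarize_profiles is hypothetical. Only keys visible in that JSON are used, and the optional ones are read with .get() on the assumption that some profiles may lack them.

```python
def summarize_profiles(results):
    """Return name, affiliations, cited_by and interest titles per profile."""
    rows = []
    for profile in results.get("profiles", []):
        rows.append({
            "name": profile["name"],
            "affiliations": profile.get("affiliations"),
            "cited_by": profile.get("cited_by"),
            # Keep only the human-readable title of each interest.
            "interests": [i["title"] for i in profile.get("interests", [])],
        })
    return rows
```

With the integration code above this would be called as summarize_profiles(results) after search.get_dict().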

Disclaimer: I work for SerpApi.

【Comments】:
