Answers to: Difficulties when web scraping job offers on LinkedIn

【Question title】: Difficulties when web scraping job offers on LinkedIn
【Posted】: 2020-11-03 10:53:50
【Question】:

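I have been trying to scrape the job offers section of LinkedIn for a while now, to no avail. By the way, I know the site has its own API, but I want to do this with Beautiful Soup because I learned it a while ago and this is for practice.

Here is my code: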

import requests
from bs4 import BeautifulSoup

client = requests.Session()

HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/login/en'
URL = 'https://www.linkedin.com/jobs/search/?geoId=101174742&keywords=data%20analyst&location=Canada'

html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, "html.parser")

login_information = {
    'session_key':'<username>',
    'session_password':'<password>',
    'loginCsrfParam': '<csrftoken>',
}
try:
    p = client.post(LOGIN_URL, data=login_information)
    print("Login Successful")
except:
    print("Failed to Login")

Everything goes fine up to this point. I get "Login Successful", but when I check the status code, I get 403:

p.status_code
Output: 403

Of course, I can't scrape any information. How can I do this the right way?
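
A note on the "Login Successful" message: by default `requests` does not raise an exception for HTTP error statuses such as 403, so the `print` inside `try` runs even when the server refuses the request. A minimal sketch of checking the status explicitly is shown here. `describe_login` is a hypothetical helper, not part of `requests`, and it takes a plain integer so it can be tried without any network call. It mirrors the "below 400" rule that `Response.ok` uses; a non-error status alone still does not prove the session is authenticated.

```python
from http import HTTPStatus


def describe_login(status_code: int) -> str:
    """Turn an HTTP status code into a short verdict (hypothetical helper)."""
    # Same threshold as requests' Response.ok: anything below 400 is not an error.
    if status_code < 400:
        return "No HTTP error reported"
    # HTTPStatus supplies the standard reason phrase, e.g. 403 -> "Forbidden".
    try:
        reason = HTTPStatus(status_code).phrase
    except ValueError:
        reason = "Unknown status"
    return f"Login failed: {status_code} {reason}"


print(describe_login(403))
```

With a real response this would be called as `describe_login(p.status_code)`. Alternatively, calling `p.raise_for_status()` inside the `try` block turns 4xx and 5xx responses into exceptions.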

【Comments】:

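How to do it the right way? Most likely by using the API you have already found. Many websites have anti-scraping mechanisms that stop scripts from loading their pages. I strongly recommend using the API instead of BeautifulSoup. You also risk getting blocked.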

Tags: python web-scraping beautifulsoup linkedin http-status-code-403

【Solution 1】:

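You don't actually have to reinvent the wheel. There is a module called, surprise, surprise, linkedin-api, which accesses various LinkedIn data (including jobs) through the so-called Voyager service.

Example usage: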

from linkedin_api import Linkedin

# Authenticate using any Linkedin account credentials
api = Linkedin('reedhoffman@linkedin.com', '*******')

# GET a profile
profile = api.get_profile('billy-g')

# GET a profile's contact info
contact_info = api.get_profile_contact_info('billy-g')

# GET 1st degree connections of a given profile
connections = api.get_profile_connections('1234asc12304')

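I'm sharing this because you will probably have a hard time scraping LinkedIn with plain old BeautifulSoup and requests. Also, a word of caution: do not use your personal account for any scraping activity on LinkedIn.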
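
Since the question is about job offers rather than profiles, a job search through the same module might look like the sketch below. It assumes the installed linkedin-api version exposes `Linkedin.search_jobs` with `keywords`, `location_name`, and `limit` parameters, which should be checked against that version's documentation. `find_jobs` and `FakeClient` are hypothetical names introduced here; the fake client only stands in for a logged-in `Linkedin` object so the snippet runs without credentials or network access.

```python
from typing import Any


def find_jobs(api: Any, keywords: str, location: str, limit: int = 10) -> list:
    """Ask a linkedin_api.Linkedin-like client for job postings.

    Assumption: `api.search_jobs` accepts `keywords`, `location_name`, `limit`.
    """
    return api.search_jobs(keywords=keywords, location_name=location, limit=limit)


class FakeClient:
    """Offline stand-in for linkedin_api.Linkedin (hypothetical, for illustration)."""

    def search_jobs(self, keywords=None, location_name=None, limit=-1):
        # Placeholder records shaped by this sketch, not by LinkedIn's real data.
        hits = [
            {"keywords": keywords, "location": location_name, "index": i}
            for i in range(3)
        ]
        return hits if limit < 0 else hits[:limit]


jobs = find_jobs(FakeClient(), "data analyst", "Canada", limit=2)
print(len(jobs))
```

With real credentials, `FakeClient()` would be replaced by a `Linkedin(...)` object created as in the example above, ideally for an account that is not your personal one.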
