Python Selenium, Scraping LinkedIn: Looping through Work and Education Histories

[Question title]: Python Selenium, Scraping LinkedIn: Looping through Work and Education Histories
[Posted]: 2021-08-27 21:35:25
[Question description]:

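I am using Selenium in Python to scrape data from LinkedIn profiles. It mostly works, but I can't figure out how to extract the information for each employer or school in the history sections.

I am following this tutorial: https://www.linkedin.com/pulse/how-easy-scraping-data-from-linkedin-profiles-david-craven/

I am looking at this profile: https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk

Here is a partial snippet of the HTML section I am struggling with: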

<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
  <h2 class="pv-profile-section__card-heading">
    Experience
  </h2>

<!----></header>

  <ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&amp;v=beta&amp;t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
  <h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Wagestream
        <span class="pv-entity__secondary-title separator">Full-time</span>
  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates Employed</span>
      <span>Apr 2021 – Present</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item-v2">3 mos</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Location</span>
    <span>London, England, United Kingdom</span>
  </h4>
<!---->
</div>

</a>
<!---->    </div>

<!---->  </div>
</section>

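This is followed by more "li" sections. So the whole history section can be identified by id="experience-section", and the work (as opposed to education) history can be identified in the "ul" section with class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more". The information for the first job in the list can be identified by the "li" section with id="ember136".

I am trying to get the job title, company, years worked, and so on from this section, but I don't know how to do it. Here is some Python code showing what I have tried (skipping my login):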

from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests

path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

# driver.get() navigates to the page at the given URL
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')

text=driver.page_source
sel = Selector(text) 

# Using the "Copy xPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()  

# This will give me all of the text in the Work Experience section
html_list = driver.find_element_by_id("experience-section")
items = html_list.find_elements_by_tag_name("ul")
items = html_list.find_elements_by_tag_name("h3")
for item in items:
    print(type(item))
    text = item.text
    print(text)

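But these approaches do not work for automatically and systematically extracting the information from every job in a profile. What I would like to do is loop through the "li" sections within each "ul" section and, within each "li" section, extract only the company name, class="pv-entity__secondary-title t-14 t-black t-normal". But find_element_by_class_name only yields NoneTypes.

I am not sure how, conceptually, to use Selenium to produce an iterable list of the "ul" and "li" elements and, on each iteration, extract specific bits of text by class name.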

Tags: python html selenium web-scraping linkedin

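[Solution 1]:

Here is the solution I came up with. I should point out that I "cross-posted" this in the YouTube comments of the following tutorial: https://www.youtube.com/watch?v=W4Md-koupmE

Run the whole code, but substitute your own email and password.

First, open the browser, log in to LinkedIn, and navigate to the relevant profile: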

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from time import sleep

# Path to the chromedriver.exe
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

driver.get('https://www.linkedin.com')

# Log into LinkedIn
username = driver.find_element_by_id('session_key')
username.send_keys('mail@mail.com')

sleep(0.5)

password = driver.find_element_by_id('session_password')
password.send_keys('password')

sleep(0.5)

log_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
log_in_button.click()

sleep(3)

# The example profile I am trying to scrape
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
sleep(3)

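If I start trying to extract content right away, I get an error. It turns out I need to scroll down to the relevant section so that it loads; otherwise none of the data is created: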

# The experience section doesn't load until you scroll to it, this will scroll to the section
l= driver.find_element_by_xpath('//*[@id="oc-background-section"]')
driver.execute_script("arguments[0].scrollIntoView(true);", l)
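
If the fixed sleeps turn out to be too short or too long, one option is to poll until the section actually shows up. Selenium provides WebDriverWait for this purpose; the sketch below shows the same polling idea in plain Python so it can be tried without a browser. The helper name wait_until and its parameters are made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# With a live driver (not run here), the scroll step could be followed by:
#     wait_until(lambda: driver.find_elements("id", "experience-section"))
# "id" is the string value of By.ID; find_elements returns an empty list
# (which is falsy) while the section has not been rendered yet.
```

Because find_elements returns a list rather than raising, it works as a condition without any try/except.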

To loop through the work experience, I first identify its "id" value, which in this case is "experience-section". Grab it with the "find_element_by_id" method.

# Get stuff in work experience section
html_list = driver.find_element_by_id("experience-section")

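This section contains a list of "li" elements (that is, elements with the tag name "li"), each of which holds all of the information for one past job. Use "find_elements_by_tag_name" to create a list of these WebElement objects.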

# Jobs listed as li sections, create list of li 
items = html_list.find_elements_by_tag_name("li")
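
The two steps above could also be wrapped in a small function so that the same lookup can be reused for other sections. This is only a sketch: section_items is a hypothetical helper, "education-section" is an assumed id that does not appear in the HTML shown in the question, and the generic find_elements(by, value) call is used with the plain strings "id" and "tag name", which are the values of By.ID and By.TAG_NAME.

```python
def section_items(driver, section_id):
    """Return the li elements inside the section with the given id.

    Returns an empty list when the section is not on the page, instead of
    raising the way find_element would.
    """
    sections = driver.find_elements("id", section_id)
    if not sections:
        return []
    return sections[0].find_elements("tag name", "li")


# With a live driver (not run here):
#     jobs = section_items(driver, "experience-section")
#     schools = section_items(driver, "education-section")  # assumed id
```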

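Looking at the source, I noticed that the employer name, for example, can be identified by the tag "p". This produces a list, which sometimes contains more than one item. Make sure to select the one you need: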

x = items[0].find_elements_by_tag_name("p")
print(x[0].text)
# "Company Name"
print(x[1].text)
# "Wagestream Full-time"
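
As an alternative to indexing into the list of "p" tags, the company name could probably be targeted by its class. A class-name locator expects a single class, so the space-separated class string from the question cannot be passed to it as-is; a CSS selector that joins the classes with dots is one way around that. The helper below is hypothetical and only builds the selector string.

```python
def class_attr_to_css(tag, class_attr):
    """Turn a tag name and a space-separated class attribute into a CSS selector."""
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


selector = class_attr_to_css("p", "pv-entity__secondary-title t-14 t-black t-normal")
print(selector)

# With a live driver (not run here), each li could then be queried with:
#     item.find_elements("css selector", selector)
# "css selector" is the string value of By.CSS_SELECTOR.
```

Using only the most specific class, "p.pv-entity__secondary-title", may be enough and is less likely to break if the styling classes change.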

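Finally, loop through the "li" sections, extract the relevant elements, pull out the strings, and print the desired information (or save it as rows in a CSV):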

# Loop through li list, extract each piece by tag name
for item in items:
    name_job = item.find_elements_by_tag_name("h3")
    name_emp = item.find_elements_by_tag_name("p")
    more = item.find_elements_by_tag_name("h4")
    job = name_job[0].text
    emp = name_emp[1].text
    # Each h4 text is a hidden label line (e.g. "Dates Employed") followed by the value; keep the value
    yrs = more[0].text.split('\n')[1]
    loc = more[2].text.split('\n')[1]
    
    print(job)
    print(emp)
    print(yrs)
    print(loc)

# terminates the application
driver.quit()
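
The loop above assumes every job has a title, a second "p" tag, and at least three "h4" tags, so it may stop with an IndexError on entries that lack, say, a location. Below is a more defensive sketch that also covers the CSV option mentioned earlier. The names value_after_label, extract_job, write_jobs_csv, and FIELDS are hypothetical, and the generic find_elements("tag name", ...) form is used because recent Selenium 4 releases no longer seem to ship the find_elements_by_* helpers.

```python
import csv
import io


def value_after_label(text):
    """Return the last non-empty line, dropping a hidden label line if present."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return lines[-1] if lines else ""


def extract_job(item):
    """Build one row from an li element; parts that are missing become ''."""
    h3 = item.find_elements("tag name", "h3")
    p = item.find_elements("tag name", "p")
    h4 = item.find_elements("tag name", "h4")
    return {
        "job": h3[0].text if h3 else "",
        "employer": p[1].text if len(p) > 1 else "",
        "dates": value_after_label(h4[0].text) if h4 else "",
        "location": value_after_label(h4[2].text) if len(h4) > 2 else "",
    }


FIELDS = ["job", "employer", "dates", "location"]


def write_jobs_csv(rows, fileobj):
    """Write the rows to an open text file object, with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# With a live driver this would be: rows = [extract_job(item) for item in items]
# Here a hand-written row stands in for scraped data.
sample = {
    "job": "Senior Software Engineer",
    "employer": "Wagestream Full-time",
    "dates": "Apr 2021 – Present",
    "location": "London, England, United Kingdom",
}
buffer = io.StringIO()
write_jobs_csv([sample], buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with open("jobs.csv", "w", newline="", encoding="utf-8"); the file name is just an example. extract_job has to run before driver.quit(), since the elements are no longer usable once the browser is closed.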
