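[Posted]: 2021-08-27 21:35:25
[Question]:
I am using Selenium in Python to scrape data from LinkedIn profiles. Most of it works, but I cannot work out how to extract the information for each employer or school in the history sections.
I am following this tutorial: https://www.linkedin.com/pulse/how-easy-scraping-data-from-linkedin-profiles-david-craven/
I am looking at this profile: https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk
Here is a partial snippet of the HTML I am struggling with: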
<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
<h2 class="pv-profile-section__card-heading">
Experience
</h2>
<!----></header>
<ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view"> <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view"> <div class="display-flex justify-space-between full-width">
<div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view"> <div class="pv-entity__logo company-logo">
<img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&v=beta&t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
<h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
<p class="visually-hidden">Company Name</p>
<p class="pv-entity__secondary-title t-14 t-black t-normal">
Wagestream
<span class="pv-entity__secondary-title separator">Full-time</span>
</p>
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>Apr 2021 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">3 mos</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>London, England, United Kingdom</span>
</h4>
<!---->
</div>
</a>
<!----> </div>
<!----> </div>
</section>
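This is followed by more "li" elements. So the whole history section can be identified by id="experience-section", and the work (as opposed to education) history can be identified by the "ul" element with class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more". The information for the first job in the list can be identified by the "li" element with id="ember136".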
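I am trying to get the job title, company, years worked, and so on from this part, but I don't know how. Here is some Python code showing what I have tried (skipping my login):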
from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
# driver.get() navigates to the page at the given URL
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
text=driver.page_source
sel = Selector(text)
# Using the "Copy xPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()
# This will give me all of the text in the Work Experience section
stuff = driver.find_element_by_id("experience-section")
# Search within the experience section element located above
items = stuff.find_elements_by_tag_name("ul")
items = stuff.find_elements_by_tag_name("h3")
for item in items:
    print(type(item))
    text = item.text
    print(text)
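But these approaches do not work for automatically and systematically extracting the information for every job in a profile. What I would like to do is loop over the "li" elements in each "ul" element and, within each "li", extract only the company name, class="pv-entity__secondary-title t-14 t-black t-normal". But find_element_by_class_name only produces NoneTypes.
Conceptually, I am not sure how to use Selenium to produce an iterable list of the "ul" and "li" elements and, on each iteration, extract specific bits of text by class name.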
[Discussion]:
Tags: python html selenium web-scraping linkedin