【Question title】: If condition to ignore xpath when it is missing in one instance of the list
【Posted】: 2021-09-10 02:20:07
【Question】:

I am currently trying to scrape the LinkedIn jobs page with this code:

# importing packages
import pandas as pd
import re

from bs4 import Tag, NavigableString, BeautifulSoup
from datetime import date, timedelta, datetime
from IPython.core.display import clear_output
from random import randint
from requests import get
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
from time import time
start_time = time()

from warnings import warn

# replace variables here.
url = "https://www.linkedin.com/jobs/search?keywords=&location=Egypt&geoId=&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&sortBy=DD"
no_of_jobs = 25

# this will open up new window with the url provided above 
driver = webdriver.Chrome()
driver.get(url)
sleep(3)
action = ActionChains(driver)


# to show more jobs. Depends on number of jobs selected
i = 2
while i <= (no_of_jobs/25): 
    driver.find_element_by_xpath('/html/body/main/div/section/button').click()
    i = i + 1
    sleep(5)

# parsing the visible webpage
pageSource = driver.page_source
lxml_soup = BeautifulSoup(pageSource, 'lxml')

# searching for all job containers
job_container = lxml_soup.find('ul', class_ = 'jobs-search__results-list')

print('You are scraping information about {} jobs.'.format(len(job_container)))


# setting up list for job information
job_id = []
post_title = []
company_name = []
post_date = []
job_location = []
job_desc = []
level = []
emp_type = []
functions = []
industries = []

# for loop for job title, company, id, location and date posted
for job in job_container:

    if not isinstance(job, Tag):
        continue
    # job title
    job_titles = job.find("h3", class_="base-search-card__title").text
    post_title.append(job_titles)
    
    # linkedin job id
    job_ids = job.find('a', href=True)['href']
    job_ids = re.findall(r'(?!-)([0-9]*)(?=\?)',job_ids)[0]
    job_id.append(job_ids)
    
    # company name
    company_names = job.select_one('img')['alt']
    company_name.append(company_names)
    
    # job location
    job_locations = job.find("span", class_="job-search-card__location").text
    job_location.append(job_locations)
    
    # posting date
    post_dates = job.select_one('time')['datetime']
    post_date.append(post_dates)

# for loop for job description and criterias
for x in range(1,no_of_jobs):
    
        
    # clicking on different job containers to view information about the job

    job_xpath = '/html/body/div[3]/div/main/section/ul/li[{}]'.format(x)
    driver.find_element_by_xpath(job_xpath).click()
    sleep(3)
    
    # job description
    jobdesc_xpath = '/html/body/div[3]/div/section/div[2]/section[2]/div'
    job_descs = driver.find_element_by_xpath(jobdesc_xpath).text
    job_desc.append(job_descs)
    
    # job criteria container below the description
    job_criteria_container = lxml_soup.find('ul', class_ = 'description__job-criteria-list')
    all_job_criterias = job_criteria_container.find_all("ul", class_='description__job-criteria-list')
    
    # Seniority level
    seniority_xpath = '/html/body/div[3]/div/section/div[2]/section[2]/ul/li[1]/span'
    seniority = driver.find_element_by_xpath(seniority_xpath).text
    level.append(seniority)
    
    # Employment type
    type_xpath = '/html/body/div[3]/div/section/div[2]/section[2]/ul/li[2]/span'
    employment_type = driver.find_element_by_xpath(type_xpath).text
    emp_type.append(employment_type)
    
    # No Applicants
    function_xpath = 'num-applicants__caption'
    No_Applicants = driver.find_element_by_class_name(function_xpath).text
    functions.append(No_Applicants)
    
    # Industries
    industry_xpath = '/html/body/div[3]/div/section/div[2]/section[2]/ul/li[4]/span'
    industry_type = driver.find_element_by_xpath(industry_xpath).text
    industries.append(industry_type)
    
    x = x+1

# to check if we have all information
print(len(job_id))
print(len(post_date))
print(len(company_name))
print(len(post_title))
print(len(job_location))
print(len(job_desc))
print(len(level))
print(len(emp_type))
print(len(functions))
print(len(industries))

The URL I am scraping is:

https://www.linkedin.com/jobs/search?keywords=&location=Egypt&geoId=&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&sortBy=DD

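In the second for loop, where I iterate over the job criteria, some jobs on LinkedIn have no employment type or industry entered. When the loop reaches a list item that contains them, it works fine, but when it reaches a list item that lacks one of those elements, it raises an element-not-found error. How can I write an if condition so that, when the employment type or industry is not found in a list item, it is ignored and the loop continues to the next one?

【Question comments】:

    Tags: python selenium xpath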


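    【Solution 1】:

    How can I write an if condition so that, when the employment type or industry is not found in a list item, it is ignored and the loop continues to the next one?

    There are several ways to do this, though posting the error output would help diagnose the problem. Since you specifically want a way to ignore the exception, try Selenium's error handling; from your description, I believe you are getting a NoSuchElementException: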

    from selenium.common.exceptions import NoSuchElementException

    try:
        # The line that raises the error (wrapping the entire loop is not recommended).
        industry_type = driver.find_element_by_xpath(industry_xpath).text
    except NoSuchElementException:
        industry_type = None  # Placeholder for a missing element; you could also log it here.
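
Because the question appends every field to its own list, silently skipping a missing element would leave the lists with different lengths. One way to avoid that is to wrap the lookup in a small helper that returns a placeholder instead. This is only a sketch: `text_or_default`, `FakeDriver`, and `FakeElement` are hypothetical names introduced for illustration, and the fake driver exists solely so the example runs without a browser.

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:
    # Fallback so this sketch runs without Selenium installed (illustration only).
    class NoSuchElementException(Exception):
        pass


def text_or_default(find, locator, default=None):
    """Return find(locator).text, or `default` when the element is missing."""
    try:
        return find(locator).text
    except NoSuchElementException:
        return default


# Hypothetical stand-ins for a driver and its elements, used only for the demo.
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_element_by_xpath(self, xpath):
        if xpath not in self._elements:
            raise NoSuchElementException(xpath)
        return FakeElement(self._elements[xpath])


driver = FakeDriver({"//li[1]/span": "Entry level"})

# The first XPath exists, the second does not; both lists still get one entry.
level = [text_or_default(driver.find_element_by_xpath, "//li[1]/span")]
industries = [text_or_default(driver.find_element_by_xpath, "//li[4]/span")]
print(level, industries)
```

With the real driver from the question, the call would look like `level.append(text_or_default(driver.find_element_by_xpath, seniority_xpath))`, so every job contributes exactly one entry per list.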
    

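    You can read more about this particular exception, and many others, in the Selenium documentation. Note that if the error does not come from Selenium, you can also catch all errors with a bare 'except:'.

    【Comments】: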
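
If a literal "if condition" is preferred over exception handling, another of the possible approaches is to use Selenium's plural `find_elements` lookups, which return an empty list rather than raising when nothing matches. The sketch below is an assumption-laden illustration: `first_text_or_default` and `fake_find_elements` are hypothetical names, and the fake lookup stands in for a real driver so the example runs on its own.

```python
from types import SimpleNamespace


def first_text_or_default(find_all, locator, default=None):
    """Return the text of the first match, or `default` if there are no matches."""
    matches = find_all(locator)
    if matches:  # an empty list means the element is absent on this page
        return matches[0].text
    return default


# Hypothetical stand-in for driver.find_elements_by_xpath, used only for the demo.
def fake_find_elements(xpath):
    page = {"//li[2]/span": [SimpleNamespace(text="Full-time")]}
    return page.get(xpath, [])


emp_type = first_text_or_default(fake_find_elements, "//li[2]/span")
industry = first_text_or_default(fake_find_elements, "//li[4]/span", default="")
print(repr(emp_type), repr(industry))
```

With the question's driver, `find_all` could be `driver.find_elements_by_xpath`; in Selenium 4 the equivalent would be `lambda xp: driver.find_elements(By.XPATH, xp)` with `By` imported from `selenium.webdriver.common.by`.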
