【Question title】: How to get full job descriptions from Indeed using Python and BeautifulSoup
【Posted】: 2021-08-02 21:27:48
【Question】:

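I need to scrape job postings from Indeed. I managed to scrape the title and link of each job, and am now struggling to scrape the full job description for each one (I don't want the summary; I want the full description of every job).

My code looks like this: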

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

url = 'https://www.indeed.com/jobs?q=data+engineer&l=United+States'

response = requests.get(url) 
data = response.text
soup = BeautifulSoup(data, 'html.parser')
jobs = soup.find_all('div',{'class':'jobsearch-SerpJobCard'})
  
for job in jobs:
    title = job.find('a',{'class':'jobtitle'}).text
    link1 = job.find('a',{'class':'jobtitle'}).get('href')
    link = 'https://www.indeed.com' + link1
    
    #for each JOB's webpage, you need to connect to the link first:
    job_response = requests.get(link)
    job_data = response.text
    job_soup = BeautifulSoup(job_data, 'html.parser')
    
    job_description_tag = job_soup.find('div',{'id':'jobDescriptionText'})
    job_description = job_description_tag.text if job_description_tag else "N/A"
    
    print('Job Title:', title, '\nLink:', link, '\nJob Description:', job_description, '\n---')
    

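I want the full job description for each posting and tried to get it with job_description_tag, but it only returns "N/A" (I put the if statement there to handle postings that have no job description).

The output for every job posting is "N/A", so something is clearly wrong.

For reference, inspecting the job description of one of the postings shows that the HTML of the job_description_tag I am trying to scrape looks like this: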

<div id="jobDescriptionText" class="jobsearch-jobDescriptionText"><ul>
<li>1+ years of experience as a Data Engineer or in a similar role</li>
<li>Experience with data modeling, data warehousing, and building ETL pipelines</li>
<li>Experience in SQL</li>
<li>Knowledge of python or any general purpose scripting language.</li>
<li>Experience with Big Data technologies such as Hive/Spark.</li>
</ul>
Mission Statement
<br>The core mission of Amazon Web Services (AWS) Marketing is to educate customers about cloud computing and our services. Millions of customers engage with us every day across multiple channels. Imagine building a platform that enables AWS to speak to engineers, CTOs, CIOs, and CEOs, educate them about AWS services, and empower them on their journey to the cloud. Our services act as the foundation for announcing new AWS products and are uniquely positioned to redefine how our cloud community consumes information and engages with AWS.
<br><br>
Overview
<br>Would you like to support increasing customer base and the revenue for AWS, a market-leading cloud offering? Would you like to be part of a team focused on increasing awareness and adoption of the AWS platform by analyzing customer's behavior on and outside AWS websites? Do you want to empower our AWS marketing team make data-driven decisions that further establish AWS as leader in the cloud computing world?
<br><br>
As a Data Engineer at AWS, you will be working in a large, extremely complex and dynamic data warehousing environment. We are looking for someone with the uncanny ability to integrate multiple heterogeneous data sources like Adobe Site Catalyst, Adobe Target, Sales Force, Adobe Connect with AWS central data warehouse and build efficient, flexible, and scalable data warehouse and reporting solutions. You should be enthusiastic about learning new technologies and be able to implement solutions using these technologies to enable upgrades of the existing platform. You should have excellent business and communication skills and be able to work with business owners to develop and define key business questions, then build the data sets that answer those questions. You should be expert at designing, implementing, and operating stable, scalable, low cost solutions to flow data from production systems into the data warehouse and into end-user facing reporting applications. Above all you should be passionate about working with huge data sets and someone who loves to bring datasets together to answer business questions and drive growth.
<br><br>
At AWS, you have control over every layer you build. Instead of owning a small slice of an existing service, you will own a core segment of a growing marketing platform serving 1000s of internal customers and millions of external customers. You will build on multiple AWS services and have opportunities to engage directly with those teams to improve our core offerings. At AWS, we work with our customers on a daily basis to prove out our ideas, gather feedback, and improve the platform.
<br><br>
<b>Location:</b> This position must sit in Seattle, WA. Relocation assistance offered from within the US.
<br><br>
<ul>
<li>Graduate/Master degree in Computer Science, Engineering or related technical field.</li>
<li>Exceptional troubleshooting and problem-solving abilities.</li>
<li>Experience with Amazon Redshift or other distributed computing technology.</li>
<li>Industry experience as a Data Engineer or related specialty (e.g., Software Engineer, Business Intelligence Engineer, Data Scientist) with a track record of manipulating, processing, and extracting value from large datasets.</li>
<li>Experience with AWS Tools and Technologies.</li>
<li>Hands-on experience with cloud computing and UNIX/Linux based systems.</li>
<li>Demonstrated ability to work effectively across various internal organizations.</li>
<li>Excellent written and verbal communications skills.</li>
</ul>
Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, please visit https://www.amazon.jobs/en/disability/us.</div>

Any help would be greatly appreciated!

【Comments】:

Tags: python html css web-scraping beautifulsoup


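【Solution 1】:

Your code is almost correct. The only mistake is in this line, which reads the text of the search-results response (response) instead of the job page response (job_response):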

job_data = response.text

Replace it with:

job_data = job_response.text
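
One way to make this kind of mix-up less likely is to keep fetching and parsing in separate functions, so each function only ever sees one response. The following is a minimal sketch, not the asker's code: `extract_description` and `fetch_description` are hypothetical helper names, and `session` is assumed to be a `requests.Session` (or any object with a compatible `get` method). The sample HTML is hand-written for the offline check and is not real Indeed markup.

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the text of the #jobDescriptionText div, or "N/A" if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", {"id": "jobDescriptionText"})
    return tag.get_text(separator="\n", strip=True) if tag else "N/A"


def fetch_description(session, link):
    """Download one job page and parse the HTML of that same response."""
    job_response = session.get(link)
    return extract_description(job_response.text)


# Offline check with a hand-written snippet (assumption: not real Indeed markup).
sample = '<div id="jobDescriptionText"><ul><li>Experience in SQL</li></ul>Overview</div>'
print(extract_description(sample))
# A page without the div, such as the search-results page, falls back to "N/A".
print(extract_description("<div>search results page</div>"))
```

In the original loop this would be called as `fetch_description(requests.Session(), link)`, which leaves no second response variable in scope to confuse it with.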

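【Comments】:

  • Thank you very much! I completely missed that. I just ran the code and it produced exactly the output I needed. Thanks a lot!
  • @TheunisRaubenheimer If you can, could you mark the answer as accepted?
【Solution 2】: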

The full job description is loaded from an external URL. This example shows how to load it:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}


url = "https://www.indeed.com/jobs?q=data+engineer&l=United+States"
api_url = "https://www.indeed.com/viewjob?viewtype=embedded&jk={job_id}"

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

for job in soup.select('a[id^="job_"]'):
    job_id = job["id"].split("_")[-1]
    s = BeautifulSoup(
        requests.get(api_url.format(job_id=job_id), headers=headers).content,
        "html.parser",
    )

    print(s.title.get_text(strip=True))
    print()
    print(
        s.select_one("#jobDescriptionText").get_text(strip=True, separator="\n")
    )
    print("#" * 80)

Prints:

Data Engineer - Remote - Indeed.com

A typical day for this role would be working with stake holders, designing, creating or troubleshooting REST APIs, and managing data intake from MLS boards. We manage day to day tasks and projects that support our staff and business needs.
RE/MAX, LLC is looking for a Data expert and API Developer to support the implementation an APIs across the Enterprise. This role will ensure the solutions are optimized for delivering critical data in a highly availability, real time environment. The API Developer will create, maintain and enhance APIs with an enterprise micro-services lens. Success in this role is bridging the demand for integration with the best fit solution for all stakeholders. This role will also work with the data aggregation team to map data from MLS boards to a common schema.
Essential Duties:
Develop solutions from business and technical data requirements
Support existing integration/API solutions
Map and maintain MLS board data feeds
Ensure barriers to timely completion of deliverables are anticipated and overcome
Build and foster strong relationships with all levels of technical and non-technical audiences
Work effectively and collaboratively with internal and external stakeholders to ensure timely delivery of implementation
Enhance data integration services for the overall benefit of sustainability and usability
Address problems, change, and/or challenges quickly and enthusiastically
Qualifications & Skills:
Data mapping and/or data warehousing
Preferred Real Estate background, manage IDX, IDX plus and VOW data feeds
Understanding of JSON documents
Experience with AWS
Familiar NodeJS and/or Go programming language
Has worked with Elasticsearch
Design and implementation of technology solutions
Collaborative and creative in working in new environments/markets
Effective working on a technical team
Informing senior developers on requirements and development
Analytical problem solver
Quickly build credibility and trust with internal customers
Demonstrate work ethic based on a strong desire to exceed
Highly self-motivated and directed, with keen attention to detail
Proven analytical and creative problem-solving abilities
Strong customer service orientation
Experience working in a team-oriented, collaborative environment
Now is your chance to become part of a world-class, industry leading organization that touts the #1 real estate brand in the world! RE/MAX is a business that builds businesses. We, alongside booj, our award-winning technology company, specialize in providing the tools, training and tech to our real estate network, which includes RE/MAX and Motto Mortgage franchises, agents, brokers and consumers. Join us and build a career where your contribution is heard, your innovative ideas are valued, and hard work and collaboration truly makes a difference.
RE/MAX Holdings, Inc is an equal opportunity employer committed to diversity and inclusion, as well as non-discrimination in employment. All qualified applicants receive consideration without regard to race, color, religion, gender, sexual orientation, national origin, age, veteran status, disability unrelated to performing the essential task of the job or other legally protected categories. All persons shall be afforded equal employment opportunity. #LI-MP1
################################################################################
Data Engineer - New York, NY 10006 - Indeed.com

Project Description:
We are building out a large AI-based anti-fraud platform, which is provided to clients on a SaaS platform. We are expanding due to the success of the product, and the size of the client pipeline. Every 

...and so on.
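
The ID handling in the loop above can be tried without network access. This is a minimal sketch on a hand-written HTML fragment: the fragment, the sample IDs, and the helper names `job_ids` and `description_or_default` are illustrative assumptions, and Indeed's real markup may differ or may have changed since this answer was written. It also guards against `select_one` returning `None`, which would otherwise raise an `AttributeError` inside the loop.

```python
from bs4 import BeautifulSoup

API_URL = "https://www.indeed.com/viewjob?viewtype=embedded&jk={job_id}"


def job_ids(search_html):
    """Collect job IDs from anchors whose id attribute starts with "job_"."""
    soup = BeautifulSoup(search_html, "html.parser")
    return [a["id"].split("_")[-1] for a in soup.select('a[id^="job_"]')]


def description_or_default(job_html, default="N/A"):
    """Return the description text, or a default when the div is missing."""
    soup = BeautifulSoup(job_html, "html.parser")
    tag = soup.select_one("#jobDescriptionText")
    if tag is None:
        return default
    return tag.get_text(strip=True, separator="\n")


# Hand-written fragment (assumption: stands in for a search-results page).
search_html = """
<a id="job_abc123" href="#">Data Engineer</a>
<a id="sj_def456" href="#">Sponsored</a>
<a id="job_789xyz" href="#">Senior Data Engineer</a>
"""
ids = job_ids(search_html)
urls = [API_URL.format(job_id=i) for i in ids]
print(ids)
print(urls[0])
```

Only anchors whose `id` begins with `job_` are matched, so the second anchor in the fragment is skipped.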

【Comments】:
