【Question title】: Python requests.get not returning text in one of the tags in html document
【Posted】: 2020-09-10 06:48:03
【Question】:

I'm trying to parse job descriptions from Djinni for a personal project. I'm using Python 3.6, BeautifulSoup4 and the requests library. When I fetch the html of a vacancy page with requests.get, the html it returns is missing the most important part: the description text. For example, take the url of this page - example and the following code I wrote:

def scrape_job_desc(self, url):
    job_desc_html = self._get_search_page_html(url)
    soup = BeautifulSoup(job_desc_html, features='html.parser')
    try:
        short_desc = str(soup.find('p', {'class': 'job-teaser svelte-a3rpl2'}).getText())
        full_desc = soup.find('div', {'class': 'job-description-wrapper svelte-a3rpl2'}).find('p').getText()
    except AttributeError:
        short_desc = None
        full_desc = None
    return short_desc, full_desc

def _get_search_page_html(self, url):
    html = requests.get(url=url, headers={'User-Agent': 'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
    return html.text

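It returns short_desc but not full_desc. What's more, the text of the <p> tag I need isn't present in the html at all. But when I download the page with a browser, it's there. What causes this?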

【Comments】:

    Tags: python parsing beautifulsoup python-requests


    【Solution 1】:

    The full description of the job is stored in the page as a JavaScript variable. You can extract it with selenium, or with the re module:

    import re
    import requests
    from bs4 import BeautifulSoup
    
    
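    url = 'https://djinni.co/jobs2/144172-data-scientist'
    html_data = requests.get(url).text
    
    # The full description lives in an inline JavaScript variable, not in the markup.
    full_desc = re.search(r'fullDescription:"(.*?)",', html_data).group(1).replace(r'\r\n', '\n')
    # The teaser is ordinary HTML, so BeautifulSoup can select it directly.
    short_desc = BeautifulSoup(html_data, 'html.parser').select_one('.job-teaser').get_text()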
    
    print(short_desc)
    print('-' * 80)
    print(full_desc)
    

    Prints:

    Together Networks is looking for an experienced Data Scientist to join our Agile team. Together Networks is a worldwide leader in the online dating niche with millions of users across more than 45 countries. Our brands are BeNaughty, CheekyLovers, Flirt, Click&Flirt, Flirt Spielchen.
    --------------------------------------------------------------------------------
    What you get to deal with:
    
    - Active collaboration with stakeholders throughout the organization;
    - User experience modelling;
    - Advanced segmentation;
    - User behavior analytics;
    - Anomaly detection, fraud detection;
    - Looking for bottlenecks;
    - Churn prediction.
     
    
    You need to have (required):
    
    - Master's or PHD in Statistics, Mathematics, Computer Science or another quantitative field;
    - 2+ years of experience manipulating data sets and building statistical models;
    - Strong knowledge in a wide range of machine learning methods and algorithms for classification, regression, clustering, and others;
    - Knowledge and experience in statistical and data mining techniques;
    - Experience using statistical computer languages (Python, SLQ, etc.) to manipulate data and draw insights from large data sets.
    - Knowledge of a variety of machine learning techniques and their real-world advantages\u002Fdrawbacks;
    - Experience visualizing\u002Fpresenting insights for stakeholders;
    - Independent, creative thinking, and ability to learn fast.
    
    Would be a great plus:
    
    - Experience dealing with end to end machine learning projects: data exploration, feature engineering\u002Fdefinition, model building, production, maintenance;
    - Experience in data visualization with Tableau;
    - Experience in dating, game dev, social projects.
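
The output above still contains raw JavaScript escapes such as \u002F. A minimal sketch of one way to decode them follows, assuming the captured text uses only escapes that are also valid in a JSON string (\uXXXX, \r, \n, \" and so on). SAMPLE_HTML, FULL_DESC_RE and extract_full_description are hypothetical names for illustration, and the sample markup is made up rather than copied from the real page. The pattern also stops at the first unescaped quote instead of relying on a trailing comma.

```python
import json
import re

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE_HTML = r'<script>var job = {fullDescription:"Pros\u002Fcons:\r\n- say \"hi\"",id:1};</script>'

# Capture everything up to the first unescaped double quote.
FULL_DESC_RE = re.compile(r'fullDescription:"((?:[^"\\]|\\.)*)"')


def extract_full_description(html):
    """Return the decoded description, or None if the variable is absent."""
    match = FULL_DESC_RE.search(html)
    if match is None:
        return None
    # Re-wrap the captured text in quotes and parse it as a JSON string,
    # which resolves \uXXXX, \r, \n and \" in one step.
    return json.loads('"' + match.group(1) + '"')


print(extract_full_description(SAMPLE_HTML))
```

Returning None when the variable is missing avoids the AttributeError that calling .group(1) on a failed search would raise. JavaScript string literals allow a few escapes that JSON does not (for example \x41), so json.loads may reject some inputs.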
    

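    【Comments】:

      【Solution 2】:

      This is a typical mistake when scraping web pages.

      You probably looked at the source of the rendered HTML in your browser and tried to get the text of the p inside the job-description-wrapper div.

      However, if you load just the page itself (the first request the browser makes) and look at its content, you'll find that the paragraph isn't populated initially. A script fills in its content afterwards, but this happens so fast that as a user you hardly notice it.

      Check the output of: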

      print(requests.get(url='https://djinni.co/jobs2/144172-data-scientist').text)
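
Rather than reading the whole dump by eye, a small check can report whether the target element exists in the raw response, whether it has any text, and whether a known phrase appears anywhere in the source. This is only a sketch: describe_raw_html and SAMPLE_HTML are hypothetical, and the sample markup is an assumption about the page's shape, not a copy of it.

```python
from bs4 import BeautifulSoup


def describe_raw_html(html, selector, phrase):
    """Report whether `selector` matches and whether `phrase` occurs in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(selector)
    has_text = element is not None and bool(element.get_text(strip=True))
    return {
        "element_found": element is not None,
        "element_has_text": has_text,
        "phrase_in_source": phrase in html,
    }


# Hypothetical markup: an empty paragraph plus the text inside a script.
SAMPLE_HTML = (
    '<div class="job-description-wrapper"><p></p></div>'
    '<script>var job = {fullDescription:"What you get to deal with"};</script>'
)

report = describe_raw_html(
    SAMPLE_HTML, ".job-description-wrapper p", "What you get to deal with"
)
print(report)
```

If the phrase is in the source but the element has no text, the content is most likely embedded in a script and rendered client-side.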
      

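      That is what causes the problem. How to solve it is another matter. One approach is to use a headless browser from your Python code, which runs the page's scripts after loading and lets you grab the content you need only once the page has finished loading everything. You could look at tools such as selenium.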
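
A rough sketch of the headless-browser approach with selenium might look like the following. It assumes Selenium 4 with a recent Chrome and a matching driver available on the machine. fetch_rendered_html is a hypothetical helper, and the selector in the usage comment is taken from the question's code, so it may not match the live page. Waiting for the element to be present is only an approximation of "the page has finished loading".

```python
def fetch_rendered_html(url, wait_selector, timeout=10):
    """Load `url` in headless Chrome and return the DOM once `wait_selector` is present."""
    # Imported here so this module can be loaded where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the scripts have inserted the element, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()


# Example call (needs Chrome and network access, so it is left commented out):
# html = fetch_rendered_html(
#     "https://djinni.co/jobs2/144172-data-scientist",
#     ".job-description-wrapper p",
# )
```

The returned string can be passed to BeautifulSoup exactly as the requests response text was. Starting a browser is much slower than a plain HTTP request, so this is worth it mainly when the data cannot be read from the raw source.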

      【Comments】:
