Answers to: python append() and remove html tags

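【Question title】: python append() and remove html tags
【Posted】: 2017-02-22 14:07:06
【Question】:

I need some help. My output seems to be wrong. How do I correctly append the values of dept, job_title, and job_location? Also, the dept values come with HTML tags attached. How can I remove those tags?

My code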

import requests
from bs4 import BeautifulSoup
from pprint import pprint

response = requests.get("http://hortonworks.com/careers/open-positions/")
soup = BeautifulSoup(response.text, "html.parser")

jobs = []


div_main = soup.select("div#careers_list")


for div in div_main:
    dept = div.find_all("h4", class_="department_title")
    div_career = div.find_all("div", class_="career")
    title = []
    location = []
    for dv in div_career:
        job_title = dv.find("div", class_="title").get_text().strip()
        title.append(job_title)
        job_location = dv.find("div", class_="location").get_text().strip()
        location.append(job_location)

    job = {
        "job_location": location,
        "job_title": title,
        "job_dept": dept
    }
    jobs.append(job)
pprint(jobs)

It should look like this

{'job_dept': 'Consulting',
 'job_location': 'Chicago, IL',
 'job_title': 'Senior Consultant - Central'}

with a single value for each key.

【Comments】:

Please show the output you are getting...
The output shows job_dept: all departments, job_location: all locations, job_title: all titles

Tags: python-3.x beautifulsoup append

【Solution 1】:

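Your HTML structure is sequential, not hierarchical: the department headings and the job entries are siblings inside the same container. So you have to walk through the elements in order and update the current department name whenever you reach a new heading: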

import requests
from bs4 import BeautifulSoup, Tag
from pprint import pprint
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20130331 Firefox/21.0'}
response = requests.get("http://hortonworks.com/careers/open-positions/", headers=headers)

soup = BeautifulSoup(response.text, "html.parser")

jobs = []


div_main = soup.select("div#careers_list")


for div in div_main:
    department_title = ""
    for element in div:
        if isinstance(element, Tag) and "class" in element.attrs:
            if "department_title" in element.attrs["class"]:
                department_title = element.get_text().strip()
            elif "career" in element.attrs["class"]:
                location = element.select("div.location")[0].get_text().strip()
                title = element.select("div.title")[0].get_text().strip()
                job = {
                    "job_location": location,
                    "job_title": title,
                    "job_dept": department_title
                }
                jobs.append(job)

pprint(jobs)
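
To try the same traversal without hitting the network, here is a minimal sketch that runs against a small inline HTML fragment. The fragment, the second department, and the job entries are made up for illustration, and parse_jobs is a hypothetical helper name; only the careers_list id and the class names (department_title, career, title, location) come from the code above.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup mimicking the flat layout: each h4 heading is
# followed by its jobs as siblings, not as children.
SAMPLE_HTML = """
<div id="careers_list">
  <h4 class="department_title">Consulting</h4>
  <div class="career">
    <div class="title">Senior Consultant - Central</div>
    <div class="location">Chicago, IL</div>
  </div>
  <h4 class="department_title">Engineering</h4>
  <div class="career">
    <div class="title">Software Engineer</div>
    <div class="location">Santa Clara, CA</div>
  </div>
</div>
"""


def parse_jobs(html):
    """Return one dict per job, tagged with the most recent department heading."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for container in soup.select("div#careers_list"):
        department_title = ""  # default until the first heading is seen
        for element in container.children:
            if not isinstance(element, Tag):
                continue  # skip the whitespace strings between tags
            classes = element.attrs.get("class", [])
            if "department_title" in classes:
                department_title = element.get_text().strip()
            elif "career" in classes:
                jobs.append({
                    "job_location": element.select("div.location")[0].get_text().strip(),
                    "job_title": element.select("div.title")[0].get_text().strip(),
                    "job_dept": department_title,
                })
    return jobs


jobs = parse_jobs(SAMPLE_HTML)
```

Each dict in jobs then holds plain strings, because get_text() returns only the text of a tag. That is also what removes the HTML tags from the department value.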

【Discussion】:

I get this error when I run this program. if isinstance(element, Tag) and element.attrs.has_key("class"): AttributeError: 'dict' object has no attribute 'has_key'
I have updated my answer so that it works with Python 3.
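
For context, dict.has_key() existed in Python 2 and was removed in Python 3, while the in operator works in both. A tiny sketch, with a made-up attrs dictionary standing in for element.attrs:

```python
# Hypothetical stand-in for element.attrs of an <h4 class="department_title"> tag.
attrs = {"class": ["department_title"]}

# Membership test that works on Python 2 and Python 3.
has_class = "class" in attrs

# On Python 3 a dict no longer has a has_key method at all.
legacy_method_exists = hasattr(attrs, "has_key")
```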
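Wow. Amazing. It works well and the output is correct. I'm using PyCharm. In the "job_dept": department_title part, department_title is highlighted, and the warning says that the name 'department_title' may not be defined.
You have to initialize the department_title variable before you use it. It works in our case because the page has a fixed sequence of HTML tags, but if a div with the career class ever appeared before any tag with the department_title class, there would be an error. So it is better to set department_title to an empty string before entering the second for loop.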
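
A minimal sketch of that point, using a made-up list of (kind, text) pairs in place of the real tags; group_jobs and the sample entries are hypothetical. The first entry is a job that appears before any department heading. Because department_title starts as an empty string, the loop handles it instead of failing with a NameError.

```python
# Hypothetical flat sequence standing in for the tags inside the container.
elements = [
    ("career", "Orphan job"),  # no department heading seen yet
    ("department_title", "Consulting"),
    ("career", "Senior Consultant - Central"),
]


def group_jobs(elements):
    """Attach the most recent department name to each job entry."""
    department_title = ""  # initialized up front, so it is always defined
    jobs = []
    for kind, text in elements:
        if kind == "department_title":
            department_title = text
        elif kind == "career":
            jobs.append({"job_title": text, "job_dept": department_title})
    return jobs


jobs = group_jobs(elements)
```

Here the job listed before any heading simply gets an empty job_dept, and every later job gets the heading that precedes it.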
OK, great. Could you explain this line a bit more: if isinstance(element, Tag) and "class" in element.attrs: This is the first time I have seen isinstance.
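
On the isinstance question, a short sketch that may help; the one-heading HTML string is made up for illustration. Iterating over a tag yields all of its children, and with html.parser that includes the whitespace between tags as NavigableString objects. isinstance(element, Tag) keeps only real tags, and "class" in element.attrs then skips tags that have no class attribute, so the later element.attrs["class"] lookup cannot raise a KeyError.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment: one heading surrounded by newlines.
html = '<div id="careers_list">\n<h4 class="department_title">Consulting</h4>\n</div>'
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="careers_list")

children = list(container.children)

# The newlines show up as string children next to the h4 tag.
kinds = [type(child).__name__ for child in children]

# Same filter as in the answer: real tags that carry a class attribute.
tags_with_class = [
    child for child in children
    if isinstance(child, Tag) and "class" in child.attrs
]
```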