How to exclude unwanted tags when using Beautifulsoup in Python: answers

【Question title】: How to exclude unwanted tags when using Beautifulsoup in Python
【Posted】: 2021-11-18 17:53:56
【Question description】:

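I am practicing web scraping in Python on Indeed.com using Beautifulsoup.

When extracting the job location from [div class companyLocation], what I want is the location string that comes right after 'div class="companyLocation"' (in the HTML below, "United States").

In some cases, however, there are extra "a aria-label" or "span" elements containing unwanted strings such as "+1 location".

I don't know how to get rid of these, so I am asking for your advice.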

<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect?tk=1fgg7b6pa306m001&amp;jk=d724dab9a2d2af2c&amp;dest=%2Fjobs%3Fq%3Dpython%26limit%3D50%26grpKey%3DkAO5nvwVmAPOkxWgAwHyBwN0Y2w%253D" rel="nofollow">
+1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>, United States+1 location•Remote

Here is my Python code for your reference. The problem occurs in the 'if a.string is None:' case.

You can see the div + span HTML shown above by using the following code: print(f"{a}, {a.text}")

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            pass

        #problem(below)
        if a.string is None:
            print(f"{a}, {a.text}")

【Question comments】:

Tags: python python-3.x beautifulsoup python-requests

【Solution 1】:

You mixed up the if statements. Try the following:

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            print(f"{a}, {a.text}")

Output:

<div class="companyLocation">United States</div>, United States
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Boulder, CO</div>, Boulder, CO
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Allen, TX</div>, Allen, TX
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York State</div>, New York State
<div class="companyLocation">Austin, TX</div>, Austin, TX
<div class="companyLocation">Research Triangle Park, NC</div>, Research Triangle Park, NC
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">Cary, NC</div>, Cary, NC
<div class="companyLocation">Raleigh, NC</div>, Raleigh, NC
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Bellevue, WA</div>, Bellevue, WA
<div class="companyLocation">Houston, TX</div>, Houston, TX

Now it works fine.

【Comments】:

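That is not my question. (The a.string is not None case works fine, as you said.) The problem is that when you run the case below, you get "unwanted" extra strings in addition to the location: if a.string is None: print(f"{a}, {a.text}")
The problem is the a.string is None case, where you get unwanted extra strings besides the location, as I mentioned above. Thanks anyway.
@Zerwing What is the expected output?
It should be the same as what you posted (the a.string is not None case). In the 'if a.string is None' case, you will see: United States '+5 location' (the '+5 location' is not wanted; in some cases it even creates a new line, as in 'United States \n +5 location')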
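
To illustrate the behaviour discussed in these comments, here is a minimal sketch. It assumes a trimmed-down, hard-coded copy of the HTML from the question rather than a live request to Indeed.com, and the shortened `href="#"` is a stand-in for illustration only. It shows when `.string` returns text and when it returns None.

```python
from bs4 import BeautifulSoup

# Hard-coded sample (an assumption for illustration, not live Indeed.com markup).
html = """
<div class="companyLocation">Austin, TX</div>
<div class="companyLocation"><span>Remote</span></div>
<div class="companyLocation">United States
<span><a class="more_loc" href="#">+1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.select("div.companyLocation")

# A single text child: .string returns that text.
print(repr(divs[0].string))

# A single child tag that itself has a single text child: .string looks inside it.
print(repr(divs[1].string))

# Several children: .string is None, while .text joins every descendant string,
# which is where the unwanted "+1 location" comes from.
print(repr(divs[2].string))
print(repr(divs[2].text))
```

So the None case is exactly the case where the div mixes its own text with nested tags, and `.text` cannot tell the two apart.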

【Solution 2】:

Does this work?

    #problem(below)
    if a.string is None:
        data=''
        for child in a.children:
            if not child.name and child != '':
                data+=child
        print(data)

【Comments】:

Not exactly what I wanted (I see blank lines), but thank you very much for your help!
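
The blank lines mentioned in this comment appear to come from the newline characters in the div's direct text nodes, which the `child != ''` check does not filter out. A possible variation, offered only as a sketch: strip each direct text child and skip the empty ones. The helper names `own_text` and `location_of` are hypothetical, and the HTML is a hard-coded sample based on the question, not live Indeed.com markup.

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Join only the tag's direct text children, skipping nested tags and blank strings."""
    parts = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:
                parts.append(text)
    return " ".join(parts)


def location_of(tag):
    """Prefer the div's own text; fall back to all text when the div has none of its own."""
    return own_text(tag) or tag.get_text(strip=True)


# Hard-coded sample (an assumption for illustration).
html = """
<div class="companyLocation">United States
<span><a class="more_loc" href="#">+1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>
<div class="companyLocation"><span>Remote</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
locations = [location_of(div) for div in soup.select("div.companyLocation")]
print(locations)
```

The fallback in `location_of` is there because a div such as `<div class="companyLocation"><span>Remote</span></div>` has no direct text of its own; whether "Remote" should count as a location in that case is a judgment call.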