【Question title】:How to exclude unwanted tags when using BeautifulSoup in Python
【Posted】:2021-11-18 17:53:56
【Question】:

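I am practicing Python web scraping on Indeed.com using BeautifulSoup.

When extracting the job location with [div class companyLocation], what I want is the location string that comes right after 'div class="companyLocation"' ("United States" in the HTML below).

In some cases, however, there are extra "a aria-label" or "span" elements that contain unwanted strings such as "+1 location".

I don't know how to get rid of these, so I am asking for your advice.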

<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect?tk=1fgg7b6pa306m001&amp;jk=d724dab9a2d2af2c&amp;dest=%2Fjobs%3Fq%3Dpython%26limit%3D50%26grpKey%3DkAO5nvwVmAPOkxWgAwHyBwN0Y2w%253D" rel="nofollow">
+1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>, United States+1 location•Remote

Here is my Python code for reference. The problem occurs in the 'if a.string is None:' case.

You can see the div + span HTML shown above by running: print(f"{a}, {a.text}")

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            pass

        #problem(below)
        if a.string is None:
            print(f"{a}, {a.text}")

【Comments】:

    Tags: python python-3.x beautifulsoup python-requests


    【Answer 1】:

    You mixed up the if statements; try the following:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.indeed.com/jobs?q=python&limit=50"
    
    extracts_url = requests.get(url)
    extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
    soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})
    
    for soup_job in soup_jobs:
        for a in soup_job.select("div.companyLocation"):
            if a.string is not None:
                print(f"{a}, {a.text}")
    

    Output:

    <div class="companyLocation">United States</div>, United States
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation">Boulder, CO</div>, Boulder, CO
    <div class="companyLocation">Houston, TX</div>, Houston, TX
    <div class="companyLocation">Allen, TX</div>, Allen, TX
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation">New York, NY</div>, New York, NY
    <div class="companyLocation">New York, NY</div>, New York, NY
    <div class="companyLocation">New York State</div>, New York State
    <div class="companyLocation">Austin, TX</div>, Austin, TX
    <div class="companyLocation">Research Triangle Park, NC</div>, Research Triangle Park, NC
    <div class="companyLocation">New York, NY</div>, New York, NY
    <div class="companyLocation">Cary, NC</div>, Cary, NC
    <div class="companyLocation">Raleigh, NC</div>, Raleigh, NC
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation"><span>Remote</span></div>, Remote
    <div class="companyLocation">Houston, TX</div>, Houston, TX
    <div class="companyLocation">Bellevue, WA</div>, Bellevue, WA
    <div class="companyLocation">Houston, TX</div>, Houston, TX
    

    Now it works fine.

    【Comments】:

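    • That is not my question. (The 'a.string is not None' case works fine, as you said.) The problem is that when you run the case below, you get "unwanted" extra strings in addition to the location: if a.string is None: print(f"{a}, {a.text}")
    • The problem is that in the 'a.string is None' case you get unwanted extra strings in addition to the location, as I mentioned above. Thanks anyway.
    • @Zerwing What is the expected output?
    • It should be the same as what you posted (the 'a.string is not None' case). In the 'if a.string is None' case you will see: United States '+5 location' (the '+5 location' is not wanted; in some cases it even creates a new line, as in 'United States \n +5 location').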
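
For readers following this thread: `.string` returns None whenever a tag has more than one child node, while `.text` concatenates the text of every descendant, which is why "+1 location" and "Remote" show up. One possible way to get only the location is to drop the nested span elements before reading the text. The sketch below is not from the original posts; it assumes the unwanted pieces are always direct `<span>` children, `location_without_spans` is a hypothetical helper name, and the sample markup is the question's snippet with the href shortened.

```python
import copy
from bs4 import BeautifulSoup

# Sample markup adapted from the question (href shortened for readability).
html = """<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect" rel="nofollow">
+1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>"""

div = BeautifulSoup(html, "html.parser").select_one("div.companyLocation")

# .string is None here because the div has several child nodes.
print(div.string)


def location_without_spans(tag):
    """Return the tag's text after removing its direct <span> children.

    Hypothetical helper: assumes everything unwanted lives inside a <span>.
    """
    clone = copy.copy(tag)  # work on a copy so the parsed tree stays intact
    for span in clone.find_all("span", recursive=False):
        span.decompose()  # delete the span together with its contents
    return clone.get_text(strip=True)


print(location_without_spans(div))
```

This is only meant for the `a.string is None` branch. A div such as `<div class="companyLocation"><span>Remote</span></div>` would come back empty, so the `a.string is not None` branch should keep using `a.string`.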
    【Answer 2】:

    Does this work?

        #problem(below)
        if a.string is None:
            data=''
            for child in a.children:
                if not child.name and child != '':
                    data+=child
            print(data)
    

    【Comments】:

    • Not exactly what I wanted (I see blank lines), but thank you very much for your help!
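
The blank lines mentioned in the comment above most likely come from whitespace-only text nodes between the span elements, plus the trailing newline after "United States": `child != ''` is still true for a string like "\n\n". A variant of Answer 2 that strips each direct text node and skips the empty ones might look like the sketch below. It is an illustration only; `direct_text` is a hypothetical helper name, and the sample markup is again the question's snippet with the href shortened.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup adapted from the question (href shortened for readability).
html = """<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect" rel="nofollow">
+1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>"""


def direct_text(tag):
    """Join the text nodes that sit directly inside tag, ignoring blank ones.

    Hypothetical helper: text inside child tags (span, a, ...) is skipped.
    """
    parts = []
    for child in tag.children:
        # Direct text is represented as NavigableString; child tags are Tag objects.
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # drop whitespace-only nodes such as "\n\n"
                parts.append(text)
    return " ".join(parts)


div = BeautifulSoup(html, "html.parser").select_one("div.companyLocation")
print(direct_text(div))
```

As with Answer 2, this only looks at the div's own text, so it fits the `a.string is None` branch. For a div whose only content is `<span>Remote</span>`, it returns an empty string.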