【Question title】: How to access data within nested span tags
【Posted】: 2019-06-13 20:38:21
【Question】:

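I have tried replacing each string, but I can't get it to work. I can get all the data between <span>...</span>, but not the text that sits after a tag that has already been closed. How can I do that? I also tried replacing the text afterwards, but could not manage it. I am quite new to Python.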

I have also tried for x in soup.find_all('/span', class_ = "textLarge textWhite"), but that does not print anything.

Relevant HTML:

<div style="width:100%; display:inline-block; position:relative; text- 
align:center; border-top:thin solid #fff; background-image:linear- 
gradient(#333,#000);">
    <div style="width:100%; max-width:1400px; display:inline-block; 
position:relative; text-align:left; padding:20px 15px 20px 15px;">
        <a href="/manpower-fit-for-military-service.asp" title="Manpower 
Fit for Military Service ranked by country">
            <div class="smGraphContainer"><img class="noBorder" 
src="/imgs/graph.gif" alt="Small graph icon"></div>
        </a>
        <span class="textLarge textWhite"><span 
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
    </div>
    <div class="blockSheen"></div>
</div>

Relevant Python code:

for y in soup.find_all('span', class_ = "textBold"):
    print(y.text) #this gets FIT-FOR-SERVICE:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    print(x.text) #this gets FIT-FOR-SERVICE: 18,740,382 but i only want the number 

Expected result: "18,740,382"

【Comments】:

    Tags: python html beautifulsoup request


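    【Solution 1】:

    I believe you have two options here:

    1 - Use a regular expression on the parent span tag to extract only the number.

    2 - Use the decompose() function to remove the child span tag from the tree, then extract the text, like this: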

    from bs4 import BeautifulSoup
    
    h = """<div style="width:100%; display:inline-block; position:relative; text-
    align:center; border-top:thin solid #fff; background-image:linear-
    gradient(#333,#000);">
        <div style="width:100%; max-width:1400px; display:inline-block;
    position:relative; text-align:left; padding:20px 15px 20px 15px;">
            <a href="/manpower-fit-for-military-service.asp" title="Manpower
    Fit for Military Service ranked by country">
                <div class="smGraphContainer"><img class="noBorder"
    src="/imgs/graph.gif" alt="Small graph icon"></div>
            </a>
            <span class="textLarge textWhite"><span
    class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
        </div>
        <div class="blockSheen"></div>
    </div>"""
    
    soup = BeautifulSoup(h, "lxml")
    soup.find('span', class_ = "textLarge textWhite").span.decompose()
    res = soup.find('span', class_ = "textLarge textWhite").text.strip()
    
    print(res)
    #18,740,382
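
Option 1 above (the regular expression) is not shown in the answer. A minimal sketch of it follows, assuming the number is the first run of digits and commas in the parent span's text. The pattern and the shortened `html` string are illustrative choices rather than part of the original answer, and `html.parser` is used here only to avoid the `lxml` dependency.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the relevant markup from the question
html = ('<span class="textLarge textWhite">'
        '<span class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>')

soup = BeautifulSoup(html, "html.parser")
parent_text = soup.find("span", class_="textLarge textWhite").text

# Assumed pattern: a digit followed by any mix of digits and commas
match = re.search(r"\d[\d,]*", parent_text)
number = match.group(0) if match else None
print(number)
```

Unlike decompose(), this leaves the parsed tree untouched, but it would pick up the wrong value if the label itself ever contained digits.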
    

    【Comments】:

      【Solution 2】:

      You can do it like this:

      soup.find('span', {'class':'textLarge textWhite'}).find('span').extract()
      output = soup.find('span', {'class':'textLarge textWhite'}).text.strip()
      

      Output:

      18,740,382
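
One difference from decompose() is that extract() returns the tag it removes, so the label can be kept alongside the number. The sketch below illustrates this on a shortened copy of the markup; the variable names are illustrative, and `html.parser` is an assumption made to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Shortened copy of the relevant markup from the question
html = ('<span class="textLarge textWhite">'
        '<span class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>')

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("span", {"class": "textLarge textWhite"})

# extract() detaches the child span from the tree and returns it
label_tag = parent.find("span").extract()
label = label_tag.text.strip()

# With the child gone, the parent's text is just the number
value = parent.text.strip()
print(label, value)
```

Both extract() and decompose() modify the parsed tree, so later searches on the same soup will no longer find the inner span.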

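      【Comments】:

        【Solution 3】:

        Instead of using x.text to get the text, you can use x.find_all(text=True, recursive=False), which gives you all the top-level text of the node (as a list of strings) without descending into child nodes. Here is an example using your data: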

        for x in soup.find_all('span', class_ = "textLarge textWhite"):
            res = x.find_all(text=True, recursive=False)
            # join and strip the strings then print
            print(" ".join(map(str.strip, res)))
        
        # prints: 18,740,382
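
Two variations on the same idea, sketched on a shortened copy of the markup. In Beautiful Soup 4.4.0 and later the `text` argument is also available under the name `string`, and the text node that follows the inner span can be reached through its `next_sibling`. The sketch assumes the number is the node immediately after the `textBold` span, and uses `html.parser` only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened copy of the relevant markup from the question
html = ('<span class="textLarge textWhite">'
        '<span class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>')

soup = BeautifulSoup(html, "html.parser")

for x in soup.find_all("span", class_="textLarge textWhite"):
    # Same search as above, using the newer name of the text argument
    top_level = [s.strip() for s in x.find_all(string=True, recursive=False)]

    # Alternative: take the text node right after the inner label span
    label = x.find("span", class_="textBold")
    value = label.next_sibling.strip()

    print(top_level, value)
```

Neither variant modifies the tree, which helps if the inner span is still needed later.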
        

        【Comments】:
