【Question title】: Untagged text extraction with Python is not working
【Posted】: 2017-09-21 03:40:25
【Question】:

I want to extract 1626 from the markup below using Python and Beautiful Soup. I tried the answer to "Accessing untagged text using beautifulsoup", but all I get is an empty list [].

<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>

How can I extract the number?

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    You can loop over the lines of the HTML and use a regular expression to find what you need.

    import bs4, re
    
    page = """
    <div class="columns">
    <h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
                Laundry Dry Cleaning Equipment
                <br>
    
                <br>
    </h1>
    
            1626 Total Items
        5526 Total Items
                        4426 Total Items
    <!-- br-->
    <div>...</div>
    </div>"""
    
    soup = bs4.BeautifulSoup(page, 'lxml')
    
    divs = soup.find_all('div', {'class': 'columns'})
    div = divs[0]    # we only have one div
    
    divtext = str(div).split('\n')   # get the div's HTML and split it into lines
    for line in divtext:
        line = line.strip()
    
        # match wanted pattern
        match = re.match(r'^(\d+)\s*Total Items$', line)
    
        if match is not None:     #if match found
            print(match.group(1)) # extract the number
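
The line-by-line regex above works on the serialized HTML. As an alternative, here is a minimal sketch that reads the untagged text straight from the parse tree, assuming the number is always the first non-empty bare text node that follows the `<h1>` inside `div.columns`. It uses the built-in `html.parser` backend instead of `lxml`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("div", class_="columns").find("h1")

total = None
for node in h1.next_siblings:
    # Keep bare text nodes only: skip tags, and skip comments
    # (Comment is a subclass of NavigableString, so exclude it explicitly).
    if isinstance(node, NavigableString) and not isinstance(node, Comment):
        text = node.strip()
        if text:
            # Assumption: the text looks like "<number> Total Items".
            total = int(text.split()[0])
            break

print(total)
```

If the page layout differs (for example, the count sits before the `<h1>`), the sibling walk would need to be adjusted accordingly.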
    

    【Comments】:

      【Solution 2】:

      I tried to follow the same convention as in the link you attached to the question above.

      Hope this is what you are looking for.

      Code:

      from bs4 import BeautifulSoup

      data = """
      <div class="columns">
      <h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
                  Laundry Dry Cleaning Equipment
                  <br>
      
                  <br>
      </h1>
      
              1626 Total Items
      <!-- br-->
      <div>...</div>
      </div>
      """
      soup = BeautifulSoup(data, 'html.parser')
      for i in soup.find_all(string=True):
          if "Total Items" in i:
              # strip surrounding whitespace and newlines, then drop the label
              print(i.strip().replace('Total Items', '').strip())
      

      Output:

      1626
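
A possible variation on the loop above is to let Beautiful Soup do the text-node search with a compiled regular expression and capture the digits directly. This is only a sketch: `extract_total` and `PATTERN` are hypothetical names, and it assumes the label is always spelled "Total Items".

```python
import re
from bs4 import BeautifulSoup

# Digits followed by the label; group 1 captures the number.
PATTERN = re.compile(r"(\d+)\s*Total Items")

def extract_total(html):
    """Return the item count as an int, or None if no matching text node exists."""
    soup = BeautifulSoup(html, "html.parser")
    # string=<compiled regex> returns the first text node the pattern matches.
    node = soup.find(string=PATTERN)
    if node is None:
        return None
    return int(PATTERN.search(node).group(1))

sample = """
<div class="columns">
<h1 class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""

print(extract_total(sample))
```

Returning `None` when nothing matches lets the caller tell "no count on the page" apart from a count of zero.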
      

      【Comments】:
