[Question title]: Need to extract all the font sizes and the text using beautifulsoup
[Posted]: 2016-12-25 01:29:37
[Question description]:

I have the following HTML file stored on my local system:

<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2 
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4 
<br>• six txt5
<br></span>

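I need to extract every font size that appears in this HTML file. I am using beautifulsoup, but I only know how to extract the text.

I can extract the text with the following code: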

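from bs4 import BeautifulSoup

# Open the file in a context manager and name the parser explicitly
with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')

# Collect every text node in the document
texts = soup.find_all(string=True)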

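I need to extract the font size for each piece of text and store the font-text pairs in a list or array. Specifically, I want a data structure like [('One','30'),('Two','15')], where 30 comes from font-size:30px and 15 comes from font-size:15px.

The only problem is that I cannot find a way to get the font-size values. Any ideas?

[Comments]:

    Tags: python html fonts beautifulsoup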


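    [Solution 1]:

    You will have to do some research on your own; the Beautiful Soup documentation and the regex documentation are both worth reading to get a feel for how things work.

    Have a look at the following example. It uses a regular expression to find the first occurrence of font-size in each style attribute, then splits the remainder so that only the pixel count is kept.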

    from bs4 import BeautifulSoup as Soup
    from bs4 import Tag
    import re
    
    data = """
      <span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
      <div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
      <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;">
        <span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
        <br></span>
      </div>
      <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
      <br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
      <br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
      <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
      <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
      <br>five txt2 
      <br>five txt3
      <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
      <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
      <br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
      <br>- six txt2
      <br> six txt3
      <br> six txt4 
      <br> six txt5
      <br></span>
    """
    soup = Soup(data, 'html.parser')
    
    def get_the_start_of_font(attr):
      """ Return the index of the 'font-size' first occurrence or None. """
      match = re.search(r'font-size:', attr)
      if match is not None:
        return match.start()
      return None 
    
    def get_font_size_from(attr):
      """ Return the font size as string or None if not found. """
      font_start_i = get_the_start_of_font(attr)
      if font_start_i is not None:
        return str(attr[font_start_i + len('font-size:'):].split('px')[0])
      return None
    
    # iterate through all descendants:
    fonts = []
    for child in soup.descendants:
      if isinstance(child, Tag) and child.get('style') is not None:
        font = get_font_size_from(child.get('style'))
        if font is not None:
          fonts.append([
            str(child.text).strip(), font])
    
    print(fonts)
    

    The solution can be improved, but it is a working example.
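
    One possible tightening of the example above, offered as a sketch rather than the answerer's own code: a single regular expression with a capture group can stand in for the index-and-split helpers. The names `FONT_SIZE_RE` and `font_size_pairs` and the shortened `sample` markup are illustrative assumptions.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: captures the digits between 'font-size:' and 'px',
# tolerating optional whitespace after the colon.
FONT_SIZE_RE = re.compile(r'font-size:\s*(\d+)px')

def font_size_pairs(html):
    """Return (text, size) tuples for every tag whose style sets font-size."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for tag in soup.find_all(style=True):  # only tags that carry a style attribute
        match = FONT_SIZE_RE.search(tag['style'])
        if match:
            pairs.append((tag.get_text().strip(), match.group(1)))
    return pairs

# Shortened stand-in for the question's markup (assumption for illustration)
sample = (
    '<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>'
    '<div style="left:45px;"><span style="font-family: X; font-size:30px">One\n<br></span></div>'
    '<div><span style="font-size:15px">Two</span>'
    '<span style="font-size:16px">: two txt\n<br></span></div>'
)
print(font_size_pairs(sample))
```

    Tags without a font-size declaration, such as the positioning divs, are skipped because the pattern does not match their style attribute.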

    [Comments]:

    • Thanks for the A2A. I will definitely read the BeautifulSoup and regex documentation.
    [Solution 2]:

    Hope this helps. I suggest reading more of the BeautifulSoup documentation.

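    from bs4 import BeautifulSoup
    import re  # needed for re.search below

    # Name the parser explicitly and close the file when done
    with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
        soup = BeautifulSoup(htmlData, 'html.parser')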
    
    font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]
    output = []
    for i in font_spans:
        tup = ()
        fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2)
        tup = (str(i.text).strip(), fonts_size.strip())
        output.append(tup)
    
    print(output)
    [('One', '30'),('Two', '15'), ....]
    

    If you want to exclude text values that contain txt, you can add if 'txt' not in i.text:

    Explanation:

    First, you need to identify the tags that contain font-size,

    font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]
    

    Then you need to iterate over font_spans and extract the font size and the text value,

    textvalue = i.text # One,Two..
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2) # 30, 15, 16..
    

    Finally, you need to build a list that holds every result as a tuple.
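
    The three steps above, together with the optional 'txt' filter, could be wrapped up roughly as follows. This is a sketch under assumptions: the function name `collect_font_sizes`, its `exclude` parameter, and the shortened `sample` markup are made up for illustration and are not part of the answer. It also checks each span's own style attribute rather than the whole tag string.

```python
import re
from bs4 import BeautifulSoup

def collect_font_sizes(html, exclude=None):
    """Return (text, size) tuples for spans that declare a font-size.

    `exclude` is a hypothetical parameter: spans whose text contains
    this substring are skipped.
    """
    soup = BeautifulSoup(html, 'html.parser')
    output = []
    for span in soup.select('span'):
        # Step 1: keep only spans whose own style mentions font-size
        found = re.search(r'(?is)font-size:(.*?)px', span.get('style', ''))
        if found is None:
            continue
        # Step 2: extract the text, applying the optional filter
        text = span.text.strip()
        if exclude is not None and exclude in text:
            continue
        # Step 3: store the pair as a tuple
        output.append((text, found.group(1).strip()))
    return output

# Shortened stand-in for the question's markup (assumption for illustration)
sample = (
    '<span style="border: gray 1px solid;"></span>'
    '<span style="font-family: B; font-size:15px">Two</span>'
    '<span style="font-family: C; font-size:16px">: two txt\n<br></span>'
)
print(collect_font_sizes(sample))
print(collect_font_sizes(sample, exclude='txt'))
```

    Skipping spans where the search returns None avoids calling group on a failed match, which would otherwise raise an AttributeError.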

    [Comments]:

      [Solution 3]:

      You can use the CSS selector select("[style*=font-size]") to find tags whose style attribute contains font-size, and a regular expression to extract the value:

      In [12]: from bs4 import BeautifulSoup
      
      In [13]: import re
      
      In [14]: soup = BeautifulSoup(html, "html.parser")
      
      In [15]: patt = re.compile(r"font-size:(\d+)")
      
      In [16]: [(tag.text.strip(), patt.search(tag["style"]).group(1)) for tag in soup.select("[style*=font-size]")]
      Out[16]: 
      [('One', '30'),
       ('Two', '15'),
       (': two txt', '16'),
       ('Three', '15'),
       (': Three txt', '16'),
       ('Four', '15'),
       (': Four txt', '16'),
       ('FIVE', '19'),
       ('five txt\nfive txt2\nfive txt3', '18'),
       ('SIX', '19'),
       ('six txt', '17'),
       ('six txt2\n- six txt2\n• six txt3\n• six txt4\n• six txt5', '18')]
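
      The session above assumes a variable `html` that already holds the markup. A self-contained sketch of the same idea might look like this; the `html` string is a shortened stand-in for the question's file, and converting the sizes to `int` is an optional extra that is not in the answer.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption for illustration)
html = (
    '<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>'
    '<div style="left:274px;"><span style="font-family: B; font-size:19px">FIVE\n<br></span></div>'
    '<div style="left:45px;"><span style="font-family: C; font-size:18px">five txt\n<br>five txt2</span></div>'
)

soup = BeautifulSoup(html, "html.parser")
patt = re.compile(r"font-size:(\d+)")

# Select any tag (not only span) whose style attribute contains 'font-size'
pairs = [
    (tag.text.strip(), int(patt.search(tag["style"]).group(1)))
    for tag in soup.select("[style*=font-size]")
]
print(pairs)
```

      Because the selector already guarantees that font-size appears in the style attribute, the search only fails when the value is not written as plain digits.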
      

      [Comments]:

      • This is more general than the other answers.