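【Question title】: Get all tags with a specific style using BeautifulSoup in Python
【Posted】: 2021-03-31 15:51:01
【Description】:

I am trying to parse some HTML files and extract all text that is set in a specific font size. For example, I want the text of every tag (p, div, span, and so on) whose style includes font-size:10px. I am using BeautifulSoup to parse the HTML files and extract the data I need. To pull out the tags with a specific font size I used the following Python script, but it does not work: it matches nothing.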

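from bs4 import BeautifulSoup

file = open("file.html", "r")
soup = BeautifulSoup(file.read(), features="html.parser")
# Compares the whole style attribute to this exact string, so nothing matches
main_texts = soup.findAll(attrs={"style": "font-size:10px"})
whole_text = ""
for item in main_texts:
    whole_text += item.getText()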

Also, my HTML content looks like this:

<span style="font-family: TimesNewRomanPSMT; font-size:10px">   </span><span style="font-family: ABCDEE+Calibri-Bold; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:14px">Academic Qualifications <span style="position:absolute; border: black 1px solid; left:36px; top:143px; width:159px; height:0px;"></span>
</span><span style="font-family: TimesNewRomanPSMT; font-size:10px">2012-Current  CPA (Australia)    CPA (Aust.) Holder </span><span style="font-family: TimesNewRomanPSMT; font-size:14px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">2010-2012  Royal Melbourne Institute of Technology     Bachelor in Business (Accountancy)       2006-2009         Ngee Ann Polytechnic                            Diploma in Accountancy  2002-2005        Henderson Secondary School                           GCE ‘O’ Levels  1996-2001        River Valley Primary School </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">                            PSLE </span><span style="font-family: ABCDEE+Calibri; font-size:15px">  Co-Curriculum Achievements <span style="position:absolute; border: black 1px solid; left:36px; top:399px; width:189px; height:1px;"></span>
</span><span style="font-family: TimesNewRomanPSMT; font-size:10px">NPCC (2002 -2005)     -National Youth Achievement Award (Bronze)       -CCA Merit Award (NPCC)  Ngee Ann Poly (2006 – 2009)   -Freshmen Orientation Sub-Committee Member       -Freshmen Recruitment Sub-Committee Member       -Ngee Ann Canoeing Club Member </span><span style="font-family: ABCDEE+Cambria; font-size:12px"> Nippon Yusen Kaisha (2011- 2012) - Social Recreational Committee Member </span><span style="font-family: ABCDEE+Calibri; font-size:15px"> Personal Skills <span style="position:absolute; border: black 1px solid; left:36px; top:549px; width:91px; height:1px;"></span>
</span><span style="font-family: TimesNewRomanPS-BoldMT; font-size:10px">Software Skills </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">1.</span><span style="font-family: ArialMT; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">Microsoft Office (Excel, Pivot, V-lookup, Powerpoint) Excellent </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">2.</span><span style="font-family: ArialMT; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">Bloomberg       Intermediate </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">3.</span><span style="font-family: ArialMT; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">Thomson Reuters      Intermediate 4.</span>

Do you have any idea how to solve this?

【Comments】:

    Tags: python beautifulsoup


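    【Solution 1】:

    You can use the [attribute*="value"] CSS selector, which matches any element whose attribute value contains the given substring.

    To use a CSS selector, call the .select() method instead of .find_all().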

    ...
    # Select every element whose `style` attribute contains `font-size:10px`
    main_text = soup.select('[style*="font-size:10px"]')
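
For a self-contained illustration, here is a minimal sketch of the selector approach next to a slightly more tolerant variant. The sample markup, the helper names `text_by_substring` and `text_by_pattern`, and the regular expression are assumptions made for this example and are not part of the original answer. The substring selector only matches the exact spelling `font-size:10px`, so a style written as `font-size: 10px` (with a space) would be skipped; passing a compiled pattern to `.find_all()` is one possible way to allow for that.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the HTML in the question.
SAMPLE_HTML = (
    '<span style="font-family: TimesNewRomanPSMT; font-size:10px">Software Skills </span>'
    '<span style="font-family: ArialMT; font-size:11px">skip me </span>'
    '<span style="font-family: TimesNewRomanPSMT; font-size: 10px">Bloomberg </span>'
)

# Assumed pattern: allow optional whitespace after the colon.
FONT_SIZE_10PX = re.compile(r"font-size:\s*10px")


def text_by_substring(html):
    """Join the text of elements whose style contains 'font-size:10px' verbatim."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select('[style*="font-size:10px"]')
    return "".join(tag.get_text() for tag in matches)


def text_by_pattern(html):
    """Join the text of elements whose style matches the tolerant pattern."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(style=FONT_SIZE_10PX)
    return "".join(tag.get_text() for tag in matches)


if __name__ == "__main__":
    print(repr(text_by_substring(SAMPLE_HTML)))
    print(repr(text_by_pattern(SAMPLE_HTML)))
```

With this sample, the substring selector picks up only the first span, while the pattern version also picks up the one written with a space. If the HTML always comes from the same converter and uses a consistent spelling, the plain selector from the answer is likely sufficient.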
    

    【Comments】:
