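【Posted】: 2021-03-31 15:51:01
【Question】:
I am trying to parse some HTML files and want to extract all text that has a particular font size. For example, I want to get the text of every p, div, span, etc. tag whose style contains font-size:10px. I am using BeautifulSoup to parse the HTML files and pull out the data I need. To extract tag data by font size, I wrote the following Python script, but it does not work.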
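from bs4 import BeautifulSoup

whole_text = ""
with open("file.html", "r") as file:
    soup = BeautifulSoup(file.read(), features="html.parser")
# Look up the elements whose style attribute is "font-size:10px"
main_texts = soup.find_all(attrs={"style": "font-size:10px"})
for item in main_texts:
    whole_text += item.get_text()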
Also, my HTML content looks like this:
<span style="font-family: TimesNewRomanPSMT; font-size:10px"> </span><span style="font-family: ABCDEE+Calibri-Bold; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:14px">Academic Qualifications <span style="position:absolute; border: black 1px solid; left:36px; top:143px; width:159px; height:0px;"></span>
</span><span style="font-family: TimesNewRomanPSMT; font-size:10px">2012-Current CPA (Australia) CPA (Aust.) Holder </span><span style="font-family: TimesNewRomanPSMT; font-size:14px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">2010-2012 Royal Melbourne Institute of Technology Bachelor in Business (Accountancy) 2006-2009 Ngee Ann Polytechnic Diploma in Accountancy 2002-2005 Henderson Secondary School GCE ‘O’ Levels 1996-2001 River Valley Primary School </span><span style="font-family: TimesNewRomanPSMT; font-size:10px"> PSLE </span><span style="font-family: ABCDEE+Calibri; font-size:15px"> Co-Curriculum Achievements <span style="position:absolute; border: black 1px solid; left:36px; top:399px; width:189px; height:1px;"></span>
</span><span style="font-family: TimesNewRomanPSMT; font-size:10px">NPCC (2002 -2005) -National Youth Achievement Award (Bronze) -CCA Merit Award (NPCC) Ngee Ann Poly (2006 – 2009) -Freshmen Orientation Sub-Committee Member -Freshmen Recruitment Sub-Committee Member -Ngee Ann Canoeing Club Member </span><span style="font-family: ABCDEE+Cambria; font-size:12px"> Nippon Yusen Kaisha (2011- 2012) - Social Recreational Committee Member </span><span style="font-family: ABCDEE+Calibri; font-size:15px"> Personal Skills <span style="position:absolute; border: black 1px solid; left:36px; top:549px; width:91px; height:1px;"></span>
</span><span style="font-family: TimesNewRomanPS-BoldMT; font-size:10px">Software Skills </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">1.</span><span style="font-family: ArialMT; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">Microsoft Office (Excel, Pivot, V-lookup, Powerpoint) Excellent </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">2.</span><span style="font-family: ArialMT; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">Bloomberg Intermediate </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">3.</span><span style="font-family: ArialMT; font-size:11px"> </span><span style="font-family: TimesNewRomanPSMT; font-size:10px">Thomson Reuters Intermediate 4.</span>
Do you have any idea how to solve this?
【Discussion】:
Tags: python beautifulsoup
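One possible approach, offered as a sketch rather than a definitive answer: when a plain string is passed for an attribute, BeautifulSoup compares it against the full attribute value, so "font-size:10px" never equals a style such as "font-family: TimesNewRomanPSMT; font-size:10px". Passing a compiled regular expression instead makes BeautifulSoup search within the value. The helper name `extract_text_by_font_size`, the constant `FONT_SIZE_10`, and the shortened sample markup (including the "skip me" span) are illustrative assumptions, not part of the original post.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical pattern: finds "font-size:10px" anywhere inside a style value,
# allowing optional whitespace around the colon.
FONT_SIZE_10 = re.compile(r"font-size\s*:\s*10px\b")


def extract_text_by_font_size(html, pattern=FONT_SIZE_10):
    """Concatenate the text of every tag whose style matches the pattern."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex is searched against the attribute value,
    # so a partial match inside a longer style string is enough.
    matches = soup.find_all(style=pattern)
    return "".join(tag.get_text() for tag in matches)


# Illustrative sample, shortened from the HTML in the question.
sample = (
    '<span style="font-family: TimesNewRomanPSMT; font-size:10px">Software Skills </span>'
    '<span style="font-family: ArialMT; font-size:11px">skip me</span>'
    '<span style="font-family: TimesNewRomanPSMT; font-size:10px">Bloomberg Intermediate </span>'
)

print(extract_text_by_font_size(sample))
```

Note that if a matching tag is nested inside another matching tag, its text would be collected twice; in the HTML shown in the question, the nested spans carry no font-size, so this does not arise there.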