【Question title】: XML parsing returns HTML; how do I get its text in Python?
【Posted】: 2015-11-08 19:47:28
【Question description】:

I am parsing an XBRL file with minidom. Using getElementsByTagName, I found the following:
<table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse"  width="100%"><tr><td colspan="1">Independent auditor's report on the financial statements</td></tr></table><br><table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse"  width="100%"><tr><td colspan="1">We have audited the financial statements of KPMG Statsautoriseret Revisionspartnerselskab for the financial year 11 December 2013 – 31 December 2014. The financial statements comprise income statement, balance sheet, statement of changes in equity, cash flow statement accounting policies and notes. The financial statements are prepared in accordance with the Danish Financial Statements Act.</td></tr></table>

Now I only want to extract the text from it. How should I proceed? Should I switch to BeautifulSoup from this point on?

The whole file can be found here; the field I am looking at is <arr:AuditorsReportOnFinancialStatements

【Comments on the question】:

    Tags: python xml python-2.7 xml-parsing html-parsing


    【Solution 1】:
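    from bs4 import BeautifulSoup

    # auditorsReportOnAuditedFS is the NodeList returned by getElementsByTagName;
    # this assumes the element's first child is a text node holding the HTML markup.
    soup = BeautifulSoup(auditorsReportOnAuditedFS[0].firstChild.data, 'html.parser')
    items = soup.find_all('td')
    listForString = []
    for item in items:
        # .encode('utf-8') yields byte strings, as on Python 2.7; omit it on Python 3.
        listForString.append(item.text.encode('utf-8').strip())
    result.append(' : '.join(['AuditorsReportOnFinancialStatements', ' - '.join(listForString)]))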
    

    This works.
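
For readers who would rather not add a dependency, the same idea can be sketched with the standard library alone. This is not the answer's method: it swaps BeautifulSoup for `html.parser.HTMLParser`, targets Python 3, and assumes (as the answer does) that the element's first child is a text node containing escaped HTML. The `CellTextExtractor` class, the `cell_texts` helper, the `SAMPLE` document, and the `urn:example` namespace are all made up for illustration. An HTML parser is used for the inner markup because the fragment contains an unclosed `<br>`, which an XML parser would reject.

```python
from html.parser import HTMLParser
from xml.dom import minidom


class CellTextExtractor(HTMLParser):
    """Collect the text of each <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []    # finished cell texts
        self._depth = 0    # greater than 0 while inside a <td>
        self._parts = []   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def cell_texts(html_fragment):
    """Return the stripped text of every <td> in an HTML fragment."""
    parser = CellTextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return parser.cells


# Made-up stand-in for the XBRL file: escaped HTML inside an XML element.
SAMPLE = (
    '<root xmlns:arr="urn:example">'
    '<arr:AuditorsReportOnFinancialStatements>'
    '&lt;table&gt;&lt;tr&gt;&lt;td colspan="1"&gt;Independent auditor\'s report'
    '&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br&gt;'
    '&lt;table&gt;&lt;tr&gt;&lt;td&gt;We have audited the financial statements.'
    '&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;'
    '</arr:AuditorsReportOnFinancialStatements>'
    '</root>'
)

doc = minidom.parseString(SAMPLE)
node = doc.getElementsByTagName("arr:AuditorsReportOnFinancialStatements")[0]
cells = cell_texts(node.firstChild.data)
line = " : ".join(["AuditorsReportOnFinancialStatements", " - ".join(cells)])
print(line)
```

The joined `line` follows the same `name : cell - cell` layout as the answer's `result` entries. For messier real-world markup, BeautifulSoup is likely the more forgiving choice.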

    【Comments】:
