【Question title】: RegEx for XML parsing in BeautifulSoup [duplicate]
【Posted】: 2019-05-20 00:51:20
【Question】:

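I need to parse a file, specifically an XBRL file, with BeautifulSoup and the XML parser. However, the output differs depending on whether I use the lxml parser or the XML parser, and the regular expression that works with the lxml parser finds nothing with the XML parser. I have included the output of the script.

I need the XML parser because the tag names contain capital letters (the lxml parser lowercases them), and I use a regex because the tag names vary from file to file and contain the ":" character.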

import re
from bs4 import BeautifulSoup

# xbrl holds the contents of the XBRL file
soup = BeautifulSoup(xbrl, 'xml')
soup.find_all(re.compile('ifrs-full'))
# output: []

# But if I use the lxml parser and the same regex, I get:

soup = BeautifulSoup(xbrl, 'lxml')
soup.find_all(re.compile('ifrs-full'))
# output:
[<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_PerdidasFiscales_1" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_UnusedTaxLossesMember" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
 <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="TrimestreAcumuladoActual" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>]

How can I solve this?

【Comments】:

  • With xml as the parser, could you try the following and let me know the output?
        for i in soup.find_all():
            if 'ifrs-full' in str(i) and i.attrs != {}:
                print(i)

Tags: python regex xml beautifulsoup


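【Solution 1】:

Designing a regular expression for this task is probably not the best idea. If you must, though, you can use capturing groups and collect the data you need step by step: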

<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>

If the trailing comma should be matched as well, simply append an optional comma:

<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>,?

RegEx

If this expression is not what you want, you can modify or change it on regex101.com.

RegEx Circuit

jex.im is also useful for visualizing the expression.

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"<(.+?):([a-z]+)\s(contextref)(=\")(.+?)\"\s(decimals)(=\")(.+?)\"\s(unitref)(=\")(.+?)\">(.+?)<\/(.+?):([a-z]+)>"

test_str = ("<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"Duration_Actual_PerdidasFiscales_1\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,\n"
    "<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"Duration_Actual_UnusedTaxLossesMember\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,\n"
    " <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"TrimestreAcumuladoActual\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>")

# finditer yields one match object per element found in test_str
for match_num, match in enumerate(re.finditer(regex, test_str), start=1):
    print("Match {} was found at {}-{}: {}".format(
        match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; group 0 is the whole match
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(
            group_num, match.start(group_num), match.end(group_num),
            match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
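
With fifteen numbered groups it is easy to lose track of which index holds what. A more readable variant, as a minimal sketch, uses named groups and re.VERBOSE. The identifier FACT_RE and the group names are my own choices, not part of the expression above. The sketch assumes the attributes always appear in the order contextref, decimals, unitref, and unlike the original [a-z]+ it also accepts upper-case letters in the tag name.

```python
import re

# Named-group rewrite of the pattern above. The names are illustrative.
FACT_RE = re.compile(
    r"""
    <(?P<prefix>[\w.-]+):(?P<name>[A-Za-z]+)   # opening tag, e.g. ifrs-full:SomeName
    \s+contextref="(?P<contextref>[^"]*)"
    \s+decimals="(?P<decimals>[^"]*)"
    \s+unitref="(?P<unitref>[^"]*)"
    \s*>
    (?P<value>[^<]*)                            # text content of the element
    </(?P=prefix):(?P=name)>                    # closing tag must repeat the same name
    """,
    re.VERBOSE,
)

sample = (
    '<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity '
    'contextref="TrimestreAcumuladoActual" decimals="-3" unitref="CLP">'
    '-4088611000'
    '</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>'
)

# One dict per matched element, keyed by group name
facts = [m.groupdict() for m in FACT_RE.finditer(sample)]
for fact in facts:
    print(fact["prefix"], fact["contextref"], fact["value"])
```

Each entry in facts is a plain dict, so fact["decimals"] or fact["unitref"] can be read by name instead of by group number.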

Demo

This snippet only illustrates how the capturing groups work:

const regex = /<(.+?):([a-z]+)\s(contextref)(=\")(.+?)\"\s(decimals)(=\")(.+?)\"\s(unitref)(=\")(.+?)\">(.+?)<\/(.+?):([a-z]+)>/gm;
const str = `<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_PerdidasFiscales_1" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_UnusedTaxLossesMember" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
 <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="TrimestreAcumuladoActual" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
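
If a regex is not a hard requirement, a namespace-aware parser sidesteps both the ":" in the tag names and the upper-case letters. The following is only a sketch, and it swaps in xml.etree.ElementTree from the standard library instead of BeautifulSoup. The sample document, its namespace URIs, and the helper name elements_with_prefix are made up for illustration; a real XBRL file declares its own URIs. It also assumes each prefix is bound to a single URI in the file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed XBRL instance; the namespace URIs are placeholders.
XBRL = """<xbrli:xbrl xmlns:xbrli="http://example.com/xbrli"
            xmlns:ifrs-full="http://example.com/ifrs-full">
  <ifrs-full:DeferredTaxRelatingToItemsChargedOrCreditedDirectlyToEquity
      contextRef="TrimestreAcumuladoActual" decimals="-3" unitRef="CLP">-4088611000</ifrs-full:DeferredTaxRelatingToItemsChargedOrCreditedDirectlyToEquity>
  <xbrli:unit id="CLP"/>
</xbrli:xbrl>
"""


def elements_with_prefix(xml_bytes, prefix):
    """Return all elements whose tag lies in the namespace bound to `prefix`."""
    namespaces = {}
    root = None
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns", "start"))
    for event, payload in events:
        if event == "start-ns":
            ns_prefix, uri = payload       # payload is a (prefix, uri) tuple
            namespaces[ns_prefix] = uri
        elif root is None:
            root = payload                 # the first "start" event is the root element
    uri = namespaces.get(prefix)
    if uri is None:
        return []
    # ElementTree stores qualified names as "{uri}LocalName"
    marker = "{" + uri + "}"
    return [el for el in root.iter() if el.tag.startswith(marker)]


facts = elements_with_prefix(XBRL.encode("utf-8"), "ifrs-full")
for el in facts:
    local_name = el.tag.split("}", 1)[1]
    print(local_name, el.get("contextRef"), el.text)
```

Because the match is done on the namespace rather than on the tag text, the original casing of the element and attribute names is preserved.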

【Comments】:
