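Designing a regular expression for this task may not be the best idea. However, if we have to, we can use capturing groups and collect the desired data step by step: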
<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>
If the trailing comma should also be matched, we can simply change it to:
<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>,?
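As a rough sketch of how the capturing groups above could be collected into named fields, the following Python example picks out the groups that carry data: groups 1 and 2 hold the prefix and element name, groups 5, 8 and 11 hold the three attribute values, and group 12 holds the element text. The helper name `extract_facts`, the dictionary keys, and the shortened sample element are assumptions made for illustration; they are not part of the original input.

```python
import re

# Same pattern as above, including the optional trailing comma.
PATTERN = re.compile(
    r'<(.+?):([a-z]+)\s(contextref)(=")(.+?)"\s(decimals)(=")(.+?)"'
    r'\s(unitref)(=")(.+?)">(.+?)<\/(.+?):([a-z]+)>,?'
)

# Hypothetical, shortened sample; the element name "equity" is illustrative only.
SAMPLE = (
    '<ifrs-full:equity contextref="TrimestreAcumuladoActual" '
    'decimals="-3" unitref="CLP">-4088611000</ifrs-full:equity>,'
)


def extract_facts(text):
    """Return one dict per matched element, keyed by (assumed) field names."""
    facts = []
    for m in PATTERN.finditer(text):
        facts.append({
            "prefix": m.group(1),       # namespace prefix before the colon
            "name": m.group(2),         # element name after the colon
            "contextref": m.group(5),   # value of the contextref attribute
            "decimals": m.group(8),     # value of the decimals attribute
            "unitref": m.group(11),     # value of the unitref attribute
            "value": m.group(12),       # text between the opening and closing tags
        })
    return facts


facts = extract_facts(SAMPLE)
print(facts)
```

Because of the final `,?`, the full match (`m.group(0)`) includes the trailing comma when one is present, while the captured groups are unaffected.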
RegEx
If this expression is not quite what you need, you can modify and test it at regex101.com.
RegEx Circuit
jex.im can also help visualize the expression.
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"<(.+?):([a-z]+)\s(contextref)(=\")(.+?)\"\s(decimals)(=\")(.+?)\"\s(unitref)(=\")(.+?)\">(.+?)<\/(.+?):([a-z]+)>"
test_str = ("<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"Duration_Actual_PerdidasFiscales_1\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,\n"
"<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"Duration_Actual_UnusedTaxLossesMember\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,\n"
" <ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref=\"TrimestreAcumuladoActual\" decimals=\"-3\" unitref=\"CLP\">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
This snippet is only meant to show how the capturing groups work:
const regex = /<(.+?):([a-z]+)\s(contextref)(=\")(.+?)\"\s(decimals)(=\")(.+?)\"\s(unitref)(=\")(.+?)\">(.+?)<\/(.+?):([a-z]+)>/gm;
const str = `<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_PerdidasFiscales_1" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="Duration_Actual_UnusedTaxLossesMember" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>,
<ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity contextref="TrimestreAcumuladoActual" decimals="-3" unitref="CLP">-4088611000</ifrs-full:deferredtaxrelatingtoitemschargedorcrediteddirectlytoequity>`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}