Substitute xml tag value using awk [duplicate]

【Question title】: Substitute xml tag value using awk [duplicate]
【Posted】: 2022-02-22 17:06:59
【Question description】:

A script that uses awk to extract tag values

XML

2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>sam</name><phone>98762123</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>sam</name><phone>123456789</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

Output file

2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>12-09-77</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

PS: I know awk is not the best choice for parsing XML, but I have no alternative here.

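【Question comments】:

Please edit your question to either explain why DOB and bankaccount are modified in your expected output when they have no "Yes" flag in your lookup file, or fix whichever of your input or expected output is wrong.
Why is awk your only option? Why can't you use an XML-aware tool for this job?
@MadsHansen It looks like the XML is embedded in non-XML plain text on each line, and that plain text also has to be reproduced in the output. Is there an XML-aware tool that can handle that input format? I suppose we could use awk to isolate the XML part from the rest and then spawn a subprocess that calls an XML-aware tool on the XML part, but then we are already using awk to identify what is and is not XML. It would also add complexity: making sure the XML-aware tool produces all of its output on a single line, reading that back into awk to put it back into the line, and so on. And it would be very slow.
Besides, can any XML-aware tool do the kind of masking the OP is asking for, where it has to check whether each tag is present in another file and, if so, change the value by replacing every 2nd and 3rd character (repeating) with *? I really don't know, but my gut feeling, especially given such simple, rigid XML input, is that doing it all in one call to awk really does make sense.
Hi Felicity, I rejected your edit to my answer because it replaced the answer with a very different gawk solution. I understand that solution is shorter and may work better for you, but mine may be useful to people who want a POSIX (portable) awk solution to a similar problem. If the question has not been closed, I suggest you post your edit as an answer and accept it (you can accept your own answer after 2 days).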

Tags: xml substitution

【Solution 1】:

Using GNU awk for the 3rd arg to match() and for gensub():

$ cat tst.awk
BEGIN { FS="," }
NR==FNR {
    if ( $2 == "Yes" ) {
        mask[$1]
    }
    next
}
match($0,"(.*<sometag[^<>]+)(.*)(</sometag>.*)",rec) {
    bef = rec[1]
    xml = rec[2]
    aft = rec[3]

    $0 = bef
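    # consume xml one <tag>value</tag> element at a time
    while ( match(xml,"(<([^>]+)>)([^<]*)(<[^>]+>)",tagVals) ) {
        tag = tagVals[2]
        if ( tag in mask ) {
            val = tagVals[3]
            # pad with 2 blanks so a trailing group of 1 or 2 chars is also masked
            masked = gensub(/(.)../,"\\1**","g",val"  ")
            tagVals[3] = substr(masked,1,length(val))
        }
        $0 = $0 tagVals[1] tagVals[3] tagVals[4]
        # continue with the text right after the matched element
        xml = substr(xml,tagVals[0,"start"]+tagVals[0,"length"])
    }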
    $0 = $0 aft
}
{ print }

$ awk -f tst.awk lookup.txt file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>1**4**7**</phone><DOB>12-09-77</DOB><bankaccount>4563728495847</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

The above assumes that the expected output you posted is wrong and that you do not really expect the DOB or bankaccount values to change.
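The masking step above relies on gensub(), which is specific to GNU awk. As a rough illustration of the same rule (keep the 1st, 4th, 7th, ... character and replace the rest with *), here is a minimal sketch that uses only POSIX awk features, wrapped in a shell function. The name mask_thirds is hypothetical and is not part of the answer's script.

```shell
# mask_thirds VALUE: keep characters 1, 4, 7, ... and replace the rest with "*".
# Hypothetical helper for illustration only.
mask_thirds() {
    awk -v val="$1" 'BEGIN {
        out = ""
        for (i = 1; i <= length(val); i++) {
            # positions 1, 4, 7, ... satisfy i % 3 == 1
            out = out ((i % 3 == 1) ? substr(val, i, 1) : "*")
        }
        print out
    }'
}

mask_thirds sam         # prints s**
mask_thirds 98762123    # prints 9**6**2*
```

A loop over character positions avoids the padding trick that the gensub() version needs for values whose length is not a multiple of 3.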

【Comments】:

【Solution 2】:

POSIX awk:

$ cat mask-list
name         1
phone        10010010
DOB          00100100
bankaccount  1001100100100

$ cat mask.sh
awk '
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }

        str_masked = str_masked c
    }

    return str_masked
}

FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}

FNR != NR {
    line = $0

    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])

        while (match(line, regex) > 0) {

            # extract tag value and mask it
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"

            # append the line portion before tag, and the masked tag
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked

            # truncate line start to end of matched tag, for next match
            line = substr(line, RSTART + RLENGTH)
        }

        line = masked_line line
    }

    print line
}' "$@"

Example:

$ sh mask.sh mask-list log-file
2022-02-08 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>9**6**2*</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-09 [this is the actual log file] BLAH - This is how my xml log file is <sometag id="00000-00000"<name>s**</name><phone>1**4**7**</phone><DOB>**-**-**</DOB><bankaccount>4**37**4**8**</bankaccount></sometag>
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH
2022-02-08 [this is the actual log file] BLAH

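Explanation:

The mask file is read first to get the tags to mask and their mask patterns.
For the log file, match() and substr() are used to capture each relevant <tag>val</tag> substring in a variable, and substr() is used again to get val, so that it can be passed to a function that masks it according to the current tag's pattern.
The line is then reassembled, and the process is repeated for each tag.
The mask list holds the mask pattern for each tag: 1 means show, and any other character (or none, see name) means hide. You can add more entries.
The mask() function has one unused parameter, str_masked, which exists only to keep that variable local in scope.
substr() is used to compare the string and the mask pattern one character at a time.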
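The two substr() steps described above can be tried in isolation. The following is a minimal sketch, assuming POSIX awk and sh; tag_value and apply_mask are hypothetical names, and apply_mask takes the pattern as an argument instead of reading the global masks[i] as the answer's mask() does.

```shell
# tag_value LINE TAG: print the text between <TAG> and </TAG> in LINE.
tag_value() {
    awk -v line="$1" -v tag="$2" 'BEGIN {
        l = length(tag)
        if (match(line, "<" tag ">[^<]+</" tag ">")) {
            # "<tag>" is l+2 chars and "</tag>" is l+3 chars, 2*l+5 in total
            print substr(line, RSTART + l + 2, RLENGTH - 2*l - 5)
        }
    }'
}

# apply_mask VALUE PATTERN: show VALUE characters where PATTERN has "1",
# replace every other position (also those past the end of PATTERN) with "*".
apply_mask() {
    awk -v val="$1" -v pat="$2" '
    # j, c and out are extra parameters, used only to keep them local
    function mask(str, pattern,    j, c, out) {
        for (j = 1; j <= length(str); j++) {
            c = (substr(pattern, j, 1) == "1") ? substr(str, j, 1) : "*"
            out = out c
        }
        return out
    }
    BEGIN { print mask(val, pat) }'
}

tag_value '<name>sam</name><phone>98762123</phone>' phone   # prints 98762123
apply_mask 98762123 10010010                                # prints 9**6**2*
apply_mask 12-09-77 00100100                                # prints **-**-**
```

Listing j, c and out as extra function parameters is the same scoping idiom the answer uses for str_masked, applied here to every temporary variable.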

【Comments】: