【Question title】: Extract XML chunks from a plain-text log file
【Posted】: 2015-04-03 15:36:15
【Question】:

I have a log containing SOAP request/response entries:

[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,
                   RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
                   uid:0de7d51a-abb6-11e4-a436-005056936d96,
                   ===

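I want to extract all of the XML into one large XML file (pull out the chunks and wrap them in a root ... tag). I also need the date of each log record.

I would like to end up with a result like this (I can add the root xmlns attributes by hand):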

<Records xmlns="" ...>
    <Record datetime="2015-02-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body>
            <!-- Other xml data -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    ...
</Records>

【Question discussion】:

    Tags: regex xml linux sed grep


    【Solution 1】:

    You can do this with awk.

    For example, create a file named awkscript containing the following code:

    BEGIN { print "<Records>" }
    # A record starts with a "[YYYY-MM-DD HH:MM:SS]" line ({n} intervals need a POSIX-compliant awk).
    /^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\]/ {
        # Characters 2..20 of the line are the date and time inside the brackets.
        print "\t<Record datetime=\"" substr($0, 2, 19) "\">"
        # Skip ahead to the first envelope line; stop at end of input.
        while ((getline) > 0 && $0 !~ /^<\/?SOAP-ENV:/) {}
        # Copy lines that start with a SOAP-ENV tag until one does not.
        while ($0 ~ /^<\/?SOAP-ENV:/) { print "\t\t" $0; if ((getline) <= 0) break }
        print "\t</Record>"
    }
    END { print "</Records>" }
    

    Run the script on your file from the shell:

    awk -f path_to_awkscript path_to_log_file > path_to_new_file
    

    Example

    Running the script on a log file with the following data:

    [2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,
                       RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
    </SOAP-ENV:Body></SOAP-ENV:Envelope>
    ,
                       uid:0de7d51a-abb6-11e4-a436-005056936d96,
                       ===
    
    [2014-11-03 19:05:13] TIME:03.02.2015 19:05:13,
                       RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
    </SOAP-ENV:Body></SOAP-ENV:Envelope>
    ,
                       uid:0de7d51a-abb6-11e4-a436-005056936d96,
                       ===
    
    
    [2014-12-15 19:05:13] TIME:03.02.2015 19:05:13,
                       RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
    </SOAP-ENV:Body></SOAP-ENV:Envelope>
    ,
                       uid:0de7d51a-abb6-11e4-a436-005056936d96,
                       ===
    
    </SOAP-ENV:Body></SOAP-ENV:Envelope>
    

    Result

    <Records>
        <Record datetime="2015-02-03 19:05:13">
            <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
            </SOAP-ENV:Body></SOAP-ENV:Envelope>
        </Record>
        <Record datetime="2014-11-03 19:05:13">
            <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
            </SOAP-ENV:Body></SOAP-ENV:Envelope>
        </Record>
        <Record datetime="2014-12-15 19:05:13">
            <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
            </SOAP-ENV:Body></SOAP-ENV:Envelope>
        </Record>
    </Records>
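
Since the goal is a single XML document, it can be worth checking that the generated file actually parses before using it. The following is a minimal sketch in Python using the standard-library `xml.etree.ElementTree`; the `SAMPLE` string and the `record_datetimes` helper are hypothetical names introduced here for illustration, and the sample assumes a root element without a default `xmlns` (with one, the `Record` lookup would need the namespace-qualified tag).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the Records/Record layout asked for in the question;
# in practice the text would be read from the generated file.
SAMPLE = """<Records>
    <Record datetime="2015-02-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
</Records>"""


def record_datetimes(xml_text):
    """Parse the combined document and return each Record's datetime attribute."""
    # fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    return [record.get("datetime") for record in root.findall("Record")]


print(record_datetimes(SAMPLE))  # ['2015-02-03 19:05:13']
```

A parse error here usually points at an unterminated root tag or at an envelope that was cut off because one of its lines did not start with a `SOAP-ENV` tag.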
    

    【Discussion】:

      【Solution 2】:

      I could not find a solution using Linux console tools such as grep or sed, so I wrote a Python script.

      import sys
      import re
      
      
      def write_xml_log(out_path, lines):
          u"""
          Joins xml chunks into one document.
          """
          # The with-block closes the file even if the chunk generator raises.
          with open(out_path, 'w') as out_fh:
              out_fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
              out_fh.write('<LogRecords>\n')
              out_fh.writelines(
                  '<LogRecord>\n{}\n</LogRecord>\n'.format(line) for line in lines)
              out_fh.write('</LogRecords>')
      
      
      def prepare_xml_chunks(log_path):
          u"""
          Prepares xml-chunks.
          """
          log_fh = open(log_path)
      
          # Raw strings keep the regex backslashes from being read as string escapes.
          record_date_re = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]')
          envelope_start_re = re.compile(r'(<(?:[\w_-]+:)?Envelope)(.*)$')
          envelope_end_re = re.compile(r'(.*</(?:[\w_-]+:)?Envelope>)')
          envelope_complete_re = re.compile(
              r'(<(?:[\w_-]+:)?Envelope)(.*?>.*?</(?:[\w_-]+:)?Envelope>)')
      
          record_date = ''
          record_envelope = ''
          state_in_envelope = False
      
          for line in log_fh:
              match_date = record_date_re.match(line)
              match_envelope_start = envelope_start_re.match(line)
              match_envelope_end = envelope_end_re.match(line)
              match_envelope_complete = envelope_complete_re.match(line)
      
              if match_date:
                  record_date = match_date.group(1)
      
              if not state_in_envelope:
                  # One-line envelope
                  if match_envelope_complete:
                      state_in_envelope = False
                      record_envelope = ''
      
                      yield '{} datetime="{}" {}\n'.format(
                          match_envelope_complete.group(1),
                          record_date,
                          match_envelope_complete.group(2))
      
                  # Multi-line envelope start.
                  elif match_envelope_start:
                      state_in_envelope = True
                      record_envelope = '{} datetime="{}" {}\n'.format(
                          match_envelope_start.group(1),
                          record_date,
                          match_envelope_start.group(2))
      
                  # Problem situation.
                  elif match_envelope_end:
                      raise Exception('Envelope close tag without open tag.')
              else:
                  # Multi-line envelope continue.
                  if not match_envelope_end:
                      record_envelope += line
      
                  # Multi-line envelope end.
                  else:
                      record_envelope += match_envelope_end.group(1)
                      yield '{}\n'.format(record_envelope)
      
                      record_envelope = ''
                      state_in_envelope = False
      
          log_fh.close()
      
      
      write_xml_log(sys.argv[2], prepare_xml_chunks(sys.argv[1]))
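
To make it easier to see how the script classifies lines, here is a small sketch that applies the same regular expressions to sample lines taken from the log excerpt in the question (the envelope attributes are shortened, and the variable names `date_line`, `start_line` and `end_line` are introduced here for illustration only). Note that the script stores the date as a `datetime` attribute on the `Envelope` element itself rather than on a separate wrapper element.

```python
import re

# Same patterns as in the script above, written as raw strings.
record_date_re = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]')
envelope_start_re = re.compile(r'(<(?:[\w_-]+:)?Envelope)(.*)$')
envelope_end_re = re.compile(r'(.*</(?:[\w_-]+:)?Envelope>)')

# Sample lines modelled on the log excerpt (envelope attributes shortened).
date_line = '[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,\n'
start_line = '<SOAP-ENV:Envelope xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->\n'
end_line = '</SOAP-ENV:Body></SOAP-ENV:Envelope>\n'

record_date = record_date_re.match(date_line).group(1)
start = envelope_start_re.match(start_line)
end = envelope_end_re.match(end_line)

# Rebuild the envelope the way the script does: inject the datetime
# attribute right after the opening tag name, then append the closing line.
envelope = '{} datetime="{}" {}\n'.format(
    start.group(1), record_date, start.group(2))
envelope += end.group(1)
print(envelope)
```

Because the script uses `match`, an envelope is only recognised when its opening tag is the first thing on the line, which holds for the log format shown in the question.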
      

      【Discussion】:
