Remove some lines in a log file: answers

【Question title】: Remove some lines in a log file
【Posted】: 2017-01-02 01:26:41
【Question】:

I have a large log file.

After stripping the timestamp from every line, I sorted it with cat logfile | sort -u > logfile, so the log is now clean and tidy:

failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
.
. (lines not shown here)
.
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
.
.
. (lines not shown here)
.
.
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

I can get the logged items (e.g. PL.HSPB in the example above) with

grep -oE " [0-9A-Z]*\.[0-9A-Z]*" logfile | sort -u

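However, I also want to keep the date information, and to make the log clearer I want to remove the intermediate lines. For example,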

failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

becomes, after removal,

failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

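That is, for each item, keep only the first and the last line (the numbers are the year and the Julian day, i.e. the day of the year).

Is there a shell command that can do this easily?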

【Comments】:

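@shellter The dates in the log file, such as 2014.365, are 'year.jday'. There is no need to compute the Julian date from the month and day.
Doh, I missed that Y.Jday was already in your original data. Good luck.
Why not just add another grep to the pipeline that produces your current output? grep "^failed to correct" logfile | grep -oE " [0-9A-Z]*\.[0-9A-Z]*" | sort -u ? Good luck.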
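
As the comments above note, the date field has the form 'year.jday'. A minimal sketch of turning such a field into a calendar date, assuming jday means the day of the year (1 to 366); the function name jday_to_date is made up for this illustration:

```python
from datetime import datetime

def jday_to_date(stamp):
    """Convert a 'YYYY.DDD' string (year and day of year) to a date."""
    # %Y is the four-digit year, %j is the day of the year.
    return datetime.strptime(stamp, "%Y.%j").date()
```

For instance, jday_to_date("2014.365") gives the last day of 2014, since 2014 is not a leap year.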

Tags: shell logging text text-processing

【Solution 1】:

Script:

$ cat hhz.py
#!/usr/bin/env python3

import re
import sys
from collections import OrderedDict

# Key: the line with its year.jday removed; value: first/last full line seen.
firsts = OrderedDict()
lasts  = OrderedDict()

# Split each line into: text up to "HHZ.", the "YYYY.DDD" date, and the rest.
pattern = re.compile(r"(.*HHZ\.)([0-9]{4}\.[0-9]+)( .*)")

for line in sys.stdin:
  line = line.rstrip("\n")

  x = pattern.match(line)
  if x is None:
    continue  # skip lines that do not carry an HHZ date

  before = x.group(1)
  after  = x.group(3)
  # The line minus its date identifies the group (item + reason).
  undated = before + after

  if undated not in firsts:
    firsts[undated] = line
  lasts[undated] = line

for undated in firsts:
  first = firsts[undated]
  last  = lasts[undated]
  print(first)
  if first != last:
    print(last)

Input:

$ cat hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

Output:

$ hhz.py < hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else

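Lines are grouped by a regular expression that isolates the date part. undated (the line with its date removed) is the unique name of each group.
The first line of a group is obtained by putting it into the ordered dict only if the key is not set yet.
The last line of a group is obtained by putting it into the ordered dict unconditionally.
OrderedDict is used to preserve the order of the input file (if you do not need that, use a plain dict; since Python 3.7 a plain dict keeps insertion order as well).
first != last is checked so that a group with only one line is not printed twice.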
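
The points above can also be sketched as a single reusable function. This is only an illustration under a few assumptions: every dated line has the layout "...HHZ.YYYY.DDD reason" seen in the sample log, the interpreter is Python 3.7 or later (so a plain dict keeps insertion order), and the names DATED and first_and_last are invented for this example.

```python
import re

# Assumed line layout: "<text>HHZ.<YYYY>.<DDD> <reason>", as in the sample log.
DATED = re.compile(r"(.*HHZ\.)(\d{4}\.\d+)( .*)")

def first_and_last(lines):
    """Return the first and last line of each group, in input order."""
    groups = {}  # undated text -> [first line, last line]
    for line in lines:
        line = line.rstrip("\n")
        m = DATED.match(line)
        if m is None:
            continue  # ignore lines without a date field
        undated = m.group(1) + m.group(3)
        if undated in groups:
            groups[undated][1] = line  # always overwrite the last line
        else:
            groups[undated] = [line, line]  # first sighting of this group
    result = []
    for first, last in groups.values():
        result.append(first)
        if first != last:  # a one-line group is emitted only once
            result.append(last)
    return result
```

Because any iterable of strings is accepted, it can be fed a file directly, e.g. print("\n".join(first_and_last(open("hhz.dat")))). As in the script, a group is defined by the item plus the failure reason, so the same item with two different reasons yields two groups.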

【Comments】: