Combining Log Files and Sorting by Time in Python

【Question title】: Combining Log Files and Sorting by Time in Python
【Posted】: 2015-08-10 19:02:09
【Question】:

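First off, I am very new to Python and to programming in general, so please bear with me if this is an obvious question.

I have an undefined number (probably more than 10) of log files mixed in with other random files in a directory, and I need to combine them into a single file whose lines are sorted by the timestamp at the start of each line. The log files are .txt, and there are other non-log .txt files in the same directory, so I will have the user of the script pass each log file as an argument.

Now, before you mark this as a duplicate: I went through 4 pages of search results here, and none of the questions had an answer I could use.

So far I have the following working Python code: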

log_file_name = 'logfile.txt'

import sys
import fileinput
from Tkinter import Tk
from tkFileDialog import askopenfilenames

logfile = open(log_file_name, 'w+')
logfile.truncate()
logfile.seek(0)

# get list of file names
print "Opening File Dialog"
Tk().withdraw()
files = askopenfilenames(title='Select all logs you would like to compile.')

for index in range(len(files)):
    print "Loop ", index
    print "--- Debug message: Reading a file... ---"
    logdata = open((files[index])).readlines()
    print "--- Debug message: Finished reading. Writing a file... ---"
    # turns logdata into a string and writes it to logfile
    logfile.write(''.join(logdata))
    logfile.write("\n")

print ""
print "Exited for loop."
logfile.close()

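The code above puts the contents of all the files you select into one text file, but it does not sort them.

I am thinking of using a regular expression to find the number inside the brackets and then sorting each line by it...?

Here is some sample log file content.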

[xx.xxxxxx] [Text] Text : Text: xxx
[xx.xxxxxx] [Text] Text : Text: xxx
[xx.xxxxxx] [Text] Some text.
There could be multiple lines of text here
These lines could include [brackets.] :(

[xx.xxxxxx] [Text] Text : Text: xxx

[xx.xxxxxx] is a timestamp, in seconds since system startup.
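
The regular-expression idea mentioned above could look roughly like the sketch below. It assumes the timestamp is a decimal number in square brackets at the very start of a line, as in the sample; the names TIMESTAMP_RE and parse_timestamp are made up for illustration, and the code is written for Python 3.

```python
import re

# Assumed format: "[seconds.fraction]" at the very start of the line.
# Optional padding after the opening bracket is tolerated.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+(?:\.\d+)?)\]")

def parse_timestamp(line):
    """Return the leading timestamp as a float, or None if the line has none."""
    match = TIMESTAMP_RE.match(line)
    return float(match.group(1)) if match else None

# A time-stamped line yields a number; a continuation line yields None,
# even when it contains brackets further along.
print(parse_timestamp("[12.345678] [Text] Some text."))
print(parse_timestamp("These lines could include [brackets.] :("))
```

Anchoring the pattern at the start of the line is what keeps brackets inside the message text from being mistaken for a timestamp.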

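【Comments】:

What is the layout of the log files (where is the time stamp), and how large is the resulting file? If the resulting file fits easily in memory, you can use a simple sort. If not, you have to divide the records and sort the first group (e.g. the earliest time stamps), write it to the file, sort the next group, write it, and so on.
@CurlyJoe I edited my question to add some sample log text.
@CurlyJoe Loading all of the logs into memory is perfectly fine.

Tags: python sorting file-io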

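【Solution 1】:

Since the time stamp is at the beginning of each record, you can simply sort. If that takes too long, you may want to sort each log file as it is read and then merge the sorted files into the final list.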

import pprint

file_1="""[92.5] Text Text : Text: xxx
[91.5] Text Text : Text: xxx"""

file_2="""[91.7] [Text] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also. 
[90.5] [Text] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also."""

## Write data to some test log files
with open("./log_1.txt", "w") as fp_out:
    fp_out.write(file_1)
with open("./log_2.txt", "w") as fp_out:
    fp_out.write(file_2)

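def input_recs(f_name):
    with open(f_name, "r") as fp_in:
        recs=fp_in.readlines()
    ## assume you want to omit text only lines
    ## (rec[1:2] instead of rec[1] avoids an IndexError on blank lines)
    return_list=[rec.strip() for rec in recs if rec.startswith("[") and rec[1:2].isdigit()]
    return return_list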

sorted_list=[]
for f_name in ["log_1.txt", "log_2.txt"]:
    recs=input_recs(f_name)
    sorted_list.extend(recs)

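sorted_list.sort(key=lambda rec: float(rec[1:rec.index("]")]))  ## numeric sort; as text, "[100.5]" would sort before "[92.5]"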
pprint.pprint(sorted_list)
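
The suggestion above to sort each file on input and then merge can be sketched with heapq.merge from the standard library, which merges inputs that are already sorted. This is only an illustration, assuming every record starts with "[seconds]"; time_stamp and merge_sorted are hypothetical names, and the code targets Python 3 (the key argument of heapq.merge needs 3.5 or later).

```python
import heapq

def time_stamp(rec):
    """Numeric sort key: the value between the leading "[" and the first "]"."""
    return float(rec[1:rec.index("]")])

def merge_sorted(record_lists):
    """Sort each list by time stamp, then merge the sorted lists into one."""
    sorted_lists = [sorted(recs, key=time_stamp) for recs in record_lists]
    return list(heapq.merge(*sorted_lists, key=time_stamp))

# Made-up records; one time stamp is above 100 to show the numeric ordering.
log_1 = ["[92.5] Text Text : Text: xxx", "[91.5] Text Text : Text: xxx"]
log_2 = ["[100.2] [Text] Some text.", "[90.5] [Text] Some text."]
for rec in merge_sorted([log_1, log_2]):
    print(rec)
```

When everything fits in memory, a single sort with the same key is just as simple; the merge mainly pays off when each file is already in time order and can be streamed.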

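【Comments】:

I want to keep the entire contents of the logs intact, and the script needs to handle an unlimited number of arguments (which will be the log file names); otherwise this looks good. I'll try it tomorrow. :)
Take a look at stackoverflow.com/questions/3579568/… for choosing files with Tkinter (there are many other examples on the web).
Wow, that is much better than having the user type in the file names, thanks!
Could you please take a look at my updated code, @CurlyJoe?
I couldn't get your code to work. I'm very new to this, so I apologize if it's something silly. If you need it, I can send you the error message tomorrow.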
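
For the requirement mentioned above, that the script accept any number of log file names, one alternative to a file dialog is argparse with nargs="+". A minimal sketch for Python 3; the argument names (logs, --output) and the parse_args wrapper are hypothetical, and the default output name is simply the one used in the question's code.

```python
import argparse

def parse_args(argv=None):
    """Parse one or more log file names plus an optional output file name."""
    parser = argparse.ArgumentParser(description="Merge log files by time stamp.")
    parser.add_argument("logs", nargs="+", help="log files to merge")
    parser.add_argument("-o", "--output", default="logfile.txt",
                        help="name of the merged file")
    return parser.parse_args(argv)

# Passing a list stands in for the command line here; in a real script,
# call parse_args() with no argument so that sys.argv is read instead.
args = parse_args(["log_1.txt", "log_2.txt", "-o", "merged.txt"])
print(args.logs, args.output)
```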

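【Solution 2】:

When you don't get good answers, it means you have not asked a good question. What does "identify each message, then sort the messages, not each line" mean? To illustrate how to do this in general, I will assume that you want lines without a time stamp to be kept with the preceding time-stamped line. You have to get the data into a form that can be sorted on some field of each record. There are two ways to do this: a dictionary or a list. The following uses a list and simply appends the records without a time stamp to the preceding time-stamped record, so every record starts with a time stamp and the list can be sorted. At this point you should understand the general principles involved.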

file_1="""[92.5] [Text1] Text : Text: xxx
[91.5] [Text2] Text : Text: xxx
[92.5] [Text2.5] Some text.
[90.5] [Text3] Some text"""

file_2="""[91.7] [Text4] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also. 
[90.5] [Text5] Some text.
Some text of variable size, may be on multiple lines. Number of lines is variable also."""

## Write data to some test log files
with open("./log_1.txt", "w") as fp_out:
    fp_out.write(file_1)
with open("./log_2.txt", "w") as fp_out:
    fp_out.write(file_2)

def input_recs(f_name):
    return_list=[]
    append_rec=""
    with open(f_name, "r") as fp_in:
        for rec in fp_in:
            ## rec[1:2] instead of rec[1] avoids an IndexError on blank lines
            if rec.startswith("[") and rec[1:2].isdigit():
                ## next time stamp so add append_rec to return_list and
                ## create a new append_rec that contains this record
                if len(append_rec): 
                    return_list.append(append_rec)
                append_rec=rec
            else:
                append_rec += rec  ## not a time stamp

    ## add last rec
    if len(append_rec): 
        return_list.append(append_rec)

    return return_list

sorted_list=[]
for f_name in ["log_1.txt", "log_2.txt"]:
    recs_list=input_recs(f_name)
    sorted_list.extend(recs_list)

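def time_stamp(rec):
    ## numeric sort key: the seconds value between the first "[" and "]"
    return float(rec[1:rec.index("]")])

sorted_list.sort(key=time_stamp)  ## a plain sort() compares text, so "[100.5]" < "[92.5]"
import pprint
pprint.pprint(sorted_list)  ## newlines are retained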
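
The code above only prints the sorted records. Writing them to a single output file, which is what the question asks for, might look like the following sketch for Python 3. It repeats the grouping idea with a regular expression instead of rec[1].isdigit(); the pattern and the names read_records, record_time and merge_logs are assumptions for illustration, not part of the answer.

```python
import re

# Assumed record start: "[seconds.fraction]" at the beginning of a line.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+(?:\.\d+)?)\]")

def read_records(path):
    """Group each time-stamped line with the continuation lines that follow it."""
    records = []
    with open(path, "r") as fp_in:
        for line in fp_in:
            if not line.endswith("\n"):
                line += "\n"  # keep a file's last line from fusing with the next record
            if TIMESTAMP_RE.match(line) or not records:
                records.append(line)   # start of a new record
            else:
                records[-1] += line    # continuation of the previous record
    return records

def record_time(record):
    """Sort key; text that precedes any time stamp sorts first."""
    match = TIMESTAMP_RE.match(record)
    return float(match.group(1)) if match else float("-inf")

def merge_logs(paths, out_path):
    """Read all files, sort the records numerically, and write them to out_path."""
    records = []
    for path in paths:
        records.extend(read_records(path))
    records.sort(key=record_time)  # stable: equal time stamps keep input order
    with open(out_path, "w") as fp_out:
        fp_out.writelines(records)
    return records
```

With the test files created above, a call such as merge_logs(["log_1.txt", "log_2.txt"], "logfile.txt") would write the merged log; the output name is the one used in the question.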

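【Comments】:

What I mean by identifying messages rather than lines is that each message may contain several newlines, so I can't just sort the lines.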