Arranging data in columns using awk: answers

【Question title】: Arranging data in columns using awk
【Posted】: 2021-04-23 14:22:17
【Question description】:

I have 300 .out output files that I need to extract data from. The data is typically stored in them like this:

PROPERTY 1:   1234
lines 
of 
unimportant text
PROPERTY 2: 1334
lines 
of 
unimportant text
PROPERTY 3: 1237
.
.
.
PROPERTY N: 7592

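I have 300 such files.

I want to extract the data from these files and arrange it in neat columns: one column for all the PROPERTY 1 data points, one for PROPERTY 2, ..., and one for PROPERTY N. The final goal is to process the data further using python and pandas.

I am using awk to extract this data.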

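I have two ways of doing this, but each has problems. Approach 1:

awk '/PROPERTY 1/{p1=$NF; } /PROPERTY 2/{p2=$NF} /PROPERTY 3/... {pn=$NF; print p1, p2, p3,...}' *.out

This approach has two problems. First, I can extract the individual data points and store them in a file, but it makes for a very long program. Second, if the positions of PROPERTY 1 and PROPERTY 2 are swapped in a file, this code gives erroneous output, i.e. PROPERTY 1 from outputfile1.out shows up in row 2 instead of row 1. How can I make the output reliable?

My second approach is to simply write the values out to separate files and then join them together using python. Is there a way to extract a column from file1 and paste it next to the column in file2 using awk?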

Sample input files:

first.out:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
PROPERTY 1:    1234

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit
PROPERTY 2:    9800

At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.

PROPERTY 4:   823586

On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain.

PROPERTY 3:   328497
.
.
.

second.out:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
PROPERTY 1:    1

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit
PROPERTY 2:    2

At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga.

PROPERTY 3:   3

On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain.

PROPERTY 4:   4
.
.
.

Every file contains all of the properties.

Expected output file: data.txt

1234  9800  823586  328497 ...
1  2  3  4 
.
.
.

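I am trying to optimize my code, and awk seems to be extremely fast. Any suggestions would be greatly appreciated!

【Question discussion】:

Could you please add clearer sample output to your question so that we can understand it better? Thank you.
Is your goal a pandas.DataFrame with PROPERTY i in column i, and one row of values for each of the 300 files?
Yes @crissal, that is the goal.
@RavinderSingh13 I have added a sample input file and a sample output file.
Why tag this with awk when "the final goal is to use python and pandas"? If the final goal is to use python, then just use python.

Tags: python awk python-textprocessing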

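【Solution 1】:

Using GNU awk for ENDFILE, and assuming you have a specific subset of PROPERTY tags that you want printed but not all of them are present in every file (the example you posted isn't clear on that, nor on whether all properties start with PROPERTY, etc.):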

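$ cat tst.awk
BEGIN {
    # Tags to print, in output column order
    numTags = split("PROPERTY 1,PROPERTY 2,PROPERTY 3,PROPERTY 4",tags,/,/)
}
{
    # Treat everything before the first colon as the tag and
    # remember the last field of the line as its value
    tag = $0
    sub(/:.*/,"",tag)
    f[tag] = $NF
}
ENDFILE {
    # At the end of each input file, print one row in tag order
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = f[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    delete f    # start the next file with no stored values
}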

$ awk -f tst.awk first second
1234 9800 328497 823586
1 2 3 4
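
Since the data is headed for Python anyway, the same per-file logic can be sketched there too. This is a minimal illustration rather than a port of the script above: the function name `extract_row` and the regular expression are assumptions based on the sample files, and unlike the awk version it only looks at lines that begin with `PROPERTY <number>:`.

```python
import re

# Matches lines such as "PROPERTY 3:   328497". The pattern is an assumption
# derived from the sample input in the question.
PROPERTY_RE = re.compile(r"^(PROPERTY\s+\d+)\s*:\s*(\S+)\s*$")


def extract_row(text, tags):
    """Return the values for `tags`, in that order, from one file's text."""
    found = {}
    for line in text.splitlines():
        m = PROPERTY_RE.match(line)
        if m:
            # Normalise internal whitespace so "PROPERTY   1" equals "PROPERTY 1"
            tag = " ".join(m.group(1).split())
            found[tag] = m.group(2)  # the last occurrence in the file wins
    # Missing tags become empty strings so every row has the same width
    return [found.get(tag, "") for tag in tags]


# PROPERTY 4 appears before PROPERTY 3 here, as in first.out
sample = """some unimportant text
PROPERTY 1:    1234
more text
PROPERTY 2:    9800
PROPERTY 4:   823586
PROPERTY 3:   328497
"""
tags = ["PROPERTY 1", "PROPERTY 2", "PROPERTY 3", "PROPERTY 4"]
print(" ".join(extract_row(sample, tags)))
```

Because the values are looked up by tag name when the row is built, the order in which the properties appear inside a file does not affect the column order.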

【Discussion】:

【Solution 2】:

I would parse it line by line:

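import re

# Capture the property number and its value (trailing whitespace excluded)
RE_PROPERTY = re.compile(r"^PROPERTY\s*([0-9]+)\s*:\s*(.*?)\s*$")

columns = {}

with open("data.out", "r") as f:
    for line in f:  # iterate over the file lazily, one line at a time
        m = RE_PROPERTY.match(line)
        if m:
            key = f"PROPERTY {m.group(1)}"
            value = m.group(2)
            # Append the value to this property's column, creating it on first use
            columns.setdefault(key, []).append(value)

print(columns)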
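
To cover many files and get one row per file, the line-by-line idea could be extended roughly as follows. This is a sketch under assumptions: the helper names `read_properties` and `write_table` are hypothetical, the output is written as CSV with a header row (not the space-separated data.txt from the question), and a property missing from a file is left as an empty cell.

```python
import csv
import re

# Assumed line format, taken from the sample files: "PROPERTY <n>:  <value>"
PROPERTY_RE = re.compile(r"^PROPERTY\s+(\d+)\s*:\s*(\S+)\s*$")


def read_properties(path):
    """Map property number -> value string for one output file."""
    values = {}
    with open(path, "r") as f:
        for line in f:
            m = PROPERTY_RE.match(line)
            if m:
                values[int(m.group(1))] = m.group(2)
    return values


def write_table(paths, out_path):
    """Write one row per input file and one column per property number."""
    rows = [read_properties(p) for p in paths]
    # Union of the property numbers seen in any file, sorted numerically
    columns = sorted(set().union(*rows))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow([f"PROPERTY {n}" for n in columns])
        for row in rows:
            # Look each value up by number, so in-file order is irrelevant
            writer.writerow([row.get(n, "") for n in columns])
    return columns
```

A call such as `write_table(sorted(glob.glob("*.out")), "data.csv")` (after `import glob`) would process every .out file in the current directory in lexicographic name order, and the resulting file could then be loaded with `pandas.read_csv("data.csv")`.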

【Discussion】: