Extracting string-matching files from a list: answers

【Question title】: Extract string matching file from the list
【Posted】: 2020-09-30 00:01:52
【Question description】:

I have a list named file.out that contains entries like these:

file                                                            a    b    c    d   e
DS_swe/msg.rti-20160510_5_1.0_rnt.txt-20190415_8_2.0_rnt.txt  0.5  1.0  1.5  1.3 2.0
DS_swe/msg.rti-20105510_5_1.0_rnt.txt-20200415_8_2.0_rnt.txt  0.6  2.0  2.5  1.2 4.0
DS_swe/msg.rti-20190510_5_1.0_rnt.txt-20250415_8_2.0_rnt.txt  0.2  8.0  3.5  1.1 6.0
DS_swe/msg.rti-20102510_5_1.0_rnt.txt-20240415_8_2.0_rnt.txt  0.1  2.5  1.2  1.0 8.0
DS_swe/msg.rti-20145510_5_1.0_rnt.txt-20140415_8_2.0_rnt.txt  0.8  2.2  1.4  1.9 5.0

I also have a directory named data that contains similarly named files:

data/
├── 20160510_5_1.0_rnt.txt
├── 20105510_5_1.0_rnt.txt
├── 20190510_5_1.0_rnt.txt
├── 20102510_5_1.0_rnt.txt
└── 20145510_5_1.0_rnt.txt

These file names match part of each entry listed above, for example the part marked with ? characters here:

DS_swe/msg.rti-????????_?_?_???.???-20190415_8_2.0_rnt.txt 0.5  1.0  1.5  1.3 2.0.

In addition, every .txt file in that directory contains 4 lines like the ones below. For example, 20160510_5_1.0_rnt.txt contains:

20.0  23.0  25.0  45.0  78.0  sy
14.0  12.0  24.0  45.0  78.0  tx
14.0  25.0  25.0  47.0  78.0  mx
12.0  25.0  32.0  47.0  56.0  cx

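What I need to do is this: if a .txt file in the directory matches the part of a list entry marked with ? characters, I want to extract columns 3 and 4 from that matching .txt file, take the column 5 and column 6 values of the corresponding entry in the list (file.out), append those same two values to every extracted line, and finally save the result under the same .txt file name in a different directory named results.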

For example, the expected output for the file 20160510_5_1.0_rnt.txt is:

25.0  45.0  1.3  2.0
24.0  45.0  1.3  2.0
25.0  47.0  1.3  2.0
32.0  47.0  1.3  2.0

I tried the following code to solve this, but I am stuck on the main part and need expert help. Thanks in advance.

#!/bin/sh
for file in /home/lijun/data/*.txt
    grep "*.txt" file.out > file
    cat file | if

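【Question comments】:

This sprawls in several directions. In future, please try to focus on one problem at a time. See also the guidance on reducing your problem to a minimal reproducible example.

Tags: bash awk sed grep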

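【Solution 1】:

Updated to include input/output directories, based on the for loop in the OP's example.

A (somewhat) verbose solution ...

Use awk to extract the file name and fields 5 and 6 from file.out: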

$ awk '{ split($1,fn,"-"); print fn[2],$5,$6 }' file.out
20160510_5_1.0_rnt.txt 1.3 2.0
20105510_5_1.0_rnt.txt 1.2 4.0
20190510_5_1.0_rnt.txt 1.1 6.0
20102510_5_1.0_rnt.txt 1.0 8.0
20145510_5_1.0_rnt.txt 1.9 5.0

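Where:

input fields are split on the default whitespace delimiter
split($1,fn,"-") - split the first field into the array fn, using "-" as the delimiter
print fn[2],$5,$6 - print the file name and fields 5 & 6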
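
One caveat worth sketching: if file.out really begins with the header row shown in the question (file a b c d e), its first field contains no "-", so fn[2] is empty and the command above would also print a line like " d e". A minimal variant, assuming every real entry has at least six fields and a "-" in its first field, prints only the lines where split finds a second piece. The function name list_targets is made up for illustration.

```shell
# Hypothetical helper (not part of the original answer).
# Usage: list_targets file.out
list_targets() {
    # split() returns the number of pieces it produced; require at least two
    # pieces and six fields, which skips the header row.
    awk 'NF >= 6 && split($1, fn, "-") >= 2 { print fn[2], $5, $6 }' "$1"
}
```

Its output has the same three-column shape as the command above, so it can feed the same kind of read loop.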

We now loop over this list with a second awk command that extracts fields 3 and 4 from each data file and appends fields 5 and 6 (from file.out):

# OP will need to update the following variables to ensure they reference the correct directory where the input/output files are located:

$ in_dir="/home/lijun/data"
$ out_dir="/home/lijun/results"
$ mkdir -p "${out_dir}"    # make sure the output directory exists before writing to it

$ while read -r fname field5 field6
do
    # I only have one file in my system so I'll print a warning about files I can't find
  
    [ ! -f "${in_dir}/${fname}" ]                                         && \
    echo "WARNING: Unable to locate file '${in_dir}/${fname}'. Skipping." && \
    continue

    echo "Processing file '${in_dir}/${fname}' ..."

    # pass fields 5 & 6 into `awk` using `-v`; print out desired fields

    awk -v f5="${field5}" -v f6="${field6}" '{ print $3,$4,f5,f6 }' "${in_dir}/${fname}" > "${out_dir}/${fname}"

done < <(awk '{ split($1,fn,"-"); print fn[2],$5,$6 }' file.out)
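
As an alternative to the shell loop, the same join can be sketched in a single awk invocation. This is not what the answer does. It relies on the common NR == FNR idiom to read file.out first and then walks every .txt file in the input directory. The function name join_columns and its three arguments are hypothetical, and the sketch assumes file.out is non-empty and that the input directory contains at least one .txt file.

```shell
# Hypothetical helper (not part of the original answer).
# Usage: join_columns FILE_OUT IN_DIR OUT_DIR
join_columns() {
    mkdir -p "$3" || return 1
    awk -v out_dir="$3" '
        # First input (file.out): remember columns 5 and 6 per data file name.
        NR == FNR {
            if (split($1, fn, "-") >= 2) extra[fn[2]] = $5 " " $6
            next
        }
        # First line of each data file: close the previous output file and
        # decide whether this file is listed in file.out.
        FNR == 1 {
            if (out != "") close(out)
            base = FILENAME
            sub(/.*\//, "", base)              # strip the directory part
            out = (base in extra) ? (out_dir "/" base) : ""
        }
        # Listed files only: print columns 3 and 4 plus the stored values.
        out != "" { print $3, $4, extra[base] > out }
    ' "$1" "$2"/*.txt
}
```

A call such as join_columns file.out /home/lijun/data /home/lijun/results would mirror the directories used above. Files in the input directory that file.out does not mention are skipped silently.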

Running the above code on my system generates:

Processing file '20160510_5_1.0_rnt.txt' ...
WARNING: Unable to locate file '20105510_5_1.0_rnt.txt'. Skipping.
WARNING: Unable to locate file '20190510_5_1.0_rnt.txt'. Skipping.
WARNING: Unable to locate file '20102510_5_1.0_rnt.txt'. Skipping.
WARNING: Unable to locate file '20145510_5_1.0_rnt.txt'. Skipping.

$ cat "${out_dir}/20160510_5_1.0_rnt.txt"
25.0 45.0 1.3 2.0
24.0 45.0 1.3 2.0
25.0 47.0 1.3 2.0
32.0 47.0 1.3 2.0

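【Comments】:

How can I redirect the output to a directory? Could you update the answer?
See the commented-out line and the example that adds the target directory as a prefix in the out_fname variable.
But how does it search for the matching files in the directory where the .txt files are?
Files like 20160510_5_1.0_rnt.txt exist in the directory.
Updated on the assumption that the data and results directories are under /home/lijun/; if that is not correct, just update the in_dir and out_dir variables.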

【Solution 2】:

You can parse file.out with a first awk to get all lines whose first column matches the following regex pattern:

/DS_swe\/msg.rti-(.+)-[0-9]{8}_[0-9]_[0-9].[0-9]_rnt.txt/

Here, (.+) captures the file name and stores it in \1.
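
To see the capture group in isolation, here is a small sketch that uses sed -E instead of awk (a substitution on my part, not what the answer uses). The regex is the same apart from escaped dots and added ^ and $ anchors, and \1 again refers to whatever (.+) matched. The function name capture_name is hypothetical.

```shell
# Hypothetical helper: read first-column values on stdin and print only the
# captured data file name for values that match the pattern.
capture_name() {
    sed -n -E 's#^DS_swe/msg\.rti-(.+)-[0-9]{8}_[0-9]_[0-9]\.[0-9]_rnt\.txt$#\1#p'
}

echo 'DS_swe/msg.rti-20160510_5_1.0_rnt.txt-20190415_8_2.0_rnt.txt' | capture_name
# prints: 20160510_5_1.0_rnt.txt
```

Values that do not match the pattern, such as the header word file, produce no output because of -n together with the p flag.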

So the awk command to run is the following (gensub is a GNU awk extension, so this needs gawk):

awk '{
       # Replace the first column with only the related filename in data
       # and store it in f.
       f=gensub(/DS_swe\/msg.rti-(.+)-[0-9]{8}_[0-9]_[0-9].[0-9]_rnt.txt/,
                "\\1", "1", $1)
       # If the value doesn't match the pattern, f will contain the column value
       # So don't print anything.
       if  (f != $1) print f" "$5" "$6
     }' < file.out

You will get lines like this:

20160510_5_1.0_rnt.txt 1.3 2.0

Then use read to get the value of each column:

read f c5 c6 # stores the filename in $f, the 5th column in $c5, the 6th in $c6

Finally, run another awk using these values:

# Parse data/"$f" file and for each line
# print the 3rd and 4th columns with "$c5 $c6" text
awk '{ print $3" "$4" '"$c5 $c6"'" }' <data/$f

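You can then pipe the output of the first awk into read and the second awk call.

Final working example (&& is a logical AND: the command after it runs only if read did not hit end of file). Note that read runs once here, so only the first matching line of file.out is processed: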

awk '{
       f=gensub(/DS_swe\/msg.rti-(.+)-[0-9]{8}_[0-9]_[0-9].[0-9]_rnt.txt/, "\\1",
                "1", $1)
       if  (f != $1) print f" "$5" "$6
      }' < file.out | {
                        read f c5 c6 &&
                        awk '{ print $3" "$4" '"$c5 $c6"'" }' <data/$f ;
                      }

Result:

25.0 45.0 1.3 2.0
24.0 45.0 1.3 2.0
25.0 47.0 1.3 2.0
32.0 47.0 1.3 2.0
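
The pipeline above stops after one line because read is called once. One way to handle every matching line, sketched here under assumptions of my own, is to put read in a while loop, pass the two values with -v instead of splicing them into the awk program text, and write each result into a separate directory. To keep the sketch runnable without gawk it uses a pattern match plus sub() in place of gensub, and [0-9]+ in place of {8}. The function name extract_all and its three arguments are hypothetical.

```shell
# Hypothetical helper (not part of the original answer).
# Usage: extract_all FILE_OUT DATA_DIR RESULTS_DIR
extract_all() {
    mkdir -p "$3" || return 1
    # Stage 1: for each matching first column, print "<data file> <col5> <col6>".
    awk '$1 ~ /^DS_swe\/msg\.rti-.+-[0-9]+_[0-9]_[0-9]\.[0-9]_rnt\.txt$/ {
             f = $1
             sub(/^DS_swe\/msg\.rti-/, "", f)   # drop the leading prefix
             sub(/-[^-]*$/, "", f)              # drop "-<second file name>"
             print f, $5, $6
         }' "$1" |
    # Stage 2: one awk run per listed data file.
    while read -r f c5 c6; do
        if [ ! -f "$2/$f" ]; then
            echo "WARNING: '$2/$f' not found, skipping." >&2
            continue
        fi
        awk -v c5="$c5" -v c6="$c6" '{ print $3, $4, c5, c6 }' "$2/$f" > "$3/$f"
    done
}
```

For the layout in the question, a call would look like extract_all file.out data results, leaving one output file per listed data file under results/.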

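【Comments】:

In this code I need to extract many files, not just 20160510_5_1.0_rnt.txt 1.3 2.0 ..... How can I automate it, as you mentioned with (.+)-20190415_8_2.0_rnt.txt?
@lijun, I have edited the answer. By the way, you really should first try to understand what is happening in the code you are given.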