Break down text file in bash: answers

【Question title】: Break down text file in bash
【Posted】: 2013-06-08 20:48:29
【Question】:

I have a text file in the following format:

variableStep chrom=chr1 span=10
10161   1
10171   1
10181   2
10191   2
10201   2
10211   2
10221   2
10231   2
10241   2
10251   1
variableStep chrom=chr10 span=10
70711   1
70721   2
70731   2
70741   2
70751   2
70761   2
70771   2
70781   2
70791   1
71161   1
71171   1
71181   1
variableStep chrom=chr11 span=10
104731  1
104741  1
104751  1
104761  1
104771  1
104781  1
104791  1
104801  1
128711  1
128721  1
128731  1

I need a way to break it down into multiple files, such as "chr1.txt", "chr10.txt" and "chr11.txt". How can I do that?

I know about the following approach:

cat file.txt | \
while IFS=$'\t' read  -r -a rowArray; do
    echo -e "${rowArray[0]}\t${rowArray[1]}\t${rowArray[2]}"
done > $file.mod.txt

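This reads the file line by line and then saves it line by line. However, I need something more elaborate that works across lines. "chr1.txt" would contain everything from the line 10161 1 through the line 10251 1, "chr10.txt" would contain everything from the line 70711 1 through the line 71181 1, and so on. Specifically, I also have to read the actual chr# from the variableStep lines and use it as the file name.

Thank you very much for your help.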

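【Question comments】:

Tags: regex linux bash unix io

【Solution 1】:

Awk is a good fit for this problem domain, because the text file is already (more or less) organized into columns. Here is what I would use: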

awk 'NF == 3 && index($2, "=") { filename = substr($2, index($2, "=") + 1) }
     NF == 2 && filename { print $0 > (filename ".txt") }' < input.txt

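Explanation:

Think of the lines that begin with variableStep as having "three columns" and of all other lines as having "two columns". The script above says: "Parse the text file line by line. If a line has three columns and the second column contains an '=' character, assign all the characters that appear after the '=' in the second column to a variable named filename. If a line has two columns and the filename variable has been assigned, write the whole line to the file whose name is built by concatenating the string in the filename variable with '.txt'."

Notes:

NF is a built-in variable in Awk that stands for "number of fields", where a "field" (in this case) can be thought of as a column of data.
$0 and $2 are built-in variables that stand for the whole line and the second column of data, respectively. ($1 stands for the first column, $3 for the third column, and so on.)
substr and index are built-in functions described here: http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
The redirection operator (>) works differently in Awk than in a shell script: the file is truncated only the first time Awk opens it, and subsequent writes to the same file are appended.
String concatenation is done simply by writing expressions next to each other. The parentheses ensure that the concatenation happens before the file is written.

More details can be found here: http://www.gnu.org/software/gawk/manual/gawk.html#Two-Rules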
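
To try the command above on a small sample, something like the following sketch could be used. It assumes a throwaway directory created with mktemp and a shortened input; the names workdir and outdir are illustrative and not part of the answer. It also shows the redirection behaviour described in the notes: a stale chr1.txt is truncated once when awk first opens it, and later prints are appended.

```shell
#!/usr/bin/env bash
# Sketch only: workdir, outdir and the shortened sample data are assumptions.
workdir=$(mktemp -d)

# A cut-down sample in the same layout as the question's file.
printf '%s\n' \
  'variableStep chrom=chr1 span=10' \
  '10161   1' \
  '10171   1' \
  'variableStep chrom=chr10 span=10' \
  '70711   1' > "$workdir/input.txt"

# Stale content, to show that awk truncates a file the first time it opens it.
echo 'stale' > "$workdir/chr1.txt"

awk -v outdir="$workdir" '
  # Header line: remember the text after "=" in the second field.
  NF == 3 && index($2, "=") { filename = substr($2, index($2, "=") + 1) }
  # Data line: write it to <outdir>/<filename>.txt.
  NF == 2 && filename       { print $0 > (outdir "/" filename ".txt") }
' < "$workdir/input.txt"

cat "$workdir/chr1.txt"    # the two chr1 data lines, without "stale"
cat "$workdir/chr10.txt"   # the single chr10 data line
```

Passing the output directory with -v keeps the sample files out of the current directory; the original command writes next to wherever it is run.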

【Comments】:

【Solution 2】:

awk -F'[ =]' '
  $1 == "variableStep" {file = $3 ".txt"; next}
  file != "" {print > file}' < input.txt
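
The answer gives no explanation, so here is a hedged sketch of why $3 holds the chromosome name: with -F'[ =]' both spaces and equals signs separate fields, so the header splits into variableStep, chrom, chr1, span and 10. The second half is a variant, not part of the answer, that closes the previous output file at each new header. That may matter for inputs with very many chromosomes, since some awk implementations limit the number of simultaneously open files. The names fields, workdir and outdir are illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: header comes from the question; workdir and outdir are
# illustrative names, and the sample data is shortened.
header='variableStep chrom=chr1 span=10'

# Show the fields awk sees when both " " and "=" separate fields.
fields=$(printf '%s\n' "$header" |
  awk -F'[ =]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$fields"   # field 3 is the chromosome name

# Variant that closes the previous output file at each new header.
# Assumes every chromosome occupies one contiguous block, because
# reopening a closed file with ">" would truncate it.
workdir=$(mktemp -d)
printf '%s\n' "$header" '10161   1' \
  'variableStep chrom=chr10 span=10' '70711   1' |
awk -F'[ =]' -v outdir="$workdir" '
  $1 == "variableStep" {
    if (file != "") close(file)
    file = outdir "/" $3 ".txt"
    next
  }
  file != "" { print > file }'
```

Data lines are split on the same separators, but that is harmless here because print without arguments writes the unmodified line.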

【Comments】:

【Solution 3】:

I used sed to filter ....

Code:

Kaizen ~/so_test $ cat zsplit.sh

cntr=1;
prev=1;
for curr in `cat ztmpfile2.txt | nl | grep variableStep | tr -s " " | cut -d" " -f2 | sed -n 's/variableStep//p'`
do
sed -n "$prev,$(( ${curr} - 1))p" ztmpfile2.txt > zchap$cntr.txt ;
#echo "displaying : : zchap$cntr.txt " ;
#cat zchap$cntr.txt ;
prev=$curr; cntr=$(( $cntr + 1 ));
done

 sed -n "$prev,$ p" ztmpfile2.txt > zchap$cntr.txt ;
 #echo "displaying : : zchap$cntr.txt " ;
 #cat zchap$cntr.txt ;

Output:

Kaizen ~/so_test $  ./zsplit.sh
+ ./zsplit.sh
zchap1.txt :: 1 :: 1
displaying : : zchap1.txt
variableStep chrom=chr1 span=10
zchap2.txt :: 1 :: 12
displaying : : zchap2.txt
variableStep chrom=chr1 span=10
10161   1
10171   1
10181   2
10191   2
10201   2
10211   2
10221   2
10231   2
10241   2
10251   1
zchap3.txt :: 12 :: 25
displaying : : zchap3.txt
 variableStep chrom=chr10 span=10
70711   1
70721   2
70731   2
70741   2
70751   2
70761   2
70771   2
70781   2
70791   1
71161   1
71171   1
71181   1
displaying : : zchap4.txt
variableStep chrom=chr11 span=10
104731  1
104741  1
104751  1
104761  1
104771  1
104781  1
104791  1
104801  1
128711  1
128721  1
128731  1

If you wish, you can delete the header line (for example variableStep chrom=chr11 span=10) from the resulting zchap* files using sed: sed -i '/variableStep/d' zchap*

Does this help?
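
The script above names its output zchap1.txt, zchap2.txt and so on, while the question asks for files named after the chromosome. One possible follow-up step, sketched here under assumptions, reads the header on the first line of each chunk and copies the remaining lines into a file named after it. The directory workdir and the shortened sample chunks are hypothetical, and chunks that contain only a header (like zchap1.txt in the output above) are skipped.

```shell
#!/usr/bin/env bash
# Sketch only: workdir and the shortened zchap sample files are assumptions.
workdir=$(mktemp -d)
printf '%s\n' 'variableStep chrom=chr1 span=10' > "$workdir/zchap1.txt"
printf '%s\n' 'variableStep chrom=chr1 span=10' '10161   1' '10171   1' \
  > "$workdir/zchap2.txt"
printf '%s\n' 'variableStep chrom=chr10 span=10' '70711   1' \
  > "$workdir/zchap3.txt"

for f in "$workdir"/zchap*.txt; do
    # Pull the text after "chrom=" out of the first line of the chunk.
    name=$(sed -n '1s/.*chrom=\([^ ]*\).*/\1/p' "$f")
    # Skip chunks whose first line is not a header.
    [ -n "$name" ] || continue
    # Skip chunks that hold nothing but the header line.
    [ "$(wc -l < "$f")" -gt 1 ] || continue
    # Copy everything after the header into <name>.txt.
    sed '1d' "$f" > "$workdir/$name.txt"
done

ls "$workdir"
```

This step should run before the sed -i deletion shown above, since it relies on the header still being present in each chunk.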

【Comments】:

【Solution 4】:

This worked for me:

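IFS=$'\n'
curfile=""
content=($(< file.txt))
for ((idx = 0; idx < ${#content[@]}; idx++)); do
    # Header line: capture the non-space text after "chrom=".
    if [[ ${content[idx]} =~ chrom=([^[:space:]]+) ]]; then
        curfile="${BASH_REMATCH[1]}.txt"
        # Start each output file empty.
        rm -f -- "${curfile}"
    elif [ -n "${curfile}" ]; then
        # Data line: append it to the current output file.
        printf '%s\n' "${content[idx]}" >> "${curfile}"
    fi
done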

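【Comments】:

Thanks for your reply. The code creates the content variable, but it seems to just go through the for loop without actually doing anything. As far as I can tell, the if statement checks whether the "chrom" string is found; however, it uses =~ logic, so I am not sure that is what it means. As a result, curfile is never created and the elif is never executed. Do you have any ideas?
For lines that do have the chrom pattern, the curfile variable is set to chr1.txt, chr10.txt and chr11.txt in turn. The elif part is invoked for every line that does not have the chrom pattern, and appends the current line to the file named in curfile. That is what I understood you were trying to achieve.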
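
Since the exchange above turns on what =~ does, here is a minimal sketch of the operator in isolation. In bash, [[ string =~ regex ]] tests the string against a POSIX extended regular expression and stores the captured groups in the BASH_REMATCH array. The helper name chrom_file and the pattern are assumptions chosen for illustration: the chromosome name is taken to be the run of non-space characters after "chrom=".

```shell
#!/usr/bin/env bash
# Sketch only: chrom_file is a hypothetical helper; the sample lines come
# from the question.
chrom_file() {
    # Succeeds only when the line contains "chrom=<name>".
    [[ $1 =~ chrom=([^[:space:]]+) ]] || return 1
    # BASH_REMATCH[0] is the whole match, BASH_REMATCH[1] the first group.
    printf '%s.txt\n' "${BASH_REMATCH[1]}"
}

chrom_file 'variableStep chrom=chr10 span=10'       # prints chr10.txt
chrom_file '70711   1' || echo 'no match: data line'
```

Bash uses POSIX extended regular expressions here, so Perl-style constructs such as \b and the non-greedy .*? are not portable inside [[ ... =~ ... ]]; a bracket expression like [^[:space:]]+ avoids them.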