How to detect and remove indentation of a piped text: answers

【Question title】: How to detect and remove indentation of a piped text
【Posted】: 2019-05-30 08:33:24
【Question description】:

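I am looking for a way to remove the indentation of a piped text. Below is a solution using cut -c 9-, which assumes the indentation is 8 characters wide.

I am looking for a solution that can detect the number of spaces to remove. This implies going through the whole (piped) file to find the minimum number of spaces (tabs?) used to indent it, then removing that many from each line.

run.sh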

help() {
    awk '
    /esac/{b=0}
    b
    /case "\$arg" in/{b=1}' \
    "$me" \
    | cut -c 9-
}

while [[ $# -ge 1 ]]
do
    arg="$1"
    shift
    case "$arg" in
        help|h|?|--help|-h|'-?')
            # Show this help
            help;;
    esac
done

$ ./run.sh --help

help|h|?|--help|-h|'-?')
    # Show this help
    help;;

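Note: echo $' 4\n 2\n 3' | python3 -c 'import sys; import textwrap as tw; print(tw.dedent(sys.stdin.read()), end="")' works, but I expect there is a better way (I mean, one that relies only on software more commonly available than python; maybe awk? I would not mind seeing a perl solution either).

Note 2: echo $' 4\n 2\n 3' | python -c 'import sys; import textwrap as tw; print tw.dedent(sys.stdin.read()),' also works (Python 2.7.15rc1).

【Comments】:

Please add the desired output for that sample input to your question.
The question is unclear, and the goal of what you are trying to do in help() is not obvious. What possible inputs would you see? If you just want to remove leading whitespace, something like | sed -e 's/^[[:space:]]+//' should work.
@JeffBreadner: I suggest replacing -e with -E, or + with \+.
Do you care about how tabs are handled, or are we just counting whitespace bytes here?
Aside: questions that can be read as "please implement X in as many languages as possible" are generally frowned upon here. Questions should be specific, about actual problems you face, and definitely not open-ended. Once you have an implementation that meets your needs, you no longer face an actual problem. Wanting to drop the Python dependency so that this runs with only baseline UNIX tools (such as awk, so dawg's answer is a good one) is a fair "actual problem". Wanting to see more alternative implementations out of curiosity is not.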

Tags: bash unix pipe

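【Solution 1】:

Another awk solution, based on dawg's answer. The main differences are:

No need to set an arbitrarily large number for the indentation, which feels hacky.
Handles text containing blank lines by ignoring them when looking for the least-indented line.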

awk '
  {
    lines[++count] = $0
    if (NF == 0) next
    match($0, /[^ ]/)
    if (length(min) == 0 || RSTART < min) min = RSTART
  }
  END {
    for (i = 1; i <= count; i++) print substr(lines[i], min)
  }
' <<< $'    4\n  2\n   3'

Or all on one line:

awk '{ lines[++count] = $0; if (NF == 0) next; match($0, /[^ ]/); if (length(min) == 0 || RSTART < min) min = RSTART; } END { for (i = 1; i <= count; i++) print substr(lines[i], min) }' <<< $'    4\n  2\n   3'

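Explanation:

Append the current line to the array and increment the count variable.

{
  lines[++count] = $0

If the line is blank (no fields), skip to the next iteration.

  if (NF == 0) next

Set RSTART to the index of the first non-space character.

  match($0, /[^ ]/)

If min is unset or greater than RSTART, set the former to the latter.

  if (length(min) == 0 || RSTART < min) min = RSTART
}

Runs after all input has been read.

END {

Loop over the array and, for each line, print only the substring from the index stored in min to the end of the line.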

  for (i = 1; i <= count; i++) print substr(lines[i], min)
}
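
Since the question is about piped text, the same awk program can be wrapped in a shell function and used as a filter. This is only a minimal sketch: the function name dedent is an assumption made for illustration and does not appear in the answer above.

```shell
# dedent: hypothetical helper name; reads stdin, writes the dedented text to stdout.
dedent() {
  awk '
    {
      lines[++count] = $0        # buffer every line, blank ones included
      if (NF == 0) next          # blank lines do not affect the minimum
      match($0, /[^ ]/)          # RSTART = position of the first non-space character
      if (length(min) == 0 || RSTART < min) min = RSTART
    }
    END {
      for (i = 1; i <= count; i++) print substr(lines[i], min)
    }
  '
}

# Usage: pipe any text through it (note the empty third line).
printf '    4\n  2\n\n   3\n' | dedent
```

In the question's help() function, this filter could presumably replace the fixed cut -c 9- at the end of the pipeline.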

【Comments】:

【Solution 2】:

echo $'    4\n  2\n   3\n  \n   more spaces in  the    line\n  ...' | \
(text="$(cat)"; echo "$text" \
| cut -c "$(echo "$text" | sed 's/[^ ].*$//' | awk 'NR == 1 {a = length} length < a {a = length} END {print a + 1}')-"\
)

With explanation:

echo $'    4\n  2\n   3\n  \n   more spaces in  the    line\n  ...' | \
(
    text="$(cat)" # Obtain the input in a variable
    echo "$text" | cut -c "$(
        # `cut` removes the n-1 first characters of each line of the input, where n is:
            echo "$text" | \
            sed 's/[^ ].*$//' | \
            awk 'NR == 1 || length < a {a = length} END {print a + 1}'
            # sed: keep only the initial spaces, remove the rest
            # awk:
            # At the first line `NR == 1`, get the length of the line `a = length`.
            # For any shorter line `length < a`, update the length `a = length`.
            # At the end of the piped input, print the shortest length + 1.
            # ... we add 1 because in `cut`, characters of the line are indexed at 1.
        )-"
)

Update:

Spawning sed can be avoided. As per tripleee's comment, sed's s/// can be replaced with awk's sub(). Here is an even shorter option, using n = match() as in tripleee's answer.

echo $'    4\n  2\n   3\n  \n   more spaces in  the    line\n  ...' | \
(
    text="$(cat)" # Obtain the input in a variable
    echo "$text" | cut -c "$(
        # `cut` removes the first k-1 characters of each line of the input, where k is:
            echo "$text" | \
            awk '
                {n = match($0, /[^ ]/); n = n ? n - 1 : length($0)}
                NR == 1 || n < a {a = n}
                a == 0 {exit}
                END {print a + 1}'
            # awk:
            # At every line, count the leading spaces: one less than the position of the
            # first non-space character, or the whole length if the line is all spaces.
            # At the first line `NR == 1`, copy that count to `a`.
            # For any line with fewer spaces than `a` (`n < a`), update `a` (`a = n`).
            # `a` is then the minimum number of leading spaces common to all lines.
            # Once `a` is 0 nothing can be removed, so `exit` skips the remaining lines and runs END.
            # At the end of the piped input, print a + 1.
            # ... we add 1 because in `cut`, characters of the line are indexed at 1.
            # Only that number is printed here; the text itself is written by the outer `echo`.

        )-"
)

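Apparently this can also be done in Perl 6 with the function my &f = *.indent(*);.

【Comments】:

The sed snippet can be refactored into the awk script; it is a simple sub(), though directly finding the index of the first non-space character is probably more efficient.

【Solution 3】:

Here is the (semi-)obvious temporary-file solution.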

#!/bin/sh

t=$(mktemp -t dedent.XXXXXXXXXX) || exit
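trap 'rm -f "$t"' EXIT
# match() returns the 1-based position of the first non-space character (0 if there is none)
awk '{ n = match($0, /[^ ]/); if (NR == 1 || n<min) min = n }1
    END { exit (min ? min : 1) }' >"$t"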
cut -c $?- "$t"

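If all lines have more than 255 leading whitespace characters, this will obviously fail, because the result will not fit in Awk's exit code.

The advantage is that we are not limited by available memory; instead, we are limited by available disk space. The downside is that disk may be slower, but IMHO the advantage of not reading big files into memory wins out.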
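
If the exit-code limit matters, the column can be passed on standard output instead of through the exit status. The following is a minimal sketch of that variation, still using a temporary file; the function name dedent_tmp and the variable col are made up for illustration. Unlike the script above, it ignores blank lines when looking for the minimum.

```shell
# dedent_tmp: hypothetical name; buffers stdin in a temp file, then cuts it.
dedent_tmp() {
  t=$(mktemp -t dedent.XXXXXXXXXX) || return
  cat >"$t"                          # first pass: save stdin to disk
  # Second pass: smallest 1-based position of a non-space character;
  # lines with no such character (n == 0) are ignored.
  col=$(awk '{ n = match($0, /[^ ]/) }
             n && (!min || n < min) { min = n }
             END { print (min ? min : 1) }' "$t")
  cut -c "$col-" "$t"                # keep everything from that column onward
  rm -f "$t"
}

printf '    4\n  2\n   3\n' | dedent_tmp
```

Because the number travels through a command substitution rather than an exit status, indentation wider than 255 columns is not a problem in this sketch.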

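【Comments】:

Shouldn't it be < "$t" rather than > "$t"? Also, there is a 1 at the end of the first awk line; shouldn't it be removed? Also, as in my answer, it should be possible to replace ; if (NR == 1 || n<min) with } NR == 1 || n<min {.
No, we are reading standard input and printing the result to the temporary file, which is still empty at that point. The 1 is a common Awk idiom for "print everything".
I see no reason to move the if into a separate block, though you are correct that it is semantically identical.
You could get rid of the 255-character limit with cat > "$t", then awk '... END {print min + 1}' < "$t" | read c
What about tee?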

【Solution 4】:

Suppose you have:

$ echo $'    4\n  2\n   3\n\ttab'
    4
  2
   3
    tab

You can use the Unix expand utility to expand tabs into spaces, then use awk to find the minimum number of leading spaces on a line:

$ echo $'    4\n  2\n   3\n\ttab' | 
expand | 
awk 'BEGIN{min_indent=9999999}
     {lines[++cnt]=$0
      match($0, /^[ ]*/)
      if(RLENGTH<min_indent) min_indent=RLENGTH
     }
     END{for (i=1;i<=cnt;i++) 
               print substr(lines[i], min_indent+1)}'
  4
2
 3
      tab
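
As a quick check of how tab expansion feeds into that count, the sketch below prints the width of the leading run of spaces on each line after expand. The helper name indent_widths is made up for illustration; its optional argument is passed to expand -t (the tab stop width, 8 when omitted).

```shell
# indent_widths: hypothetical helper; prints one number per input line.
indent_widths() {
  expand -t "${1:-8}" | awk '{ match($0, /^[ ]*/); print RLENGTH }'
}

printf '\ttab\n  two\n' | indent_widths      # tab stops every 8 columns
printf '\ttab\n  two\n' | indent_widths 4    # tab stops every 4 columns
```

With the default tab stops the tab-indented line counts as 8 spaces, so the two-space line decides the minimum, which matches the 6 spaces left in front of tab in the output above.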

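【Comments】:

Note: in this post the \t got a bit mangled. It is actually a tab character, which expand converts to 8 spaces, and which awk then reduces to 6 spaces...

【Solution 5】:

The following is pure bash, with no external tools or command substitutions: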

#!/usr/bin/env bash
all_lines=( )
min_spaces=9999 # start with something arbitrarily high
while IFS= read -r line; do
  all_lines+=( "$line" )
  if [[ ${line:0:$min_spaces} =~ ^[[:space:]]*$ ]]; then
    continue  # this line has at least as much whitespace as those preceding it
  fi
  # this line has *less* whitespace than those preceding it; we need to know how much.
  [[ $line =~ ^([[:space:]]*) ]]
  line_whitespace=${BASH_REMATCH[1]}
  min_spaces=${#line_whitespace}
done

for line in "${all_lines[@]}"; do
  printf '%s\n' "${line:$min_spaces}"
done

Its output is:

  4
2
 3

【Comments】: