Answers: Splitting all txt files in a folder into smaller files based on a regular expression using bash

【Question title】: Splitting all txt files in a folder into smaller files based on a regular expression using bash
【Posted】: 2013-04-22 18:05:23
【Question】:

I have a folder of large text files. Each file is a collection of 1000 files separated by [[ file name ]] lines. I want to split each file into its 1000 component files and put them in a new folder. Is there a way to do this in bash? Any other fast method is fine too.

for f in $(find . -name '*.txt')
do mkdir $f
  mv 
  cd $f
  awk '/[[.*]]/{g++} { print $0 > g".txt"}' $f
  cd ..
done

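【Comments】:

Yes, there is a way to do this. What have you tried yourself?
for f in $(find . -name '*.txt'); do mkdir $f; mv ; cd $f; awk '/[[.*]]/{g++} { print $0 > g".txt"}' $f; cd ..; done
But it gives an error saying the file name is already in use in the folder.

Tags: regex bash text split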

【Solution 1】:

You are trying to create a folder with the same name as an existing file.

for f in $(find . -name '*.txt')
do mkdir $f

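Here, find lists the files under the current path, and for each one you try to create a directory with exactly the same name, so mkdir fails. One way around this is to create a temporary folder first: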

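for f in $(find . -name '*.txt')   # note: breaks on file names containing whitespace
do mkdir temporary # create a temporary folder
  mv "$f" temporary # move the file into the folder
  mv temporary "$f" # rename the temporary folder to the name of the file
  (
    cd "$f" || exit # enter the folder; the subshell restores the old directory
    # start a new numbered file at each line beginning with [[ ... ]]
    awk '/^\[\[.*\]\]/{ if (g) close(g ".txt"); g++ } { print > (g ".txt") }' "${f##*/}"
  )
done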

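Note that all your folders will have the ".txt" extension. If you don't want that, you can cut it off before creating the folder; that way you won't need the temporary folder, because the folder you create has a different name from the .txt file. Inside the loop, the file itself is then "$f.txt". Example: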

for f in $(find . -name '*.txt' | rev | cut -b 5- | rev)
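Put together, the extension-stripping variant might look like the sketch below. It uses the shell's `${f%.txt}` suffix removal instead of `rev | cut | rev`, and a glob instead of `find`, so file names with spaces survive. The function name `split_into_folders` and the numbered output names are assumptions made for illustration, not part of the original answer.

```shell
#!/bin/bash
# Sketch: for every NAME.txt in a directory, create folder NAME and split
# the file into NAME/1.txt, NAME/2.txt, ... at each [[ ... ]] header line.
# split_into_folders is a hypothetical helper name.
split_into_folders() {
  local dir=$1 f folder
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue          # skip if the glob matched nothing
    folder=${f%.txt}                 # ./a.txt -> ./a, so no name clash
    mkdir -p "$folder" || return
    # g counts headers; each header starts a new numbered output file.
    # Lines before the first header are dropped.
    awk -v out="$folder" '
      /^\[\[.*\]\]/ { if (g) close(out "/" g ".txt"); g++ }
      g             { print > (out "/" g ".txt") }
    ' "$f"
  done
}
```

Calling `split_into_folders .` processes the current folder. Closing each output file before opening the next keeps awk from running out of file descriptors when a collection holds 1000 parts.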

【Comments】:

【Solution 2】:

Not awk, and written by someone who had been drinking, so no guarantee that it works.

import re
import sys


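def main():
    # a header line looks like [[ file name ]]; capture the name without padding
    pattern = re.compile(r'\[\[\s*(.+?)\s*\]\]')
    fname = None   # output file for the section being read
    lines = []     # lines collected for that section
    with open(sys.argv[1]) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                # a new header: flush the previous section, if any
                if fname is not None:
                    with open(fname, 'w') as g:
                        g.writelines(lines)
                fname = m.group(1)
                lines = []
            else:
                # lines before the first header are collected but never written
                lines.append(line)

    # flush the last section
    if fname is not None:
        with open(fname, 'w') as g:
            g.writelines(lines)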

if __name__ == '__main__':
    main()
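For comparison, the same flush-on-header logic fits in a short awk program called from bash. This is only a sketch under assumptions: the wrapper name `split_by_header` is hypothetical, header lines are taken to be `[[ name ]]` alone on a line, and names that are empty or contain `/` or `..` are skipped rather than written.

```shell
#!/bin/bash
# split_by_header INPUT OUTDIR  (hypothetical helper)
# Writes the lines following each "[[ name ]]" header to OUTDIR/name.
split_by_header() {
  local input=$1 outdir=$2
  mkdir -p "$outdir" || return
  awk -v out="$outdir" '
    /^\[\[.*\]\][ \t]*$/ {
      if (file != "") close(file)
      name = $0
      sub(/^\[\[[ \t]*/, "", name)          # drop "[[" and padding
      sub(/[ \t]*\]\][ \t]*$/, "", name)    # drop padding and "]]"
      # refuse empty names and names that could leave OUTDIR
      file = (name == "" || name ~ /\/|\.\./) ? "" : (out "/" name)
      if (file != "") printf "" > file      # create or truncate the output
      next
    }
    file != "" { print > file }
  ' "$input"
}
```

A loop such as `for f in *.txt; do split_by_header "$f" out; done` would then cover a whole folder. Text before the first header is ignored, as in the Python version.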

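【Comments】:

【Solution 3】:

Write a bash script. Here, I've done it for you.

Note the structure and features of this script:

It explains what it does in the usage() function, which backs the -h option.
It provides a standard set of options: -h, -n, -v.
It uses getopts for option processing.
It does a lot of error checking on the arguments.
It is careful about filename parsing (note that whitespace around a file name is ignored).
It hides details in functions. Notice the "talk", "qtalk", and "nvtalk" functions? These come from a bash library I built to make this kind of scripting easy.
It explains to the user what is happening in $verbose mode.
It lets the user see what would be done without actually doing it (the -n option, for $norun mode).
It never runs commands directly, but through the run function, which honors the $norun, $verbose, and $quiet variables.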
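The dry-run idea in the list above can be tried in isolation. The following is a minimal sketch, not the answer's own library: `demo_main` and its messages are hypothetical, and this `run` executes its argument vector directly instead of going through `eval`.

```shell
#!/bin/bash
# Minimal sketch of the -n (dry run) / -v (verbose) pattern.
# demo_main and its messages are hypothetical, for illustration only.
norun= verbose=

# run COMMAND ARGS...: print instead of executing when $norun is set
run() {
  if [ -n "$norun" ]; then
    echo "(norun) $*" >&2
    return 0
  fi
  [ -n "$verbose" ] && echo ">> $*" >&2
  "$@"                      # execute the argument vector directly, no eval
}

demo_main() {
  local opt OPTIND=1
  norun= verbose=
  while getopts 'nv' opt; do
    case "$opt" in
      n) norun=1 ;;
      v) verbose=1 ;;
      *) echo "usage: demo_main [-n] [-v] DIR" >&2; return 2 ;;
    esac
  done
  shift $(( OPTIND - 1 ))
  run mkdir -p "${1:?missing DIR}"
}
```

With `demo_main -n some/dir` the command is only printed; without `-n` the directory is created. Routing every side effect through one function is what makes the dry run trustworthy.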

I'm not just giving you a fish, I'm teaching you how to fish.

Good luck with your next bash script.

Alan S.

#!/bin/bash
# split-collections IN-FOLDER OUT-FOLDER

PROG="${0##*/}"

usage() {
  cat 1>&2 <<EOF
usage: $PROG [OPTIONS] IN-FOLDER OUT-FOLDER

This script splits a collection of files within IN-FOLDER into
separate, named files into the given OUT-FOLDER.  The created file
names are obtained from formatted text headers within the input
files.

The format of each input file is a set of HEADER and BODY pairs,
where each HEADER is a text line formatted as:

    [[input-filename1]]
    text line 1
    text line 2
    ...
    [[input-filename2]]
    text line 1
    text line 2
    ...

Normal processing will show the filenames being read, and file
names being created.  Use the -v (verbose) option to show the
number of text lines being written to each created file.  Use
-v twice to show the actual lines of text being written.

Use the -n option to show what would be done, without actually
doing it.

Options
 -h       Show this help
 -n       Dry run -- do NOT create any files or make any changes
 -o       Overwrite existing output files.
 -q       Be quiet
 -v       Be verbose

EOF
   exit
}

talk()   { echo 1>&2 "$@" ; }
chat()   { [[ -n "$norun$verbose" ]] && talk "$@" ; }
nvtalk() { [[ -n "$verbose" ]] || talk "$@" ; }
qtalk()  { [[ -n "$quiet" ]]   || talk "$@" ; }
nrtalk() { talk "${norun:+(norun) }$@" ; }

error() { 
  local code=2
  # treat $1 as an exit code only if it is entirely numeric
  case "$1" in ''|*[!0-9]*) ;; *) code=$1 ; shift ;; esac
  echo 1>&2 "$@"
  exit $code
}

talkf()   { printf 1>&2 "$@" ; }
chatf()   { [[ -n "$norun$verbose" ]] && talkf "$@" ; }
nvtalkf() { [[ -n "$verbose" ]] || talkf "$@" ; }
qtalkf()  { [[ -n "$quiet" ]]   || talkf "$@" ; }
nrtalkf() { talkf "${norun:+(norun) }$@" ; }

errorf()  { 
  local code=2
  # treat $1 as an exit code only if it is entirely numeric
  case "$1" in ''|*[!0-9]*) ;; *) code=$1 ; shift ;; esac
  printf 1>&2 "$@"
  exit $code
}

# run COMMAND ARGS ...

qrun() {
  ( quiet=1 run "$@" )
}

run() {
  if [[ -n "$norun" ]]; then
    if [[ -z "$quiet" ]]; then
      nrtalk "$@"
    fi
  else
    if [[ -n "$verbose" ]]; then
      talk ">> $@"
    fi
    # propagate the command's own exit status
    # (inside "if ! cmd; then", $? is always 0)
    eval "$@" || return $?
  fi
  return 0
}

show_line() {
  talkf "%s:%d: %s\n" "$in_file" "$lines_in" "$line"
}

# given an input filename, read it and create 
# the output files as indicated by the contents
# of the text in the file

split_collection() {
  in_file="$1"
  out_file=
  lines_in=0
  lines_out=0
  skipping=
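  # IFS= and -r keep leading whitespace and backslashes intact;
  # the || clause also picks up a final line that lacks a newline
  while IFS= read -r line || [[ -n "$line" ]] ; do
    : $(( lines_in++ ))

    (( verbose_count > 1 )) && show_line

    # if a line with the format of "[[foo]]" occurs,
    # close the current output file, and open a new
    # output file called "foo"

    if [[ "$line" =~ ^[[:blank:]]*\[\[[[:blank:]]*([^ ]+.*[^ ]|[^ ])[[:blank:]]*\]\][[:blank:]]*$ ]] ; then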
      new_file="${BASH_REMATCH[1]}"

      # close out the current file, if any
      if [[ "$out_file" ]]; then
        nrtalkf "%d lines written to %s\n" $lines_out "$out_file"
      fi

      # check the filename for bogosities
      case "$new_file" in 
        *..*|*/*) 
          (( verbose_count < 2 )) && show_line
          error "Badly formatted filename"
          ;;
      esac

      out_file="$out_folder/$new_file"
      if [[ -e "$out_file" ]]; then
        if [[ -n "$overwrite" ]]; then
          nrtalk "Overwriting existing '$out_file'"
          # single quotes: eval expands $out_file itself, so odd names are safe
          qrun ': >"$out_file"'
        else
          error "$out_file already exists."
        fi
      else
        nrtalk "Creating new output file: '$out_file' ..."
        # single quotes: eval expands $out_file itself, so odd names are safe
        qrun 'touch "$out_file"'
      fi

      lines_out=0
    elif [[ -z "$out_file" ]]; then

      # apparently, there are text lines before the filename
      # header; ignore them (out loud)
      if [[ ! "$skipping" ]]; then
        talk "Text preceding first filename ignored.."
        skipping=1
      fi

    else # next line of input for the file
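      # single quotes: eval expands $line itself, so quotes, $ and
      # backticks in the text are written literally, not executed
      qrun 'printf "%s\n" "$line" >>"$out_file"'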
      : $(( lines_out++ ))
    fi
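  done

  # report the line count for the last output file, if any
  if [[ "$out_file" ]]; then
    nrtalkf "%d lines written to %s\n" $lines_out "$out_file"
  fi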
}

norun=
verbose=
verbose_count=0
overwrite=
quiet=

while getopts 'hnoqv' opt ; do
  case "$opt" in
  h)  usage ;;
  n)  norun=1 ;;
  o)  overwrite=1 ;;
  q)  quiet=1 ;;
  v)  verbose=1 ; : $(( verbose_count++ )) ;;
  esac
done
shift $(( OPTIND - 1 ))

in_folder="${1:?Missing IN-FOLDER; see $PROG -h for details}"
out_folder="${2:?Missing OUT-FOLDER; see $PROG -h for details}"

# validate the input and output folders
#
# It might be reasonable to create the output folder for the 
# user, but that's left as an exercise for the user.

in_folder="${in_folder%/}"    # remove trailing slash, if any
out_folder="${out_folder%/}"

[[ -e "$in_folder" ]]  || error "$in_folder does not exist" 
[[ -d "$in_folder" ]]  || error "$in_folder is not a directory."
[[ -e "$out_folder" ]] || error "$out_folder does not exist."
[[ -d "$out_folder" ]] || error "$out_folder is not a directory."

for collection in "$in_folder"/* ; do
  [[ -f "$collection" ]] || continue   # skip subfolders and an unmatched glob
  talk "Reading $collection .."
  split_collection "$collection" <"$collection"
done

exit
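The header matching at the core of split_collection can be tried on its own. This is a sketch with a hypothetical helper name, `header_name`, using a pattern similar to the one in the script but stored in a variable, which sidesteps quoting surprises inside `[[ ... =~ ... ]]`.

```shell
#!/bin/bash
# header_name LINE  (hypothetical helper)
# Prints the file name if LINE is a "[[ name ]]" header; fails otherwise.
header_name() {
  # The name is either a single non-blank character, or a run that
  # starts and ends with a non-blank character.
  local re='^\[\[[[:blank:]]*([^[:blank:]]|[^[:blank:]].*[^[:blank:]])[[:blank:]]*\]\][[:blank:]]*$'
  [[ $1 =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}
```

For example, `header_name '[[ foo bar ]]'` prints `foo bar`, while a line of ordinary text makes the function return a non-zero status.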

【Comments】: