【Question Title】: Splitting all txt files in a folder into smaller files based on a regular expression using bash
【Posted】: 2013-04-22 18:05:23
【Question】:

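I have a folder containing large text files. Each file is a collection of 1000 files separated by [[ file name ]] lines. I want to split each file into its 1000 component files and put them in a new folder. Is there a way to do this in bash? Any other fast method is fine too.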

for f in $(find . -name '*.txt')
do mkdir $f
  mv 
  cd $f
  awk '/[[.*]]/{g++} { print $0 > g".txt"}' $f
  cd ..
done 

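【Question Comments】:

  • Yes, there is a way to do this. What have you tried yourself?
  • for f in $(find . -name '*.txt'); do mkdir $f; mv ; cd $f; awk '/[[.*]]/{g++} { print $0 > g".txt"}' $f; cd ..; done
  • But it gives an error saying the file name is already in use in the folder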

Tags: regex bash text split


【Solution 1】:

You are trying to create a folder with the same name as an existing file.

for f in $(find . -name '*.txt')
do mkdir $f

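Here, find lists the files under the current path, and for each one you try to create a directory with exactly the same name, so mkdir fails. One way around this is to create a temporary folder first: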

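for f in $(find . -name '*.txt')
do mkdir temporary # create a temporary folder
  mv "$f" temporary # move the file into the folder
  mv temporary "$f" # rename the temporary folder to the name of the file
  cd "$f" || continue # enter the folder and go on....
  # brackets must be escaped to match a literal [[ ... ]] header;
  # close() keeps awk from running out of open files
  awk '/\[\[.*\]\]/ { if (out) close(out); out = (++g) ".txt" } out { print > out }' "${f##*/}"
  cd - >/dev/null # return to the starting directory
done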

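Note that all your folders will have a ".txt" extension. If you don't want that, you can strip it before creating the folder; then you won't need the temporary folder, because the folder you create has a different name from the .txt file. Example: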

for f in $(find . -name '*.txt' | rev | cut -b 5- | rev)
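
The suffix can also be removed with bash parameter expansion, ${f%.txt}, which avoids the rev/cut pipeline and keeps the original file name available. Below is a minimal sketch of the whole loop under that approach. The function name split_all, the -maxdepth 1 restriction (a GNU/BSD find option), and the numbered output names (1.txt, 2.txt, ...) are assumptions for illustration, not part of the original answer.

```bash
#!/bin/bash
# Sketch: split every *.txt file in a folder into numbered pieces, one piece
# per "[[ ... ]]" header, inside a folder named after the file minus ".txt".
# split_all is a hypothetical helper name.
split_all() {
  local dir="$1" f out_dir
  # -print0 with read -d '' keeps file names containing spaces intact
  while IFS= read -r -d '' f; do
    out_dir="${f%.txt}"            # ./big.txt -> ./big
    mkdir -p "$out_dir" || return 1
    # Each header line starts a new numbered file and is kept as its first
    # line, as in the original awk command. close() the previous file so
    # awk never holds too many files open at once.
    awk -v dir="$out_dir" '
      /^\[\[.*\]\]/ { if (out != "") close(out); out = dir "/" (++n) ".txt" }
      out != ""     { print > out }
    ' "$f"
  done < <(find "$dir" -maxdepth 1 -type f -name '*.txt' -print0)
}
```

Calling split_all . from the folder that holds the large files would create one sibling folder per input file. Lines that appear before the first header are dropped in this sketch.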

【Comments】:

    【Solution 2】:

    This is not awk, and it was written by a drunk, so there is no guarantee it works.

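    import re
    import sys


    def flush(fname, lines):
        # Write the collected lines, unless no header has been seen yet.
        if fname is not None:
            with open(fname, 'w') as g:
                g.writelines(lines)


    def main():
        # A header line looks like [[ file name ]]; capture the name
        # without the surrounding blanks.
        pattern = re.compile(r'\[\[\s*(.+?)\s*\]\]')
        fname = None
        lines = []
        with open(sys.argv[1]) as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    flush(fname, lines)  # finish the previous section
                    fname = m.group(1)
                    lines = []
                else:
                    lines.append(line)
        flush(fname, lines)  # write the last section


    if __name__ == '__main__':
        main()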
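
For comparison, the same header-driven split can be sketched with awk called from bash, naming each piece after the text inside the brackets as the Python script does. The function name split_by_header and the output-directory argument are assumptions for illustration. Unlike the Python version, this sketch only treats a line as a header when the whole line is a [[ ... ]] marker, and it skips names containing a slash as a precaution.

```bash
#!/bin/bash
# Sketch: write the lines following each "[[ name ]]" header to a file
# called "name" inside an output directory.
# split_by_header is a hypothetical helper name.
split_by_header() {
  local in_file="$1" out_dir="$2"
  mkdir -p "$out_dir" || return 1
  awk -v dir="$out_dir" '
    /^\[\[.*\]\][ \t]*$/ {
      if (out != "") close(out)            # finish the previous piece
      name = $0
      sub(/^\[\[[ \t]*/, "", name)         # drop "[[" and leading blanks
      sub(/[ \t]*\]\][ \t]*$/, "", name)   # drop trailing blanks and "]]"
      # skip empty names and names that would leave the output directory
      out = (name == "" || name ~ /\//) ? "" : dir "/" name
      if (out != "") printf "" > out       # create or truncate the piece
      next                                 # the header itself is not copied
    }
    out != "" { print > out }
  ' "$in_file"
}
```

A call such as split_by_header big.txt pieces would leave one file per header in pieces/. Text before the first header is ignored.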
    

    【Comments】:

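      【Solution 3】:

      Write a bash script. Here, I have done it for you.

      Note the structure and features of this script:

      • Explains what it does in the usage() function, which is used for the -h option.
      • Provides a standard set of options: -h, -n, -v
      • Uses getopts for option processing
      • Does a lot of error checking on the arguments
      • Takes care with filename parsing (note that blanks surrounding the file name are ignored).
      • Hides details in functions. Notice the "talk", "qtalk", and "nvtalk" functions? These come from a bash library I built to make this kind of scripting easy.
      • Explains to the user what is happening when in $verbose mode.
      • Lets the user see what would be done without actually doing it (the -n option, for $norun mode).
      • Never runs commands directly; it uses the run function, which pays attention to the $norun, $verbose, and $quiet variables.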
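
The dry-run and verbose handling described in the list above can be sketched on its own. This is a reduced illustration, not the script's actual run function: it executes its arguments directly as "$@" instead of going through eval, and the demo function with its option handling is an assumption made for the example.

```bash
#!/bin/bash
# Sketch of the -n (dry run) / -v (verbose) pattern with a run wrapper.

norun=
verbose=

# run COMMAND ARGS ...
# In dry-run mode only print the command; otherwise execute it as given.
run() {
  if [[ -n "$norun" ]]; then
    echo "(norun) $*" 1>&2
    return 0
  fi
  [[ -n "$verbose" ]] && echo ">> $*" 1>&2
  "$@"
}

# demo is a hypothetical entry point: parse -n/-v, then create one file.
demo() {
  local opt OPTIND=1
  norun=
  verbose=
  while getopts 'nv' opt; do
    case "$opt" in
      n) norun=1 ;;
      v) verbose=1 ;;
      *) return 2 ;;
    esac
  done
  shift $(( OPTIND - 1 ))
  run touch "$1"
}
```

Because run receives the command as separate words, arguments containing spaces or quotes reach the command unchanged. The trade-off is that shell redirections cannot be passed through it, which is one reason a script might choose eval instead.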

      I'm not just handing you a fish, I'm teaching you how to fish.

      Good luck with your next bash script.

      Alan S.

      #!/bin/bash
      # split-collections IN-FOLDER OUT-FOLDER
      
      PROG="${0##*/}"
      
      usage() {
        cat 1>&2 <<EOF
      usage: $PROG [OPTIONS] IN-FOLDER OUT-FOLDER
      
      This script splits a collection of files within IN-FOLDER into
      separate, named files into the given OUT-FOLDER.  The created file
      names are obtained from formatted text headers within the input
      files.
      
      The format of each input file is a set of HEADER and BODY pairs,
      where each HEADER is a text line formatted as:
      
          [[input-filename1]]
          text line 1
          text line 2
          ...
          [[input-filename2]]
          text line 1
          text line 2
          ...
      
      Normal processing will show the filenames being read, and file
      names being created.  Use the -v (verbose) option to show the
      number of text lines being written to each created file.  Use
      -v twice to show the actual lines of text being written.
      
      Use the -n option to show what would be done, without actually
      doing it.
      
      Options
       -h       Show this help
       -n       Dry run -- do NOT create any files or make any changes
       -o       Overwrite existing output files.
       -q       Be quiet
       -v       Be verbose
      
      EOF
         exit
      }
      
      talk()   { echo 1>&2 "$@" ; }
      chat()   { [[ -n "$norun$verbose" ]] && talk "$@" ; }
      nvtalk() { [[ -n "$verbose" ]] || talk "$@" ; }
      qtalk()  { [[ -n "$quiet" ]]   || talk "$@" ; }
      nrtalk() { talk "${norun:+(norun) }$@" ; }
      
      error() { 
        local code=2
        case "$1" in [0-9]*) code=$1 ; shift ;; esac
        echo 1>&2 "$@"
        exit $code
      }
      
      talkf()   { printf 1>&2 "$@" ; }
      chatf()   { [[ -n "$norun$verbose" ]] && talkf "$@" ; }
      nvtalkf() { [[ -n "$verbose" ]] || talkf "$@" ; }
      qtalkf()  { [[ -n "$quiet" ]]   || talkf "$@" ; }
      nrtalkf() { talkf "${norun:+(norun) }$@" ; }
      
      errorf()  { 
        local code=2
        case "$1" in [0-9]*) code=$1 ; shift ;; esac
        printf 1>&2 "$@"
        exit $code
      }
      
      # run COMMAND ARGS ...
      
      qrun() {
        ( quiet=1 run "$@" )
      }
      
      run() {
        if [[ -n "$norun" ]]; then
          if [[ -z "$quiet" ]]; then
            nrtalk "$@"
          fi
        else
          if [[ -n "$verbose" ]]; then
            talk ">> $@"
          fi
          # "$?" inside "if ! cmd" is always 0, so capture the status directly
          eval "$@" || return $?
        fi
        return 0
      }
      
      show_line() {
        talkf "%s:%d: %s\n" "$in_file" "$lines_in" "$line"
      }
      
      # given an input filename, read it and create 
      # the output files as indicated by the contents
      # of the text in the file
      
      split_collection() {
        in_file="$1"
        out_file=
        lines_in=0
        lines_out=0
        skipping=
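        # IFS= and -r keep leading whitespace and backslashes intact;
        # the || test also handles a last line with no trailing newline
        while IFS= read -r line || [[ -n "$line" ]] ; do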
          : $(( lines_in++ ))
      
          (( verbose_count > 1 )) && show_line
      
          # if a line with the format of "[[foo]]" occurs,
          # close the current output file, and open a new
          # output file called "foo"
      
          if [[ "$line" =~ ^\[\[[[:blank:]]*([^ ]+.*[^ ]|[^ ])[[:blank:]]*\]\][[:blank:]]*$ ]] ; then
            new_file="${BASH_REMATCH[1]}"
      
            # close out the current file, if any
            if [[ "$out_file" ]]; then
              nrtalkf "%d lines written to %s\n" $lines_out "$out_file"
            fi
      
            # check the filename for bogosities
            case "$new_file" in 
              *..*|*/*) 
                (( verbose_count < 2 )) && show_line
                error "Badly formatted filename"
                ;;
            esac
      
            out_file="$out_folder/$new_file"
            if [[ -e "$out_file" ]]; then
              if [[ -n "$overwrite" ]]; then
                nrtalk "Overwriting existing '$out_file'"
                qrun "cat /dev/null >'$out_file'"
              else
                error "$out_file already exists."
              fi
            else
              nrtalk "Creating new output file: '$out_file' ..."
              qrun "touch '$out_file'"
            fi
      
            lines_out=0
          elif [[ -z "$out_file" ]]; then
      
            # apparently, there are text lines before the filename
            # header; ignore them (out loud)
            if [[ ! "$skipping" ]]; then
              talk "Text preceding first filename ignored.."
              skipping=1
            fi
      
          else # next line of input for the file
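            # write the line directly; passing input text through eval would
            # mangle or execute quotes, $, and backticks found in it
            [[ -n "$norun" ]] || printf '%s\n' "$line" >>"$out_file"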
            : $(( lines_out++ ))
          fi
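        done

        # report the final output file, which has no following header
        if [[ "$out_file" ]]; then
          nrtalkf "%d lines written to %s\n" $lines_out "$out_file"
        fi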
      }
      
      norun=
      verbose=
      verbose_count=0
      overwrite=
      quiet=
      
      while getopts 'hnoqv' opt ; do
        case "$opt" in
        h)  usage ;;
        n)  norun=1 ;;
        o)  overwrite=1 ;;
        q)  quiet=1 ;;
        v)  verbose=1 ; : $(( verbose_count++ )) ;;
        esac
      done
      shift $(( OPTIND - 1 ))
      
      in_folder="${1:?Missing IN-FOLDER; see $PROG -h for details}"
      out_folder="${2:?Missing OUT-FOLDER; see $PROG -h for details}"
      
      # validate the input and output folders
      #
      # It might be reasonable to create the output folder for the 
      # user, but that's left as an exercise for the user.
      
      in_folder="${in_folder%/}"    # remove trailing slash, if any
      out_folder="${out_folder%/}"
      
      [[ -e "$in_folder" ]]  || error "$in_folder does not exist" 
      [[ -d "$in_folder" ]]  || error "$in_folder is not a directory."
      [[ -e "$out_folder" ]] || error "$out_folder does not exist."
      [[ -d "$out_folder" ]] || error "$out_folder is not a directory."
      
      for collection in "$in_folder"/* ; do
        [[ -f "$collection" ]] || continue   # skip subdirectories and an unmatched glob
        talk "Reading $collection .."
        split_collection "$collection" <"$collection"
      done
      
      exit
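
To see what the header test in split_collection accepts, the regular expression can be exercised on its own. The sketch below stores the same pattern in a variable, which sidesteps quoting questions inside [[ ... =~ ... ]]. The names header_re and header_name are assumptions for illustration.

```bash
#!/bin/bash
# Sketch: extract the file name from a "[[ name ]]" header line using the
# same pattern as split_collection. header_name is a hypothetical helper.
header_re='^\[\[[[:blank:]]*([^ ]+.*[^ ]|[^ ])[[:blank:]]*\]\][[:blank:]]*$'

header_name() {
  if [[ "$1" =~ $header_re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"   # name without surrounding blanks
  else
    return 1                             # not a header line
  fi
}
```

For example, header_name '[[ file name ]]' prints file name, while a plain text line makes the function return a non-zero status.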
      

      【Comments】:
