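【Question title】: How can I split a file in Python?
【Posted】: 2010-10-07 11:39:59
【Question description】:

Is it possible to split a file? For example, you have a huge wordlist and want to split it into more than one file. How can this be done?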

【Question discussion】:

Tags: python


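【Solution 1】:

This splits a file on newlines and writes the pieces back out. You can easily change the delimiter. It also handles uneven amounts, that is, an input file whose line count is not a multiple of splitLen (20 in this example).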

splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
with open('input.txt', 'r') as f:
    lines = f.read().split('\n')

at = 1
for start in range(0, len(lines), splitLen):
    # First, get the list slice
    outputData = lines[start:start+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. The with block closes the file.
    with open(outputBase + str(at) + '.txt', 'w') as output:
        output.write('\n'.join(outputData))

    # Increment the counter
    at += 1

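【Discussion】:

  • It might be worth mentioning that for very large files, open().read() uses a lot of memory and time. In most cases, though, it is fine.
  • Oh, I know. I just wanted to put together a working script quickly, and I usually work with small files. That is how I ended up with shorthand like this.
  • This method is actually very fast. I split a 1GB file with 7M lines in 28 seconds using 1.5GB of memory. Compared with this: stackoverflow.com/questions/20602869/… it is faster.
【Solution 2】:

A better loop for sli's example that does not hog memory: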

splitLen = 20         # 20 lines per file
outputBase = 'output' # output0.txt, output1.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1

【Discussion】:

  • Be careful when copying this code! It leaves the file handles for dest and input open. Also, overriding the built-in "input" is not a good idea.
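
One way to address both points in the comment above is sketched here. It is not the answerer's code: the function name split_by_lines and its parameters are made up for illustration, and itertools.islice is swapped in for the manual counter. Every file is opened in a with block, and nothing shadows a built-in.

```python
from itertools import count, islice


def split_by_lines(src_path, lines_per_file, output_base="output"):
    """Write consecutive groups of lines to output_base0.txt, output_base1.txt, ...

    Returns the list of file names that were written.
    (Hypothetical helper, not part of the original answer.)
    """
    written = []
    with open(src_path, "r") as src:
        for at in count():
            # islice pulls at most lines_per_file lines from the open file,
            # so only one chunk is held in memory at a time.
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            name = f"{output_base}{at}.txt"
            with open(name, "w") as dest:
                dest.writelines(chunk)
            written.append(name)
    return written
```

Calling split_by_lines('input.txt', 20) would mirror the settings used in the answer. Unlike the original loop, each chunk of lines_per_file lines sits in memory briefly, which is usually a small cost.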
【Solution 3】:

A solution for splitting a binary file into chapters named .000, .001, and so on:

FILE = 'scons-conversion.7z'

MAX  = 500*1024*1024  # 500MB  - max chapter size
BUF  = 50*1024*1024*1024  # 50GB   - memory buffer size

chapters = 0
uglybuf  = b''  # one byte read ahead to detect the end of the file
with open(FILE, 'rb') as src:
  while True:
    with open(FILE + '.%03d' % chapters, 'wb') as tgt:
      written = 0
      while written < MAX:
        if len(uglybuf) > 0:
          # The read-ahead byte belongs to this chapter, so count it too
          tgt.write(uglybuf)
          written += len(uglybuf)
        data = src.read(min(BUF, MAX - written))
        tgt.write(data)
        written += len(data)
        uglybuf = src.read(1)
        if len(uglybuf) == 0:
          break
    if len(uglybuf) == 0:
      break
    chapters += 1

【Discussion】:

【Solution 4】:
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the suffix of the last part file created. That file is
    empty when the input size is an exact multiple of max_size.
    """
    with open(file, 'rb') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'wb') as tgt:
                written = 0
                while written < max_size:
                    # Never read past the size limit of the current part
                    data = src.read(min(buffer, max_size - written))
                    if data:
                        tgt.write(data)
                        # Count the bytes actually read, not the buffer size
                        written += len(data)
                    else:
                        return suffix
                suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'wb') as tgt:
        # sorted() compares names as strings, so 'x.10' sorts before 'x.2'
        for infile in sorted(infiles):
            with open(infile, 'rb') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
    

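【Discussion】:

• There is a bug if max_size is an integer multiple of 1024: written <= max_size should be written < max_size. I could not make the edit myself because it only removes one character.
• @osrpt Note that an extra file is still created if the second-to-last file reads all the remaining bytes (for example, if you split a file in half, it creates two files and a zero-byte third file). I guess this problem is not that serious.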
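
The zero-byte trailing file mentioned in the last comment can be avoided by reading the first chunk of a part before opening its output file. The following is only a sketch of that idea, not a revision from the answer's author; the name split_file_no_empty_part and the ValueError guard are assumptions added for illustration.

```python
def split_file_no_empty_part(path, prefix, max_size, buffer=1024):
    """Split path into prefix.0, prefix.1, ... of at most max_size bytes each.

    Returns the number of part files written. A part file is opened only
    once there is data for it, so no zero-byte part is created.
    (Hypothetical helper, not part of the original answer.)
    """
    if max_size <= 0 or buffer <= 0:
        raise ValueError("max_size and buffer must be positive")
    parts = 0
    with open(path, "rb") as src:
        # Read ahead: this chunk decides whether another part is needed.
        data = src.read(min(buffer, max_size))
        while data:
            written = 0
            with open("%s.%d" % (prefix, parts), "wb") as tgt:
                while True:
                    tgt.write(data)
                    written += len(data)
                    if written >= max_size:
                        break  # this part is full
                    data = src.read(min(buffer, max_size - written))
                    if not data:
                        break  # end of input
            parts += 1
            if written >= max_size:
                # The part filled up; fetch the first chunk of the next one.
                data = src.read(min(buffer, max_size))
    return parts
```

With a 2048-byte input and max_size=1024, this writes exactly two parts and returns 2.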
【Solution 5】:

Sure, it's possible:

open input file
open output file 1
count = 0
for each line in file:
    write to output file
    count = count + 1
    if count >= maxlines:
         close output file
         open next output file
         count = 0
close output file
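
The pseudocode above could be turned into runnable Python roughly as follows. This is a sketch under a few assumptions: the function name split_every and the name_format parameter are invented for illustration, and the next output file is opened lazily (only when a line is waiting to be written), which differs slightly from the pseudocode and avoids an empty last file.

```python
def split_every(path, maxlines, name_format="part{}.txt"):
    """Copy lines from path into numbered files of at most maxlines lines each.

    Returns the number of output files written.
    (Hypothetical helper; names are not from the original answer.)
    """
    part = 0
    count = 0
    out = None
    with open(path, "r") as src:
        try:
            for line in src:
                # Open the next output file only when there is a line for it
                if out is None:
                    part += 1
                    out = open(name_format.format(part), "w")
                out.write(line)
                count += 1
                if count >= maxlines:
                    out.close()
                    out = None
                    count = 0
        finally:
            # Close a partially filled last file, also on errors
            if out is not None:
                out.close()
    return part
```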
    

【Discussion】:

【Solution 6】:
import re
PATENTS = 'patent.data'

def split_file(filename):
    # Open file to read
    with open(filename, "r") as r:

        # Counter
        n = 0

        # Start reading the file line by line
        for i, line in enumerate(r):

            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'<\?xml', line):
                n += 1

                # This "if" can be deleted; without it, naming starts from 1
                # when the first line matches. Whether to keep it depends on
                # where "re" first finds the template. In my case it was the
                # first line.
                if i == 0:
                    n = 0

            # Append the line to the current output file
            with open("{}-{}".format(filename, n), "a") as f:
                f.write(line)

split_file(PATENTS)
      

As a result you will get:

patent.data-0

patent.data-1

patent.data-N
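
The answer reopens the output file in append mode for every line, so running it twice appends to the files from the first run. A variant that keeps a single output file open at a time and truncates on open might look like the sketch below. The function name split_on_xml_declaration is hypothetical, and the numbering (first chunk gets suffix 0) is meant to match the answer's output names.

```python
import re

XML_DECL = re.compile(r"<\?xml")


def split_on_xml_declaration(filename):
    """Start a new output file, filename-N, at every line that begins with <?xml.

    Returns the list of output file names in the order they were created.
    (Hypothetical helper, not part of the original answer.)
    """
    names = []
    out = None
    with open(filename, "r") as src:
        try:
            for line in src:
                # Open a new part at each declaration, and also for any
                # leading lines that come before the first declaration.
                if out is None or XML_DECL.match(line):
                    if out is not None:
                        out.close()
                    name = "{}-{}".format(filename, len(names))
                    out = open(name, "w")
                    names.append(name)
                out.write(line)
        finally:
            if out is not None:
                out.close()
    return names
```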

【Discussion】:

【Solution 7】:

You can use the filesplit module from PyPI.

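【Discussion】:

【Solution 8】:

This is a late answer, but a new question was linked here, and none of the answers mentioned itertools.groupby.

Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines named file_part1.txt, ..., file_partn.txt, you can do:

import itertools

MAXLINES = 3  # lines per output file

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0]//MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)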
          

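【Discussion】:

【Solution 9】:

All the provided answers are good and (probably) work. However, they need to load the file into memory (all or part of it). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).

I found the following to be the most efficient way:

import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    os.system(f"ls {PREFIX}??")
else:
    print("Failed!")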
            

Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
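
Because os.system passes a single string through the shell, a file name containing spaces or shell metacharacters would break the command above. A possible alternative, sketched here with subprocess.run and an argument list, avoids the shell entirely. The helper names build_split_command and run_split are assumptions for illustration. The -d option (numeric suffixes) is a GNU coreutils feature and may not exist in every split implementation.

```python
import shutil
import subprocess


def build_split_command(file_name, max_num_lines, prefix, numeric_suffixes=True):
    """Return the argument list for split; nothing is executed here.

    (Hypothetical helper, not part of the original answer.)
    """
    cmd = ["split", "-l", str(max_num_lines)]
    if numeric_suffixes:
        cmd.append("-d")  # numeric suffixes; a GNU coreutils option
    cmd += [file_name, prefix]
    return cmd


def run_split(file_name, max_num_lines, prefix):
    """Run split and raise CalledProcessError if it exits with a non-zero status."""
    if shutil.which("split") is None:
        raise FileNotFoundError("the split command was not found on PATH")
    subprocess.run(build_split_command(file_name, max_num_lines, prefix), check=True)
```

For the values used in the answer, the call would be run_split("input_file.txt", 1000, "__").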

【Discussion】:
