【Title】: Python - how to read a file with NUL-delimited lines?
【Posted】: 2012-03-03 11:32:42
【Question】:

I usually read lines from a file with the following Python code:

f = open('./my.csv', 'r')
for line in f:
    print line

But what if the lines in the file are delimited by "\0" rather than "\n"? Is there a Python module that can handle this?

Thanks for any suggestions.

【Comments】:

    Tags: python nul


    【Answer 1】:

    If your file is small enough to read entirely into memory, you can use split:

    for line in f.read().split('\0'):
        print line
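
For readers on Python 3, the same read-and-split idea might look like the sketch below. It is an illustration under a few assumptions, not part of the original answer: the helper name `read_nul_records`, the UTF-8 encoding, and the temporary demo file are all hypothetical. The file is opened in binary mode so that no newline translation interferes with the data.

```python
import os
import tempfile

def read_nul_records(path, encoding="utf-8"):
    """Return the NUL-delimited records of a small file as a list of str."""
    with open(path, "rb") as f:
        data = f.read()
    records = data.split(b"\0")
    # Data that ends with a terminator produces one trailing empty item; drop it.
    if records and records[-1] == b"":
        records.pop()
    return [r.decode(encoding) for r in records]

# Demo with a throwaway file (contents are made up for illustration).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a,1\0b,2\0c,3\0")
demo_records = read_nul_records(tmp.name)
os.remove(tmp.name)
print(demo_records)
```

Dropping only the final empty item keeps any empty records that appear in the middle of the file.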
    

    Otherwise, you might want to try this recipe from the discussion about the feature request:

    def fileLineIter(inputFile,
                     inputNewline="\n",
                     outputNewline=None,
                     readSize=8192):
       """Like the normal file iter but you can set what string indicates newline.
       
       The newline string can be arbitrarily long; it need not be restricted to a
       single character. You can also set the read size and control whether or not
       the newline string is left on the end of the iterated lines.  Setting
       newline to '\0' is particularly good for use with an input file created with
       something like "os.popen('find -print0')".
       """
       if outputNewline is None: outputNewline = inputNewline
       partialLine = ''
       while True:
           charsJustRead = inputFile.read(readSize)
           if not charsJustRead: break
           partialLine += charsJustRead
           lines = partialLine.split(inputNewline)
           partialLine = lines.pop()
           for line in lines: yield line + outputNewline
       if partialLine: yield partialLine
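
A Python 3 variant of the same chunked approach could be sketched as follows. This is not the recipe above but a hedged adaptation: the name `iter_records` and its parameters are hypothetical, it works on bytes (so the stream should be opened in binary mode), and it yields records without the delimiter instead of appending an output newline.

```python
import io

def iter_records(stream, delimiter=b"\0", read_size=8192):
    """Yield delimiter-separated records from a binary stream, chunk by chunk."""
    buffer = b""
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; the tail may be
        # an unfinished record (or half of a multi-byte delimiter).
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    if buffer:
        # Final record that was not followed by a delimiter.
        yield buffer

# Tiny read size to exercise the chunk-boundary handling.
demo = list(iter_records(io.BytesIO(b"a\0bb\0\0ccc"), read_size=2))
print(demo)
```

Because the unfinished tail is carried over to the next read, a record or delimiter that straddles two chunks is still split correctly.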
    

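    I also noticed that your file has a "csv" extension. Python has a built-in CSV module (import csv). It has an attribute called Dialect.lineterminator, but the reader does not currently honor it:

    Dialect.lineterminator

    The string used to terminate lines produced by the writer. It defaults to '\r\n'.

    Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.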
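
Since csv.reader accepts any iterable of strings, one possible workaround is to split the records on NUL yourself and hand them to the reader. The following Python 3 sketch assumes that approach; the sample data and the variable names are made up for illustration.

```python
import csv

# Hypothetical NUL-delimited CSV content, including one quoted field.
raw = 'name,qty\0apple,3\0"pear, green",5\0'

# Split on NUL ourselves and skip empty items (such as the one after the
# final terminator), then let csv.reader parse each record.
records = [r for r in raw.split("\0") if r]
rows = list(csv.reader(records))
print(rows)
```

Here the csv module never sees the NUL characters at all; it only parses the fields inside each record.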

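    【Comments】:

    • My files will have several thousand to tens of thousands of lines.
    • @user1129812: How long is each line? 100 bytes? 100 bytes * 50000 lines = approx. 5 MB
    • Each line is about 100 characters. Assuming Unicode, that is about 200 bytes per line, so a 50000-line file is about 200 x 50000 bytes = 9.54 MB.
    • From your feature request link, I think this is probably the best solution for current Python. According to msg111453 in the feature request link, preprocessing the file with awk to change the NUL characters to "\n" could be an alternative (as long as the file's content does not itself contain "\n"). Thanks.
    • fileLineIter is not quite right: if the last character in partialLine is inputNewline, it gets lost, because 'a|b|'.split('|') == ['a', 'b', ''].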
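
The split behaviour mentioned in the last comment can be checked with a small Python 3 sketch. The helper `split_complete` is hypothetical and only isolates the split-and-pop step used by the recipe, so that the trailing empty string is visible.

```python
def split_complete(buffer, delimiter):
    """Split a buffer into (complete records, unfinished remainder)."""
    parts = buffer.split(delimiter)
    # The last item is whatever followed the final delimiter; it is an
    # empty string when the buffer ended exactly on a delimiter.
    remainder = parts.pop()
    return parts, remainder

terminated = split_complete("a|b|", "|")
unterminated = split_complete("a|b", "|")
print(terminated)
print(unterminated)
```

An empty remainder therefore means the buffer ended on a delimiter, while a non-empty one is a record that is still waiting for its terminator.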
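    【Answer 2】:

    I have modified Mark Byers's suggestion so that we can read a file with NUL-delimited lines in Python. This approach reads a potentially large file line by line and should be more memory efficient. Here is the Python code (with comments):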

    import sys
    
    # Variables for "fileReadLine()"
    inputFile = sys.stdin   # The input file. Use "stdin" as an example for receiving data from pipe.
    lines = []   # Extracted complete lines (delimited with "inputNewline").
    partialLine = ''   # Extracted last non-complete partial line.
    inputNewline="\0"   # Newline character(s) in input file.
    outputNewline="\n"   # Newline character(s) in output lines.
    readSize=8192   # Size of read buffer.
    # End - Variables for "fileReadLine()"
    
    # This function reads NUL delimited lines sequentially and is memory efficient.
    def fileReadLine():
       """Like the normal file readline but you can set what string indicates newline.
    
       The newline string can be arbitrarily long; it need not be restricted to a
       single character. You can also set the read size and control whether or not
       the newline string is left on the end of the read lines.  Setting
       newline to '\0' is particularly good for use with an input file created with
       something like "os.popen('find -print0')".
       """
       # Declare that we want to use these related global variables.
       global inputFile, partialLine, lines, inputNewline, outputNewline, readSize
       if lines: 
           # If there are already extracted complete lines, pop the 1st line from lines and return that line + outputNewline.
           line = lines.pop(0)
           return line + outputNewline
       # If there are NO already extracted complete lines, try to read more from the input file.
       while True:   # Here "lines" must be an empty list.
           charsJustRead = inputFile.read(readSize)   # The read buffer size, "readSize", could be changed as you like.
           if not charsJustRead:   
              # Have reached EOF. 
              if partialLine:
                 # If partialLine is not empty here, treat it as a complete line and copy and return it.
                 poppedPartialLine = partialLine
                 partialLine = ""   # partialLine is now copied for return; reset it to an empty string so that later "fileReadLine" calls have nothing left to return.
                 return poppedPartialLine   # This should be the last line of the input file.
              else:
                 # If reached EOF and partialLine is empty, then all the lines in input file must have been read. Return None to indicate this.
                 return None
           partialLine += charsJustRead   # If read buffer is not empty, add it to partialLine.
           lines = partialLine.split(inputNewline)   # Split partialLine to get some complete lines.
           partialLine = lines.pop()   # The last item of lines may not be a complete line, move it to partialLine.
           if not lines:
              # Empty "lines" means that we have NOT yet finished reading any complete line. So continue.
              continue
           else:
              # We must have finished reading at least 1 complete line. So pop the 1st line from lines and return that line + outputNewline (exit while loop).
              line = lines.pop(0)
              return line + outputNewline
    
    
    # As an example, read NUL delimited lines from "stdin" and print them out (using "\n" to delimit output lines).
    while True:
        line = fileReadLine()
        if line is None: break
        sys.stdout.write(line)   # "write" does not include "\n".
        sys.stdout.flush() 
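
The same readline-style logic can also be packaged without module-level globals. The Python 3 sketch below is one possible arrangement, not the answer's own code: the class name `RecordReader`, its parameters, and the choice to return bytes without an appended newline are all assumptions made for illustration.

```python
import io
from collections import deque

class RecordReader:
    """Read delimiter-separated records one at a time from a binary stream."""

    def __init__(self, stream, delimiter=b"\0", read_size=8192):
        self._stream = stream
        self._delimiter = delimiter
        self._read_size = read_size
        self._pending = deque()   # complete records not yet handed out
        self._partial = b""       # unfinished tail of the last chunk
        self._eof = False

    def read_record(self):
        """Return the next record, or None once the stream is exhausted."""
        while not self._pending and not self._eof:
            chunk = self._stream.read(self._read_size)
            if not chunk:
                self._eof = True
                if self._partial:
                    # Unterminated final record.
                    self._pending.append(self._partial)
                    self._partial = b""
                break
            parts = (self._partial + chunk).split(self._delimiter)
            self._partial = parts.pop()
            self._pending.extend(parts)
        return self._pending.popleft() if self._pending else None

reader = RecordReader(io.BytesIO(b"one\0two\0three"), read_size=4)
results = [reader.read_record() for _ in range(4)]
print(results)
```

Keeping the state on an instance lets several files be read at the same time, which the global-variable version does not allow.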
    

    Hope this helps.

    【Comments】:
