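【Posted】: 2010-10-07 11:39:59
【Problem Description】:
Is it possible to split a file? For example, you have a huge word list, and I want to split it into more than one file. How is this possible?
【Comments】:
-
That is certainly possible. If you want useful answers, you may want to provide some useful details.
-
Do you want to do it with Python? How is the file structured? Is it a text file?
Tags: python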
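This splits a file on newlines and writes it back out. You can easily change the delimiter. It also handles uneven amounts, i.e. an input file whose line count is not a multiple of splitLen (20 in this example).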
splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
【Discussion】:
A better loop for sli's example that does not hog memory:
splitLen = 20         # 20 lines per file
outputBase = 'output' # output0.txt, output1.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1

# Close the last output file and the input file
if dest: dest.close()
input.close()
【Discussion】:
A solution for splitting binary files into chapters named .000, .001, etc.:
FILE = 'scons-conversion.7z'

MAX = 500*1024*1024  # 500MB - max chapter size
BUF = 50*1024*1024   # 50MB  - memory buffer size

chapters = 0
uglybuf = b''  # one look-ahead byte, used to detect end of input
with open(FILE, 'rb') as src:
    while True:
        tgt = open(FILE + '.%03d' % chapters, 'wb')
        written = 0
        while written < MAX:
            if len(uglybuf) > 0:
                tgt.write(uglybuf)
            tgt.write(src.read(min(BUF, MAX - written)))
            written += min(BUF, MAX - written)
            uglybuf = src.read(1)
            if len(uglybuf) == 0:
                break
        tgt.close()
        if len(uglybuf) == 0:
            break
        chapters += 1
【Discussion】:
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'rb') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'wb') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        # Count the bytes actually read, not the buffer size
                        written += len(data)
                    else:
                        # suffix is zero-based, so add one for the count
                        return suffix + 1
                suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'wb') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'rb') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
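One thing to watch in cat_files: sorted(infiles) compares names as strings, so once there are ten or more parts, a name ending in .10 sorts before one ending in .2. A minimal sketch of sorting on the numeric suffix instead, assuming the parts are named prefix.N the way split_file writes them. The helper name numeric_suffix and the sample names are made up for illustration:

```python
def numeric_suffix(path):
    """Return the integer after the last '.' in a part name such as 'data.12'."""
    return int(path.rsplit('.', 1)[1])


# Hypothetical part names, deliberately out of order.
parts = ['data.10', 'data.2', 'data.0', 'data.1']

# Plain sorted() compares strings character by character.
lexicographic = sorted(parts)

# Sorting on the numeric suffix restores the order the parts were written in.
numeric = sorted(parts, key=numeric_suffix)
```

Passing the result of sorted(parts, key=numeric_suffix) to a concatenation routine that keeps the given order would then join the parts in the sequence they were written.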
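【Discussion】:
There is a bug if max_size is an integer multiple of 1024: written <= max_size should be written < max_size. I could not edit it because the change removes only a single character.
Sure it's possible: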
open input file
open output file 1
count = 0
for each line in file:
    write to output file
    count = count + 1
    if count >= maxlines:
        close output file
        open next output file
        count = 0
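The outline above could be turned into Python roughly as follows. This is only a sketch: the function name split_lines, the outputN.txt naming and the temporary test file are assumptions, not part of the answer. It differs from the outline in one respect: the next output file is opened only when there is a line to write, so no empty trailing file appears when the input length is an exact multiple of the limit.

```python
import os
import tempfile


def split_lines(src_path, out_prefix, max_lines):
    """Copy src_path into numbered files of at most max_lines lines each.

    Returns the list of output paths in the order they were written.
    """
    paths = []
    out = None
    count = 0
    with open(src_path) as src:
        for line in src:
            # Open the next output lazily, only when a line is waiting.
            if out is None:
                path = "{}{}.txt".format(out_prefix, len(paths) + 1)
                out = open(path, "w")
                paths.append(path)
            out.write(line)
            count += 1
            if count >= max_lines:
                out.close()
                out = None
                count = 0
    if out is not None:
        out.close()
    return paths


# Small self-check: a temporary 5-line file split into chunks of 2 lines.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    with open(src_path, "w") as f:
        f.write("a\nb\nc\nd\ne\n")
    created = split_lines(src_path, os.path.join(tmp, "output"), 2)
    chunks = []
    for p in created:
        with open(p) as f:
            chunks.append(f.read())
```

Only one line is held in memory at a time, so the size of the input file does not matter.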
【Discussion】:
import re

PATENTS = 'patent.data'

def split_file(filename):
    # Open file to read
    with open(filename, "r") as r:
        # Counter
        n = 0
        # Start reading file line by line
        for i, line in enumerate(r):
            # If the line matches the template "<?xml", increase counter n
            if re.match(r'\<\?xml', line):
                n += 1
                # This "if" can be deleted; without it, naming starts from 1.
                # Or you can keep it. It depends on where "re" first finds
                # the template. In my case it was the first line.
                if i == 0:
                    n = 0
            # Write lines to file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)
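As a result you will get:
patent.data-0
patent.data-1
patent.data-N
【Discussion】:
You can use the filesplit module from PyPI.
【Discussion】:
This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.
Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines named file_part1.txt, ..., file_partn.txt, you could do: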
import itertools

MAXLINES = 3  # lines per output file (example value)

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
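To see why this works, it may help to look at the grouping key on its own. The sketch below runs the same idea on a small in-memory list instead of a file; the sample lines and the chunk size of 3 are illustrative assumptions:

```python
import itertools

MAXLINES = 3  # illustrative chunk size
lines = ["l1\n", "l2\n", "l3\n", "l4\n", "l5\n", "l6\n", "l7\n"]

# enumerate() pairs each line with its 0-based index. The key maps
# indices 0-2 to chunk 1, indices 3-5 to chunk 2, and so on, and
# groupby() starts a new group whenever the key value changes.
chunks = {
    part: [line for _, line in group]
    for part, group in itertools.groupby(
        enumerate(lines), key=lambda pair: 1 + pair[0] // MAXLINES
    )
}
```

Because groupby only groups consecutive items with equal keys and the index only ever increases, each chunk is produced exactly once and nothing has to be buffered beyond the current line.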
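【Discussion】:
All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).
I found the following to be the most efficient way: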
import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    print(os.system(f"ls {PREFIX}??"))
else:
    print("Failed!")
Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
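If the file name or prefix might contain spaces or shell metacharacters, one possible variation is to pass the arguments as a list through subprocess.run instead of building a shell string. This is a sketch, not the answer's method: the helper names build_split_command and run_split are hypothetical, and it assumes a split binary that accepts -l and -d (-d is a GNU coreutils option and may be missing elsewhere).

```python
import shutil
import subprocess


def build_split_command(file_name, prefix, max_lines, numeric_suffixes=True):
    """Build the argument list for split(1) without going through a shell."""
    cmd = ["split", "-l", str(max_lines)]
    if numeric_suffixes:
        cmd.append("-d")  # numeric suffixes; assumed to be supported
    cmd += [file_name, prefix]
    return cmd


def run_split(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("split is not available on this system")
    subprocess.run(cmd, check=True)


# Same values as in the answer above; nothing is executed here.
cmd = build_split_command("input_file.txt", "__", 1000)
```

Calling run_split(cmd) would then perform the split. Since no shell is involved, the file name is handed to split as a single argument regardless of what characters it contains.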
【Discussion】: