【Posted】:2021-12-17 02:54:22
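【Question】:
I am writing code to read research data of up to a billion lines. I have to read the data line by line because it consists of multiple blocks, and each block has its own header and dataset, different from those of the other blocks. I want to read these datasets into NumPy matrices so that I can perform matrix operations on them. The basic code is below.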
import numpy as np

with open(datafile, "r") as dump:
    i = 0  # line number within the current block
    line_no = 0  # total line number
    block_size = 0
    block_count = 0
    for line in dump:
        values = line.split()
        i += 1
        line_no += 1
        if i <= self.head_line_no:
            print(line)  # for test
        if self.tag_block in line or i == 1:  # 1st line of a block
            # save the block size once the 1st block has been read completely,
            # i.e. when the 2nd block starts
            if block_size == 0 and block_count == 1:
                block_size = line_no - 1
            i = 1  # reset block line number
            self.box = []  # reset box constant
            print(self.matrix)
            self.matrix = np.zeros((0, 0), dtype="float")  # reset matrix
            block_count += 1
        elif i == 2:
            self.timestamp.append(values[0])
        elif i == 3 or i == 5:
            continue
        elif i == 4:
            if self.atom_no != 0 and self.atom_no != values[0]:
                self.warning_message = ("atom number in timestep " + self.timestamp[-1]
                                        + " is inconsistent with " + self.timestamp[-2])
                config.ConfigureUserEnv.log(self.warning_message)
            self.atom_no = values[0]
        elif i == 6 or i == 7 or i == 8:
            self.box.append(values[0])
            self.box.append(values[1])
        elif i == self.head_line_no:
            values = line.rstrip().rsplit(":")
            self.column_name.extend(values[1:])
        else:
            # one data row as a 1 x n float array, so the matrix stays 2-D
            row = np.asarray(values, dtype="float").reshape(1, -1)
            if self.matrix.size != 0:
                # np.append copies the whole matrix on every call
                self.matrix = np.append(self.matrix, row, axis=0)
            else:
                self.matrix = row  # first data row of the block
# the with statement closes the file, so no explicit dump.close() is needed
print(self.matrix)  # for test
print(self.matrix.size)  # for test
The raw data looks like this:
ITEM: TIMESTEP
100
ITEM: NUMBER OF ATOMS
17587
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 1.2994000000000000e+02
ITEM: ATOMS id type q xs ys zs
59 1 1.80278 0.110598 0.129682 0.0359397
297 1 1.14132 0.139569 0.0496654 0.00692627
315 1 1.17041 0.0832356 0.00620818 0.00507927
509 1 1.67165 0.0420777 0.113817 0.0313991
590 1 1.65209 0.114966 0.0630015 0.0447129
731 1 1.65143 0.0501253 0.13658 0.0108512
1333 2 1.049 0.00850751 0.0526546 0.0406341
......
I would like the appended matrix data to look like this:
matrix = [[59 1 1.80278 0.110598 0.129682 0.0359397],
[297 1 1.14132 0.139569 0.0496654 0.00692627],
[315 1 1.17041 0.0832356 0.00620818 0.00507927],
...]
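As mentioned above, the dataset is very large. I would like to use the fastest way to append arrays to the matrix. Any further help and suggestions would be greatly appreciated.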
【Discussion】:
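One possible direction, offered as a minimal sketch rather than a definitive answer: `np.append` allocates a new array and copies the old one on every call, so appending row by row costs time quadratic in the number of rows. Because each block header in the sample already states the row count (the line after `ITEM: NUMBER OF ATOMS`), the matrix for a block can be allocated once and filled in place. The sketch assumes every block has exactly the 9-line header shown in the question and that all data columns are numeric. The function name `read_blocks` and the shortened `sample` text are hypothetical and not part of the original code.

```python
import io
from itertools import islice

import numpy as np

HEADER_LINES = 9  # assumption: every block starts with the 9 header lines shown above


def read_blocks(handle):
    """Yield (timestep, box, columns, matrix) for each block of an open text file."""
    while True:
        header = list(islice(handle, HEADER_LINES))
        if len(header) < HEADER_LINES:  # end of file, or a truncated trailing header
            return
        timestep = int(header[1])
        n_atoms = int(header[3])
        # three "lo hi" lines describe the box bounds
        box = [float(v) for row in header[5:8] for v in row.split()]
        # "ITEM: ATOMS id type q xs ys zs" -> ["id", "type", "q", "xs", "ys", "zs"]
        columns = header[8].split()[2:]
        # allocate the block matrix once instead of growing it row by row
        matrix = np.empty((n_atoms, len(columns)), dtype=float)
        rows_read = 0
        for line in islice(handle, n_atoms):
            matrix[rows_read] = [float(v) for v in line.split()]
            rows_read += 1
        if rows_read != n_atoms:
            raise ValueError(f"block at timestep {timestep} is truncated")
        yield timestep, box, columns, matrix


# Shortened, hypothetical two-atom block in the same layout as the sample data.
sample = (
    "ITEM: TIMESTEP\n100\n"
    "ITEM: NUMBER OF ATOMS\n2\n"
    "ITEM: BOX BOUNDS pp pp pp\n"
    "0.0000000000000000e+00 4.3491000000000000e+01\n"
    "0.0000000000000000e+00 4.3491000000000000e+01\n"
    "0.0000000000000000e+00 1.2994000000000000e+02\n"
    "ITEM: ATOMS id type q xs ys zs\n"
    "59 1 1.80278 0.110598 0.129682 0.0359397\n"
    "297 1 1.14132 0.139569 0.0496654 0.00692627\n"
)

for timestep, box, columns, matrix in read_blocks(io.StringIO(sample)):
    print(timestep, columns, matrix.shape)
```

With a real file, the same generator could be driven by `with open(datafile) as dump: for block in read_blocks(dump): ...`, which keeps only one block in memory at a time. If the row count were not available in the header, collecting the rows of a block in a Python list and calling `np.array(rows, dtype=float)` once per block would also avoid the repeated copying.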
Tags: python numpy performance