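【Posted】: 2017-06-05 22:26:07
【Question】:
This is my first time using HDF5. Can you help me figure out why adding a 3D numpy array is so slow? Preprocessing takes 3 s, but adding one 3D numpy array (100x512x512) takes 30 s, and the time goes up with every sample.
First I create the HDF5 file with: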
import h5py
import numpy as np


def create_h5(fname_):
    """
    Run only once
    to create h5 file for dicom images
    """
    f = h5py.File(fname_, 'w', libver='latest')
    dtype_ = h5py.special_dtype(vlen=bytes)  # variable-length byte strings (patient ids)
    num_samples_train = 1397
    num_samples_test = 1595 - 1397
    num_slices = 100
    f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512),
                     dtype=np.int16, maxshape=(None, None, 512, 512),
                     chunks=True, compression="gzip", compression_opts=4)
    f.create_dataset('y_train', (num_samples_train,), dtype=np.int16,
                     maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
    f.create_dataset('i_train', (num_samples_train,), dtype=dtype_,
                     maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
    f.create_dataset('X_test', (num_samples_test, num_slices, 512, 512),
                     dtype=np.int16, maxshape=(None, None, 512, 512), chunks=True,
                     compression="gzip", compression_opts=4)
    f.create_dataset('y_test', (num_samples_test,), dtype=np.int16, maxshape=(None, ), chunks=True,
                     compression="gzip", compression_opts=4)
    f.create_dataset('i_test', (num_samples_test,), dtype=dtype_,
                     maxshape=(None, ),
                     chunks=True, compression="gzip", compression_opts=4)
    f.flush()
    f.close()
    print('HDF5 file created')
Then I run this code to fill the HDF5 file:
import os
import time

import h5py
import pandas as pd

# lbl_fldr, dicom_fldr, h5_fname and the helpers load_scan, get_pixels_hu and
# select_slices are defined elsewhere (not shown in the post).
num_samples_train = 1397
num_samples_test = 1595 - 1397
lbl = pd.read_csv(lbl_fldr + 'stage1_labels.csv')
patients = os.listdir(dicom_fldr)
patients.sort()
f = h5py.File(h5_fname, 'a')  # 'r+' was tried as well
train_counter = -1
test_counter = -1
for sample in range(0, len(patients)):
    sw_start = time.time()
    pat_id = patients[sample]
    print('id: %s sample: %d \t train_counter: %d test_counter: %d' %(pat_id, sample, train_counter+1, test_counter+1), flush=True)
    sw_1 = time.time()
    patient = load_scan(dicom_fldr + patients[sample])
    patient_pixels = get_pixels_hu(patient)
    patient_pixels = select_slices(patient_pixels)
    if patient_pixels.shape[0] != 100:
        raise ValueError('Slices != 100: ', patient_pixels.shape[0])
    row = lbl.loc[lbl['id'] == pat_id]
    if row.shape[0] > 1:
        raise ValueError('Found duplicate ids: ', row.shape[0])
    print('Time preprocessing: %0.2f' %(time.time() - sw_1), flush=True)
    sw_2 = time.time()
    # no label row found: this is a test patient
    if row.shape[0] == 0:
        test_counter += 1
        f['X_test'][test_counter] = patient_pixels
        f['i_test'][test_counter] = pat_id
        f['y_test'][test_counter] = -1
    # label row found: this is a train patient
    else:
        train_counter += 1
        f['X_train'][train_counter] = patient_pixels
        f['i_train'][train_counter] = pat_id
        f['y_train'][train_counter] = row.cancer
    print('Time saving: %0.2f' %(time.time() - sw_2), flush=True)
    sw_el = time.time() - sw_start
    sw_rem = sw_el * (len(patients) - sample)
    print('Elapsed: %0.2fs \t rem: %0.2fm %0.2fh ' %(sw_el, sw_rem/60, sw_rem/3600), flush=True)
f.flush()
f.close()
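One thing that may be worth checking before timing anything is the chunk shape that `chunks=True` picked, since h5py guesses it automatically. The sketch below is not from the original post. It creates an empty file with the same `X_train` shape, prints the guessed chunk shape, and counts how many chunks a single `dset[i] = volume` assignment overlaps. The helper `chunks_touched_per_sample`, the file name `chunk_probe.h5`, and the explicit chunk shape `(1, 10, 512, 512)` are assumptions chosen for illustration only.

```python
import math
import os
import tempfile

import h5py
import numpy as np


def chunks_touched_per_sample(dset):
    """Count the chunks overlapped by one full-sample write, dset[i] = volume."""
    # dset[i] covers one index on axis 0 and the full extent of every other axis,
    # so it overlaps ceil(extent / chunk_extent) chunks along each of those axes.
    return math.prod(
        math.ceil(n / c) for n, c in zip(dset.shape[1:], dset.chunks[1:])
    )


shape = (1397, 100, 512, 512)
common = dict(dtype=np.int16, maxshape=(None, None, 512, 512),
              compression="gzip", compression_opts=4)

path = os.path.join(tempfile.mkdtemp(), "chunk_probe.h5")  # hypothetical name
with h5py.File(path, "w") as f:
    # Same settings as in the question: h5py guesses the chunk shape.
    auto = f.create_dataset("auto", shape, chunks=True, **common)
    # Assumed alternative: every chunk belongs to exactly one sample.
    explicit = f.create_dataset("explicit", shape,
                                chunks=(1, 10, 512, 512), **common)

    auto_chunks, explicit_chunks = auto.chunks, explicit.chunks
    auto_touched = chunks_touched_per_sample(auto)
    explicit_touched = chunks_touched_per_sample(explicit)

print("auto chunks:", auto_chunks, "-> chunks per sample:", auto_touched)
print("explicit chunks:", explicit_chunks, "-> chunks per sample:", explicit_touched)
```

If the first entry of the guessed chunk shape is larger than 1, each chunk holds data from several samples. With gzip enabled, a per-sample write may then have to read, decompress, modify, and recompress chunks that were already written, which could contribute to the slow saves described here. Only a measurement on the real data can confirm that.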
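【Discussion】:
- So you are taking 1500 patient files and collecting them into a single HDF5 file, using chunking and compression along the way. I would start with a subset of those files and explore the effect of the various HDF5 settings (chunks, compression). Each 3D array is 52 MB, right? Does it make any difference to put them in separate datasets rather than in one 4D array?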
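The experiment suggested in this comment could look roughly like the sketch below. It is an illustration, not the asker's code. It uses small synthetic volumes instead of real DICOM data, and the names `time_4d`, `time_separate`, `N_SAMPLES`, and `VOL_SHAPE` as well as the file names are made up for the example. It times per-sample writes into one 4D dataset under a few chunk and compression settings, and compares them with writing each volume to its own 3D dataset.

```python
import os
import tempfile
import time

import h5py
import numpy as np

# Synthetic stand-ins for the patient volumes (assumed sizes, kept small so the
# sketch runs quickly; the real volumes are 100 x 512 x 512 int16, about 52 MB).
N_SAMPLES, VOL_SHAPE = 4, (20, 128, 128)
rng = np.random.default_rng(0)
volumes = rng.integers(-1000, 2000, size=(N_SAMPLES,) + VOL_SHAPE, dtype=np.int16)


def time_4d(path, chunks, compression):
    """Write every volume into one 4D dataset, one sample at a time."""
    kwargs = {"chunks": chunks}
    if compression is not None:
        kwargs.update(compression=compression, compression_opts=4)
    start = time.perf_counter()
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("X", (N_SAMPLES,) + VOL_SHAPE,
                                dtype=np.int16, **kwargs)
        for i, vol in enumerate(volumes):
            dset[i] = vol
    return time.perf_counter() - start


def time_separate(path):
    """Write each volume into its own 3D dataset."""
    start = time.perf_counter()
    with h5py.File(path, "w") as f:
        for i, vol in enumerate(volumes):
            f.create_dataset("patient_%04d" % i, data=vol,
                             compression="gzip", compression_opts=4)
    return time.perf_counter() - start


tmp = tempfile.mkdtemp()
one_per_sample = (1,) + VOL_SHAPE  # each chunk holds exactly one volume
results = {
    "4D, chunks=True, gzip":
        time_4d(os.path.join(tmp, "auto.h5"), True, "gzip"),
    "4D, one chunk per sample, gzip":
        time_4d(os.path.join(tmp, "per_sample.h5"), one_per_sample, "gzip"),
    "4D, one chunk per sample, no compression":
        time_4d(os.path.join(tmp, "raw.h5"), one_per_sample, None),
    "separate 3D datasets, gzip":
        time_separate(os.path.join(tmp, "separate.h5")),
}
for label, seconds in results.items():
    print("%-42s %.3f s" % (label, seconds))
```

Random integers compress much worse than real CT slices, so the absolute numbers mean little. The point of such a run is the relative ordering of the settings, which should then be rechecked on a handful of real patient volumes at full size.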