Why is training with a custom Python layer in Pycaffe extremely slow? Answers

【Question title】: Why is training with a custom Python layer in Pycaffe extremely slow?
【Posted】: 2018-07-20 16:36:34
【Question description】:

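I created a custom layer in Python so that I can feed the data directly.
But I noticed that it runs extremely slowly, and GPU usage is at most 1%. The memory is allocated: when I run the script it allocates 2100MB of VRAM, and terminating the training frees around 1GB.
I'm not sure whether this is expected behavior or whether I'm doing something wrong.
Here is the script I wrote (based on this former pr):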

import json
import caffe
import numpy as np
from random import shuffle
from PIL import Image


class MyDataLayer(caffe.Layer):

    """
    This is a simple datalayer for training a network on CIFAR10.
    """

    def setup(self, bottom, top):

        self.top_names = ['data', 'label']

        # === Read input parameters ===
        # param_str is a JSON string, so parse it with json.loads rather than eval
        params = json.loads(self.param_str)

        # Check the parameters for validity.
        check_params(params)

        # store input as class variables
        self.batch_size = params['batch_size']

        # Create a batch loader to load the images.
        self.batch_loader = BatchLoader(params, None)

        # === reshape tops ===
        # since we use a fixed input image size, we can shape the data layer
        # once. Else, we'd have to do it in the reshape call.
        top[0].reshape(self.batch_size, 3, params['im_height'], params['im_width'])
        # this is for our label, since we only have one label we set this to 1
        top[1].reshape(self.batch_size, 1)

        print_info("MyDataLayer", params)

    def forward(self, bottom, top):
        """
        Load data.
        """
        for itt in range(self.batch_size):
            # Use the batch loader to load the next image.
            im, label = self.batch_loader.load_next_image()

            # Add directly to the caffe data layer
            top[0].data[itt, ...] = im
            top[1].data[itt, ...] = label

    def reshape(self, bottom, top):
        """
        There is no need to reshape the data, since the input is of fixed size
        (rows and columns)
        """
        pass

    def backward(self, top, propagate_down, bottom):
        """
        This layer does not backpropagate
        """
        pass


class BatchLoader(object):

    """
    This class abstracts away the loading of images.
    Images can either be loaded singly, or in a batch. The latter is used for
    the asynchronous data layer to preload batches while other processing is
    performed.

    labels:
    the format is like : 
    png_data_batch_1/leptodactylus_pentadactylus_s_000004.png 6
    png_data_batch_1/camion_s_000148.png 9
    png_data_batch_1/tipper_truck_s_001250.png 9
    """

    def __init__(self, params, result):
        self.result = result
        self.batch_size = params['batch_size']
        self.image_root = params['image_root']
        self.im_shape = [params['im_height'],params['im_width']]

        # get list of images and their labels.
        self.image_labels = params['label']
        #getting the list of all image filenames along with their labels
        with open(self.image_labels) as label_file:
            self.imagelist = [line.rstrip('\n\r') for line in label_file]
        self._cur = 0  # current image
        # this class does some simple data-manipulations
        self.transformer = SimpleTransformer()

        print ("BatchLoader initialized with {} images".format(len(self.imagelist)))

    def load_next_image(self):
        """
        Load the next image in a batch.
        """
        # Did we finish an epoch?
        if self._cur == len(self.imagelist):
            self._cur = 0
            shuffle(self.imagelist)

        # Load an image
        image_and_label = self.imagelist[self._cur]  # Get the image index
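        # read the image filename: everything before the last space
        # (slicing with [0:-1] would keep the separating space in the path)
        image_file_name = image_and_label.rsplit(' ', 1)[0]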
        #load the image
        im = np.asarray(Image.open(self.image_root +'/'+image_file_name))
        #im = scipy.misc.imresize(im, self.im_shape)  # resize

        # do a simple horizontal flip as data augmentation
        flip = np.random.choice(2)*2-1
        im = im[:, ::flip, :]

        # Load and prepare ground truth

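        # read the label: the integer after the last space
        # (indexing with [-1] would return a one-character string)
        label = int(image_and_label.rsplit(' ', 1)[1])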
        #convert to onehot encoded vector
        #fix: caffe automatically converts the label into one hot encoded vector. so we only need to simply use the decimal number (i.e. the plain label number)
        #one_hot_label = np.eye(10)[label]

        self._cur += 1
        return self.transformer.preprocess(im), label


def check_params(params):
    """
    A utility function to check the parameters for the data layers.
    """
    required = ['batch_size', 'image_root', 'im_width', 'im_height', 'label']
    for r in required:
        assert r in params.keys(), 'Params must include {}'.format(r)


def print_info(name, params):
    """
    Output some info regarding the class
    """
    # one placeholder per argument; height and width are reported together
    print("{} initialized with image_root: {}, bs: {}, im_shape: {}, labels: {}.".format(
        name,
        params['image_root'],
        params['batch_size'],
        (params['im_height'], params['im_width']),
        params['label']))


class SimpleTransformer:

    """
    SimpleTransformer is a simple class for preprocessing and deprocessing
    images for caffe.
    """

    def __init__(self, mean=[125.30, 123.05, 114.06]):
        self.mean = np.array(mean, dtype=np.float32)
        self.scale = 1.0

    def set_mean(self, mean):
        """
        Set the mean to subtract for centering the data.
        """
        self.mean = mean

    def set_scale(self, scale):
        """
        Set the data scaling.
        """
        self.scale = scale

    def preprocess(self, im):
        """
        preprocess() emulates the pre-processing occurring in the vgg16 caffe
        prototxt.
        """

        im = np.float32(im)
        im = im[:, :, ::-1]  # change to BGR
        im -= self.mean
        im *= self.scale
        im = im.transpose((2, 0, 1))

        return im

    def deprocess(self, im):
        """
        inverse of preprocess()
        """
        im = im.transpose(1, 2, 0)
        im /= self.scale
        im += self.mean
        im = im[:, :, ::-1]  # change to RGB

        return np.uint8(im)

In my train_test.prototxt file I have:

name: "CIFAR10_SimpleTest_PythonLayer"
layer {
  name: 'MyPythonLayer'
  type: 'Python'
  top: 'data'
  top: 'label'
  include {
    phase: TRAIN
   }
  python_param {
    #the python script filename
    module: 'mypythonlayer'
    #the class name
    layer: 'MyDataLayer'
    #needed parameters in json
    param_str: '{"phase":"TRAIN", "batch_size":10, "im_height":32, "im_width":32, "image_root": "G:/Caffe/examples/cifar10/testbed/Train and Test using Pycaffe", "label": "G:/Caffe/examples/cifar10/testbed/Train and Test using Pycaffe/train_cifar10.txt"}'
  }
}

layer {
  name: 'MyPythonLayer'
  type: 'Python'
  top: 'data'
  top: 'label'
  include {
    phase: TEST
   }
  python_param {
    #the python script filename
    module: 'mypythonlayer'
    #the class name
    layer: 'MyDataLayer'
    #needed parameters in json
    param_str: '{"phase":"TEST", "batch_size":10, "im_height":32, "im_width":32, "image_root": "G:/Caffe/examples/cifar10/testbed/Train and Test using Pycaffe", "label": "G:/Caffe/examples/cifar10/testbed/Train and Test using Pycaffe/test_cifar10.txt"}'
  }
}
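
As a side note on the two param_str values above: they are valid JSON, so a minimal sketch of parsing and validating one of them outside Caffe could look like the following. The names `param_str`, `params`, and `missing` are only illustrative; the required keys are the ones listed in check_params.

```python
import json

# The TRAIN param_str copied from the prototxt above.
param_str = (
    '{"phase":"TRAIN", "batch_size":10, "im_height":32, "im_width":32, '
    '"image_root": "G:/Caffe/examples/cifar10/testbed/Train and Test using Pycaffe", '
    '"label": "G:/Caffe/examples/cifar10/testbed/Train and Test using Pycaffe/train_cifar10.txt"}'
)

# json.loads only accepts JSON, unlike eval, which would run arbitrary Python.
params = json.loads(param_str)

# Same keys that check_params requires.
required = ['batch_size', 'image_root', 'im_width', 'im_height', 'label']
missing = [key for key in required if key not in params]

print("parsed keys:", sorted(params))
print("missing keys:", missing)
```

Running this kind of check before starting the solver catches a malformed param_str without waiting for the network to be built.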

Is there anything wrong here?

【Question discussion】:

Tags: machine-learning neural-network deep-learning caffe pycaffe

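【Solution 1】:

Your data layer is not efficient enough, and it takes up most of the training time (you should try caffe time ... to get a more detailed profile). In every forward pass you are waiting for the Python layer to read batch_size images from disk, one after another. This can take forever. You should consider using Multiprocessing to perform the reading in the background while the net is processing the previous batch: this should give you good CPU/GPU utilization.
For a multiprocessing Python data layer, see this example.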
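
The background-loading idea in this answer can be sketched roughly as below. This is not the linked example: it is a minimal producer/consumer sketch that uses a `threading.Thread` and a bounded `queue.Queue` instead of Multiprocessing, purely to keep it short; a `multiprocessing.Process` with a `multiprocessing.Queue` would follow the same shape. `BatchPrefetcher`, `fake_loader`, and the `depth` parameter are hypothetical names introduced for illustration, and the loader is assumed to return a (channels, height, width) array plus an integer label, as load_next_image in the question does.

```python
import queue
import threading

import numpy as np


class BatchPrefetcher:
    """Assembles batches on a background thread and hands them over via a bounded queue."""

    def __init__(self, load_next_image, batch_size, im_shape, depth=4):
        self._load = load_next_image      # callable returning (CxHxW array, label)
        self._batch_size = batch_size
        self._im_shape = tuple(im_shape)  # (channels, height, width)
        self._queue = queue.Queue(maxsize=depth)  # at most `depth` batches wait here
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while not self._stop.is_set():
            data = np.empty((self._batch_size,) + self._im_shape, dtype=np.float32)
            labels = np.empty((self._batch_size, 1), dtype=np.float32)
            for i in range(self._batch_size):
                data[i], labels[i, 0] = self._load()
            # Put with a timeout so a stop request is noticed even if the queue is full.
            while not self._stop.is_set():
                try:
                    self._queue.put((data, labels), timeout=0.1)
                    break
                except queue.Full:
                    continue

    def next_batch(self):
        """Block until a complete batch is available, then return (data, labels)."""
        return self._queue.get()

    def close(self):
        self._stop.set()
        self._worker.join(timeout=1.0)


def fake_loader():
    # Stand-in for BatchLoader.load_next_image: one zero image with label 7.
    return np.zeros((3, 32, 32), dtype=np.float32), 7


prefetcher = BatchPrefetcher(fake_loader, batch_size=10, im_shape=(3, 32, 32))
data, labels = prefetcher.next_batch()
prefetcher.close()
print(data.shape, labels.shape)
```

Under these assumptions, setup would construct the prefetcher around self.batch_loader.load_next_image, and forward would shrink to fetching one ready batch and copying it with top[0].data[...] = data and top[1].data[...] = labels. Only the worker calls the loader, so the loader itself needs no locking.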

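【Discussion】:

Thanks. I tried to see how much time the forward pass takes, and this is the result for 5 iterations: FW took: 0.0030002593994140625 FW took: 0.003001689910888672 FW took: 0.0030007362365722656 FW took: 0.0030069351196289062 FW took: 0.002504110336303711. I used time.time() before and after the loop in forward. I think these are fine, and I shouldn't be seeing such a huge performance hit!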
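
The kind of measurement described in this comment could be reproduced with something like the sketch below. `time_calls` and `fake_forward` are hypothetical names, and `fake_forward` merely sleeps for about 3 ms as a stand-in for the real forward loop; `time.perf_counter` is used in place of time.time() because it is intended for measuring short intervals.

```python
import time


def time_calls(fn, repeats=5):
    """Return the wall-clock duration, in seconds, of each of `repeats` calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_forward():
    # Stand-in for the body of MyDataLayer.forward.
    time.sleep(0.003)


timings = time_calls(fake_forward)
for t in timings:
    print("FW took: {:.6f}".format(t))
```

Comparing these per-call numbers with the time of a full solver iteration would show how much of each iteration the data layer actually accounts for.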

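【Solution 2】:

The Python layer is executed on the CPU rather than the GPU, so it is slow: during training, execution has to keep switching back and forth between the CPU and the GPU. This is also why you see low GPU usage, because the GPU is waiting for the CPU to execute the Python layer.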

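【Discussion】:

That is not the case here: the Python layer is an input layer, so there is no GPU-to-CPU synchronization. In most input layers, the data is read by the CPU/host from disk and only then synced to GPU memory for further processing.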