How to pair matrices that are approximately the same in another numpy array

【Question title】: How to pair matrices that are approximately the same in another numpy array
【Posted】: 2023-03-23 21:52:01
【Question description】:

Background

I have the following code, which works like a charm, for building pairs for a Siamese network:

import numpy as np

def make_pairs(images, labels):
    # initialize two empty lists to hold the (image, image) pairs and
    # labels to indicate if a pair is positive or negative
    pairImages = []
    pairLabels = []

    # np.unique finds all unique class labels in the labels list, so its
    # length is the total number of classes in the dataset
    # (10 for MNIST, corresponding to the digits 0-9)
    numClasses = len(np.unique(labels))

    # idx[i] holds the indexes of all examples that belong to class i
    idx = [np.where(labels == i)[0] for i in range(0, numClasses)]

    # let's now start generating our positive and negative pairs
    for idxA in range(len(images)):
        # grab the current image and label belonging to the current
        # iteration
        currentImage = images[idxA]
        label = labels[idxA]

        # randomly pick an image that belongs to the *same* class
        # label
        idxB = np.random.choice(idx[label])
        posImage = images[idxB]

        # prepare a positive pair and update the images and labels
        # lists, respectively
        pairImages.append([currentImage, posImage])
        pairLabels.append([1])

        # grab the indices for each of the class labels *not* equal to
        # the current label and randomly pick an image corresponding
        # to a label *not* equal to the current label
        negIdx = np.where(labels != label)[0]
        negImage = images[np.random.choice(negIdx)]

        # prepare a negative pair of images and update our lists
        pairImages.append([currentImage, negImage])
        pairLabels.append([0])

    # return a 2-tuple of our image pairs and labels
    return (np.array(pairImages), np.array(pairLabels))

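OK, this code works by choosing pairs for every image in the MNIST dataset. For each image, it builds one pair by randomly choosing another image of the same class (label), and a second pair by randomly choosing an image of a different class (label). Running the code, the two returned arrays have the following final shapes: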

# assumed import (not shown in the original post)
from tensorflow.keras.datasets import mnist

# load the MNIST dataset (no pixel scaling is applied in this snippet)
print("[INFO] loading MNIST dataset...")
(trainX, trainY), (testX, testY) = mnist.load_data()

# build the positive and negative image pairs
print("[INFO] preparing positive and negative pairs...")
(pairTrain, labelTrain) = make_pairs(trainX, trainY)
(pairTest, labelTest) = make_pairs(testX, testY)

>> print(pairTrain.shape)
(120000, 2, 28, 28)
>> print(labelTrain.shape)
(120000, 1)

My dataset

I want to do something a bit different with another dataset. Suppose I have another dataset of 5600 RGB images of size 28x28x3, as shown below:

>>> images2.shape
(5600, 28, 28, 3)

I also have another array, let's call it labels2, which holds 8 labels for all 5600 images, with 700 images per label, as shown below:

>>> labels2.shape
(5600,)

>>> len(np.unique(labels2))
8

>>> (labels2==0).sum()
700
>>> (labels2==1).sum()
700
>>> (labels2==2).sum()
700
...

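What I want to do

My dataset is not the MNIST dataset, so images from the same class are not that similar. I would like to build pairs that are approximately the same, in the following way:

For each image in my dataset, I want to do the following:

1.1 Compute the similarity, via MSE, between that image and all other images in the dataset.

1.2 From the set of MSEs of images with the same label as that image, select the images with the 7 smallest MSEs and build 7 pairs, each containing that image plus one of its 7 closest images. These pairs represent same-class images for my Siamese network.

1.3 From the set of MSEs of images with a different label, select, for each different label, only the one image with the smallest MSE. Since 7 labels differ from that image's label, this gives 7 more pairs for that image.

Since my dataset has 5600 images of size 28x28x3, and I build 14 pairs for each image (7 of the same class and 7 of different classes), I expect a pairTrain matrix of size (78400, 2, 28, 28, 3).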

What I did

I have the following code, which does exactly what I want:

def make_pairs(images, labels):
    # initialize two empty lists to hold the (image, image) pairs and
    # labels to indicate if a pair is positive or negative
    pairImages = []
    pairLabels = []

    # In my dataset, there are 8 unique class labels, corresponding to the classes 0-7.
    numClasses = len(np.unique(labels))

    # progress counter
    k = 0

    # let's now start generating our positive and negative pairs for each image in the dataset
    for idxA in range(len(images)):
        print("Image " + str(k) + " out of " + str(len(images)))
        k = k + 1

        # For each image, I need to store the MSE between it and all the others
        mse_all = []

        # Get each image and its label
        currentImage = images[idxA]
        label = labels[idxA]

        # Now we need to iterate through all the other images
        for idxB in range(len(images)):
            candidateImage = images[idxB]
            # Calculate the mse and store all mses
            mse = np.mean(candidateImage - currentImage)**2
            mse_all.append(mse)

        mse_all = np.array(mse_all)

        # When we have finished calculating the mse between the currentImage and all the others,
        # let's add 7 pairs that have the smallest mse in the case of images from the
        # same class and 1 pair for each different class

        # For all classes, do
        for i in range(0, numClasses):

            # get indices of images for that class
            idxs = [np.where(labels == i)[0]]

            # Get images of that class
            imgs = images[np.array(idxs)]
            imgs = np.squeeze(imgs, axis=0)

            # get MSEs between the currentImage and all the others of that class
            mse_that_class = mse_all[np.array(idxs)]
            mse_that_class = np.squeeze(mse_that_class, axis=0)

            # if the class is the same class of that image
            if i == label:
                # Get indices of that class that have the 7 smallest MSEs
                indices_sorted = np.argpartition(mse_that_class, numClasses-1)

            else:
                # Otherwise, get only the smallest MSE
                indices_sorted = np.argpartition(mse_that_class, 1)

            # Now, let's pair them
            for j in range(0, indices_sorted.shape[0]):

                image_to_pair = imgs[indices_sorted[j], :, :, :]
                pairImages.append([currentImage, image_to_pair])

                if i == label:
                    pairLabels.append([1])
                else:
                    pairLabels.append([0])
        del image_to_pair, currentImage, label, mse_that_class, imgs, indices_sorted, idxs, mse_all
    return (np.array(pairImages), np.array(pairLabels))

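My problem

The problem with my code is that when pair construction reaches image number 2200, it simply freezes my computer. I tried cleaning up variables after each loop iteration, as shown in the code above (del image_to_pair, currentImage, label, mse_that_class, imgs, indices_sorted, idxs, mse_all). The thing is, a (120000, 2, 28, 28) pairImages matrix is not hard to build, but a (78400, 2, 28, 28, 3) one is. So:

Could this be a memory problem?
Can I clean up more variables in the code to make it work?
Should I reconsider the last dimension of my pairImages matrix, so that it has smaller dimensions than in the first example and therefore works?
Is there a simpler way to solve my problem?

You can find the function code and the input matrices HERE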

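【Question discussion】:

You wrote mse=np.mean(candidateImage - currentImage)**2; I assume mse stands for mean squared error, in which case it should be mse=np.mean((candidateImage - currentImage)**2). Alternatively, you can use from sklearn.metrics import mean_squared_error; mse = mean_squared_error(candidateImage.ravel(), currentImage.ravel()) (flattened here because, as far as I know, that function does not accept 3-D input).
I think you could solve it by computing similarity (for example, kneighbors with Euclidean distance) on feature vectors taken from the last layer of an ImageNet-pretrained network.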
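
To make the first comment concrete, here is a minimal sketch on two tiny made-up arrays (not data from the question). It shows that squaring the mean and taking the mean of the squares give different numbers. It also shows that, if the images happen to be stored as uint8 (an assumption, since the question does not state the dtype), the subtraction wraps around unless the arrays are cast to float first.

```python
import numpy as np

# Two tiny stand-in "images" (hypothetical values, chosen for easy arithmetic)
a = np.array([0, 10], dtype=np.uint8)
b = np.array([10, 0], dtype=np.uint8)

# Cast to float before subtracting so negative differences are representable
diff = a.astype(np.float64) - b.astype(np.float64)  # [-10., 10.]

square_of_mean = np.mean(diff) ** 2    # expression used in the question: 0.0
mean_of_squares = np.mean(diff ** 2)   # mean squared error: 100.0

# Without the cast, uint8 arithmetic wraps modulo 256
wrapped = a - b                        # [246, 10]

print(square_of_mean, mean_of_squares, wrapped)
```

With the expression from the question, two clearly different images can get a score of 0, so the ranking of "closest" images would be unreliable.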

Tags: python arrays numpy memory siamese-network

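【Solution 1】:

You can try running gc.collect() at the start of each loop iteration to trigger the garbage collector explicitly. In CPython most objects are freed as soon as their reference count drops to zero, but objects caught in reference cycles are not released until a collection runs. I am not sure your current del statement does what you want: del removes the names and decrements the reference counts, but it does not free memory that is still referenced elsewhere.

78400 * 2 * 28 * 28 * 3 = 368,793,600 elements, multiplied by the size of each element in bytes, which suggests that this is indeed a memory problem. My guess is that the freeze is the computer switching from RAM to the swap file on the drive, and using swap that intensively will bring any computer to a crawl.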
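
The arithmetic above can be spelled out for a few element types. This is only a sketch: the helper name array_bytes is made up for illustration, and the dtype of the asker's images is not stated in the question, so three common candidates are shown.

```python
import math
import numpy as np

def array_bytes(shape, dtype):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize

pair_shape = (78400, 2, 28, 28, 3)  # expected pairTrain shape from the question

for dtype in (np.uint8, np.float32, np.float64):
    gib = array_bytes(pair_shape, dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.2f} GiB")
```

That works out to roughly 0.34 GiB for uint8 and roughly 2.75 GiB for float64, and this counts only the final array, not the intermediate Python list or any temporary copies made while building it.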

Your images should also be loaded one at a time through a generator, rather than packed into a single array.

import gc
gc.collect()

filenames = ["a.jpg", "b.jpg"]
labels = ["a", "b"]

def image_loader(filenames):  # a generator function: it yields images one at a time
    for f in filenames:
        gc.collect()  # make super sure we're freeing the memory from the last image
        image = load_or_something(f)  # placeholder for your own image-loading code
        yield image

make_pairs(image_loader(filenames), labels)

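A generator works with for loops and similar constructs exactly like a list does, except that each item is produced on the spot instead of being held in memory. (This is a bit technical, but tl;dr: it is a list-like tool that only loads images on demand.) Note that make_pairs as written calls len(images) and indexes into images, neither of which a generator supports, so it would need to be adapted accordingly.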
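
The on-demand behaviour can be checked with a small self-contained sketch. The loader below is a hypothetical stand-in that returns a blank array and records which names were requested; real code would read the file from disk instead.

```python
import numpy as np

loaded = []  # records which files have actually been "loaded"

def load_image(name):
    """Hypothetical stand-in for real image loading; returns a blank image."""
    loaded.append(name)
    return np.zeros((28, 28, 3), dtype=np.uint8)

def image_loader(filenames):
    """Generator function: yields one image at a time."""
    for f in filenames:
        yield load_image(f)

loader = image_loader(["a.jpg", "b.jpg"])
loaded_before = list(loaded)  # still empty: creating the generator loads nothing

first = next(loader)          # only now is the first file loaded
print(loaded_before, loaded, first.shape)
```

Only one image is resident at a time this way, but anything the consumer appends to a list stays in memory regardless of how it was loaded.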

【Discussion】:

【Solution 2】:

I believe you can simplify this part of the code, which should also help with the running time.

#Now we need to iterate through all the other images    
for idxB in range(len(images)):
    candidateImage = images[idxB]
    #Calculate the mse and store all mses
    mse=np.mean(candidateImage - currentImage)**2
    mse_all.append(mse)

Instead of iterating over your data with a for loop, you can do this and let NumPy broadcast:

# assuming images is a numpy array with shape 5600,28,28,3
mse_all = np.mean( ((images - currentImage)**2).reshape(images.shape[0],-1), axis=1 )
# mse_all.shape 5600,
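
Building on the broadcast expression, the whole selection step from the question could be sketched as follows. Several things here are assumptions rather than part of the answer: the function name make_pair_indices is made up, an image is never paired with itself, the data is cast to float32 before subtracting, and the function returns index pairs instead of image copies so that the large (N*14, 2, 28, 28, 3) array never has to be built.

```python
import numpy as np

def make_pair_indices(images, labels, n_same=7):
    """Return (pairs, pair_labels), where pairs holds (idxA, idxB) rows."""
    flat = images.reshape(len(images), -1).astype(np.float32)
    classes = np.unique(labels)
    class_idx = {c: np.flatnonzero(labels == c) for c in classes}
    pairs, pair_labels = [], []
    for a in range(len(flat)):
        # MSE between image a and every image, via broadcasting
        mse = np.mean((flat - flat[a]) ** 2, axis=1)
        mse[a] = np.inf  # assumption: exclude the image itself
        for c in classes:
            idx = class_idx[c]
            same = c == labels[a]
            k = n_same if same else 1
            # keep only the k smallest MSEs of this class
            nearest = idx[np.argpartition(mse[idx], k - 1)[:k]]
            pairs.extend((a, b) for b in nearest)
            pair_labels.extend([int(same)] * k)
    return np.array(pairs), np.array(pair_labels)

# Small synthetic stand-in for the real data: 8 classes, 10 images each
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(80, 4, 4, 3), dtype=np.uint8)
labels = np.repeat(np.arange(8), 10)

pairs, pair_labels = make_pair_indices(images, labels)
print(pairs.shape, pair_labels.shape)
```

The image pairs for a training batch can then be gathered on demand, for example images[pairs[batch, 0]] and images[pairs[batch, 1]], where batch is an array of row numbers.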

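【Discussion】:

【Solution 3】:

Some possible issues and optimizations

Besides trying to force the garbage collector to free unused memory (which, from my attempts, does not seem to solve your problem), I think there are other issues in your code, unless I have misunderstood what is going on.

Looking at the following snippet: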

# Now I add the 7 most similar to that block: 7 from the same class and 1 from each of the other classes
for j in range(0,indices_sorted.shape[0]):

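You seem to be iterating j over indices_sorted.shape[0], which is always 700, and I am not sure that is what you want. I think you only need j to run over range(0, 7):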

for i in range(0,numClasses):
    # ...
    # ...
    #print(indices_sorted.shape) # -> (700,)
    #print(indices_sorted.shape[0]) # --> always 700! So why range(0, 700)?
    # Now I add the 7 most similar to that block: 7 from the same class and 1 from each of the other classes
    for j in range(0, 7):
        image_to_pair=np.array(imgs[indices_sorted[j], :, :, :], dtype=np.uint8)
        pairImages.append([currentImage, image_to_pair])
        
        if i==label:
            pairLabels.append([1])
        else:
            pairLabels.append([0])
    del imgs, idxs
# When we have finished finding the pairs for that image, let's clean up
del image_to_pair, currentImage, label, mse_that_class, indices_sorted, mse_all
gc.collect()

A simple, intuitive hint

Through some tests, I saw that by commenting out this line:

    pairImages.append([currentImage, image_to_pair])

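your memory footprint is almost 0.

Other notes

As a side note, I moved the del of imgs and idxs inside the i loop, and the main improvement here seems to come from forcing the right type: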

    image_to_pair=np.array(imgs[indices_sorted[j], :, :, :], dtype=np.uint8)
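
One plausible reason this line helps, offered as a sketch rather than a confirmed diagnosis: basic indexing such as imgs[j, :, :, :] returns a view that keeps the entire per-class array alive for as long as the view sits in pairImages, whereas np.array(...) makes a small independent copy. The float64 dtype of imgs below is an assumption; the question does not state it.

```python
import numpy as np

# Stand-in for the per-class array built inside the loop (dtype is assumed)
imgs = np.zeros((700, 28, 28, 3), dtype=np.float64)

view = imgs[5, :, :, :]                 # basic indexing: a view into imgs
copy = np.array(view, dtype=np.uint8)   # new array that owns its own data

print(view.base is imgs)   # True: the view references the full buffer
print(copy.base is None)   # True: the copy is independent of imgs
print(imgs.nbytes, copy.nbytes)
```

Appending the copy lets del imgs actually release the 700-image buffer, while appending the view would keep all of it reachable.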

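Results

According to my tests, with the original code the memory usage at k = 100 and k = 200 was 642351104 bytes and 783355904 bytes, respectively. So memory usage grew by about 134.5 MB over 100 iterations. After applying the modifications above, we have 345239552 B and 362336256 B, an increase of only about 16.3 MB (counting 1 MB as 1024² bytes).

【Discussion】: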