Find near-duplicate and faked images: answers

【Question title】: Find near duplicate and faked images
【Posted】: 2022-11-11 03:46:03
【Question】:

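I am using a perceptual hashing technique to find near-duplicate and exact-duplicate images. The code works perfectly for finding exact duplicates. However, finding near-duplicate and slightly modified images seems difficult, because the difference score between their hashes is generally similar to the hash difference of completely different random images.

To tackle this, I tried reducing the near-duplicate images to 50x50 pixels and making them black/white, but I still don't get what I need (a small difference score).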

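This is a sample of a near-duplicate image pair:

Image 1 (a1.jpg):

Image 2 (b1.jpg):

The difference between the hash scores of these images is: 24

When pixelated (50x50 pixels), they look like this:

rs_a1.jpg

rs_b1.jpg

The hash difference score of the pixelated images is even bigger: 26

Below are two more examples of near-duplicate image pairs, as requested by @ann zen:

Pair 1

Pair 2

The code I use to reduce the image size is this: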

from PIL import Image

# Shrink to 50x50 pixels, then convert to 1-bit black/white
with Image.open(image_path) as image:
    reduced_image = image.resize((50, 50)).convert('RGB').convert("1")

And the code to compare two image hashes:

from PIL import Image
import imagehash

with Image.open(image1_path) as img1:
    hashing1 = imagehash.phash(img1)
with Image.open(image2_path) as img2:
    hashing2 = imagehash.phash(img2)
print('difference :  ', hashing1 - hashing2)
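For context, subtracting two `imagehash` hashes gives a Hamming distance: the number of bits in which they differ. The sketch below reproduces that count with the standard library only, assuming 64-bit hashes (the `phash` default) written as hex strings. The helper name `hamming_distance` and both hex values are made up for illustration and are not taken from the images above.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count the bits that differ between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical 64-bit hashes (16 hex digits each)
hash_a = "ffd8e0c0c0e0f8ff"
hash_b = "ffd8e0c0c0e0f8fe"

distance = hamming_distance(hash_a, hash_b)
n_bits = len(hash_a) * 4  # each hex digit encodes 4 bits
print("differing bits:", distance, "of", n_bits)
print("normalized distance:", distance / n_bits)
```

If the 64 bits behaved like independent coin flips, two unrelated images would differ in about 32 bits on average, so scores of 24 and 26 are closer to "unrelated" than to "same image", which matches the observation in the question.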

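【Question discussion】:

Usually such tasks are done using deep learning models. Is there any reason or constraint that makes you use this "statistical" approach?
@AbhinavMathur I need to find edited/forged/tweaked images in a dataset of 10 million images. Finding exact duplicates is easy with hashing algorithms such as phash, but I couldn't find a way to find the near-duplicate/edited ones.
Maybe compute the cross-correlation between the two nearly identical images; it should be a more robust similarity metric than per-pixel hashing.
@Youcef Where did you get these pairs of near-similar images? Is there a repository somewhere?
@nathancy No, I just collected some samples from a Google search.

Tags: python opencv image-processing computer-vision template-matching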

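【Solution 1】:

Here is a quantitative method for determining duplicate and near-duplicate images using the sentence-transformers library, which provides an easy way to compute dense vector representations of images. We can use the OpenAI Contrastive Language-Image Pre-Training (CLIP) model, a neural network already trained on a variety of (image, text) pairs. To find duplicate and near-duplicate images, we encode all images into a vector space and then look for high-density regions, which correspond to areas where the images are fairly similar.

When two images are compared, they are given a score between 0 and 1.00. We can use a threshold parameter to identify two images as similar or different. Setting the threshold lower gives larger clusters that contain less similar images. Duplicate images have a score of 1.00, meaning the two images are exactly the same. To find near-duplicate images, we can set the threshold to any arbitrary value, say 0.9. For instance, if the score between two images is greater than 0.9, we can conclude that they are near-duplicate images.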
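To make the thresholding concrete, here is a minimal sketch of scoring two embedding vectors with cosine similarity and labelling the pair. The three-dimensional vectors are toy values rather than real CLIP embeddings, and the helper names `cosine_similarity` and `classify` as well as the `1e-6` tolerance for "exactly the same" are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(score, threshold=0.9):
    """Label a pair from its similarity score."""
    if score >= 1.0 - 1e-6:  # tolerate floating-point rounding
        return "duplicate"
    if score >= threshold:
        return "near-duplicate"
    return "different"


# Toy stand-ins for image embeddings
original = [1.0, 0.0, 0.0]
slightly_edited = [0.9, 0.1, 0.0]
unrelated = [3.0, 4.0, 0.0]

for other in (original, slightly_edited, unrelated):
    score = cosine_similarity(original, other)
    print(round(score, 4), classify(score))
```

The tolerance matters in practice because floating-point rounding can leave the score of two identical images marginally below 1.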

An example:

This dataset has 5 images. Note that cat #1 has a duplicate, while the others are different.

Finding duplicate images

Score: 100.000%
.cat1 copy.jpg
.cat1.jpg

Both cat1 and its copy are the same.

Finding near-duplicate images

Score: 91.116%
.cat1 copy.jpg
.cat2.jpg

Score: 91.116%
.cat1.jpg
.cat2.jpg

Score: 91.097%
.ear1.jpg
.ear2.jpg

Score: 59.086%
.ear2.jpg
.cat2.jpg

Score: 56.025%
.ear1.jpg
.cat2.jpg

Score: 53.659%
.ear1.jpg
.cat1 copy.jpg

Score: 53.659%
.ear1.jpg
.cat1.jpg

Score: 53.225%
.ear2.jpg
.cat1.jpg

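We get more interesting results when comparing scores between different images. The higher the score, the more similar the images; the lower the score, the less similar. Using a threshold of 0.9 or 90%, we can filter out the near-duplicate images.

Comparison between just two images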

Score: 91.097%
.ear1.jpg
.ear2.jpg

Score: 91.116%
.cat1.jpg
.cat2.jpg

Score: 93.715%
.\tower1.jpg
.\tower2.jpg

Code

from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import os

# Load the OpenAI CLIP Model
print('Loading CLIP Model...')
model = SentenceTransformer('clip-ViT-B-32')

# Next we compute the embeddings
# To encode an image, you can use the following code:
# from PIL import Image
# encoded_image = model.encode(Image.open(filepath))
image_names = list(glob.glob('./*.jpg'))
print("Images:", len(image_names))
encoded_image = model.encode([Image.open(filepath) for filepath in image_names], batch_size=128, convert_to_tensor=True, show_progress_bar=True)

# Now we run the clustering algorithm. This function compares images against
# all other images and returns a list with the pairs that have the highest
# cosine similarity score
processed_images = util.paraphrase_mining_embeddings(encoded_image)
NUM_SIMILAR_IMAGES = 10 

# =================
# DUPLICATES
# =================
print('Finding duplicate images...')
# Filter list for duplicates. Results are triplets (score, image_id1, image_id2) and are sorted in decreasing order
# A duplicate image will have a score of 1.00
duplicates = [image for image in processed_images if image[0] >= 1]

# Output the top X duplicate images
for score, image_id1, image_id2 in duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])

# =================
# NEAR DUPLICATES
# =================
print('Finding near duplicate images...')
# Use a threshold parameter to identify two images as similar. By setting the threshold lower, 
# you will get larger clusters which have less similar images in it. Threshold 0 - 1.00
# A threshold of 1.00 means the two images are exactly the same. Since we are finding near 
# duplicate images, we can set it at 0.99 or any number 0 < X < 1.00.
threshold = 0.99
near_duplicates = [image for image in processed_images if image[0] < threshold]

for score, image_id1, image_id2 in near_duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])
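The answer talks about clusters, while the code prints pairs. One possible way to turn the (score, image_id1, image_id2) triplets into groups of near-duplicates is a small union-find pass over all pairs at or above a threshold. This is a sketch and not part of the original answer: the function name `group_pairs` is hypothetical, the scores are copied from the example output above, and the index-to-file assignment is assumed.

```python
def group_pairs(pairs, n_items, threshold=0.9):
    """Group item indices linked by pairs scoring at or above the threshold."""
    parent = list(range(n_items))

    def find(i):
        # Walk up to the root, shortening the path along the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for score, i, j in pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups with at least two members
    return [members for members in groups.values() if len(members) > 1]


# Assumed index order for the five example files
names = ["cat1 copy.jpg", "cat1.jpg", "cat2.jpg", "ear1.jpg", "ear2.jpg"]
pairs = [
    (1.0, 0, 1),
    (0.91116, 0, 2),
    (0.91116, 1, 2),
    (0.91097, 3, 4),
    (0.59086, 4, 2),
    (0.56025, 3, 2),
]

for members in group_pairs(pairs, len(names)):
    print([names[i] for i in members])
```

With a threshold of 0.9 this yields one group with the three cat files and one with the two ear files. Note that grouping is transitive: two images can land in the same group through a chain of links even if their own pairwise score is below the threshold.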

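【Discussion】:

Great answer. The following question is related, but unfortunately has no detailed answer and was even downvoted: stackoverflow.com/questions/64520940/…

【Solution 2】:

Rather than pixelating the images before finding the difference/similarity between them, simply give them some blur using the cv2.GaussianBlur() method, and then use the cv2.matchTemplate() method to find the similarity between them: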

import cv2
import numpy as np

def process(img):
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(img_gray, (43, 43), 21)

def confidence(img1, img2):
    res = cv2.matchTemplate(process(img1), process(img2), cv2.TM_CCOEFF_NORMED)
    return res.max()

img1s = list(map(cv2.imread, ["img1_1.jpg", "img1_2.jpg", "img1_3.jpg"]))
img2s = list(map(cv2.imread, ["img2_1.jpg", "img2_2.jpg", "img2_3.jpg"]))

for img1, img2 in zip(img1s, img2s):
    conf = confidence(img1, img2)
    print(f"Confidence: {round(conf * 100, 2)}%")

Output:

Confidence: 83.6%
Confidence: 84.62%
Confidence: 87.24%
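When both inputs have the same size, cv2.matchTemplate() returns a single value, and with cv2.TM_CCOEFF_NORMED that value is the zero-mean normalized cross-correlation of the two pixel arrays. The sketch below computes the same quantity in plain Python on tiny made-up 2x2 "images" (nested lists); the helper name `zncc` and the pixel values are illustrative only.

```python
import math


def zncc(img_a, img_b):
    """Zero-mean normalized cross-correlation of two equal-size grayscale
    images given as nested lists of pixel values."""
    xs = [p for row in img_a for p in row]
    ys = [p for row in img_b for p in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same number of pixels")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den  # undefined (division by zero) for a constant image


image = [[10, 20], [30, 40]]
brighter = [[60, 70], [80, 90]]  # same pattern, uniformly brighter
inverted = [[40, 30], [20, 10]]  # reversed pattern

print(zncc(image, brighter))
print(zncc(image, inverted))
```

Because the mean is subtracted and the result is normalized, a uniform brightness change leaves the score at 1, while a reversed pattern gives -1. That range of -1 to 1 is why a confidence computed this way can be negative for unrelated images.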

Here are the images used for the program above:

img1_1.jpg & img2_1.jpg:

img1_2.jpg & img2_2.jpg:

img1_3.jpg & img2_3.jpg:

To show that the blur doesn't produce badly-off false positives, I ran this program:

import cv2
import numpy as np

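def process(img):
    h, w, _ = img.shape
    # Resize to a fixed width of 350 px. Caution: h * w // 350 does not
    # preserve the aspect ratio; h * 350 // w would.
    img = cv2.resize(img, (350, h * w // 350))
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(img_gray, (43, 43), 21)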

def confidence(img1, img2):
    res = cv2.matchTemplate(process(img1), process(img2), cv2.TM_CCOEFF_NORMED)
    return res.max()

img1s = list(map(cv2.imread, ["img1_1.jpg", "img1_2.jpg", "img1_3.jpg"]))
img2s = list(map(cv2.imread, ["img2_1.jpg", "img2_2.jpg", "img2_3.jpg"]))

for i, img1 in enumerate(img1s, 1):
    for j, img2 in enumerate(img2s, 1):
        conf = confidence(img1, img2)
        print(f"img1_{i} img2_{j} Confidence: {round(conf * 100, 2)}%")

Output:

img1_1 img2_1 Confidence: 84.2% # Corresponding images
img1_1 img2_2 Confidence: -10.86%
img1_1 img2_3 Confidence: 16.11%
img1_2 img2_1 Confidence: -2.5%
img1_2 img2_2 Confidence: 84.61% # Corresponding images
img1_2 img2_3 Confidence: 43.91%
img1_3 img2_1 Confidence: 14.49%
img1_3 img2_2 Confidence: 59.15%
img1_3 img2_3 Confidence: 87.25% # Corresponding images

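Notice that the program outputs high confidence levels (84+%) only when an image is matched with its corresponding image.

For comparison, here are the results without blurring the images: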

import cv2
import numpy as np

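def process(img):
    h, w, _ = img.shape
    # Resize to a fixed width of 350 px. Caution: h * w // 350 does not
    # preserve the aspect ratio; h * 350 // w would.
    img = cv2.resize(img, (350, h * w // 350))
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)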

def confidence(img1, img2):
    res = cv2.matchTemplate(process(img1), process(img2), cv2.TM_CCOEFF_NORMED)
    return res.max()

img1s = list(map(cv2.imread, ["img1_1.jpg", "img1_2.jpg", "img1_3.jpg"]))
img2s = list(map(cv2.imread, ["img2_1.jpg", "img2_2.jpg", "img2_3.jpg"]))

for i, img1 in enumerate(img1s, 1):
    for j, img2 in enumerate(img2s, 1):
        conf = confidence(img1, img2)
        print(f"img1_{i} img2_{j} Confidence: {round(conf * 100, 2)}%")

Output:

img1_1 img2_1 Confidence: 66.73%
img1_1 img2_2 Confidence: -6.97%
img1_1 img2_3 Confidence: 11.01%
img1_2 img2_1 Confidence: 0.31%
img1_2 img2_2 Confidence: 65.33%
img1_2 img2_3 Confidence: 31.8%
img1_3 img2_1 Confidence: 9.57%
img1_3 img2_2 Confidence: 39.74%
img1_3 img2_3 Confidence: 61.16%
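One way to read the two result tables side by side is to compare, for each, the weakest score among corresponding pairs with the strongest score among non-corresponding pairs. The sketch below does that with the numbers copied from the two outputs above; the helper name `separation` is hypothetical.

```python
def separation(matrix):
    """Return (lowest corresponding score, highest non-corresponding score).

    matrix[i][j] is the confidence for img1_(i+1) against img2_(j+1),
    so the diagonal holds the corresponding pairs.
    """
    n = len(matrix)
    lowest_match = min(matrix[i][i] for i in range(n))
    highest_mismatch = max(
        matrix[i][j] for i in range(n) for j in range(n) if i != j
    )
    return lowest_match, highest_mismatch


# Confidence values (in %) from the outputs above
blurred = [
    [84.2, -10.86, 16.11],
    [-2.5, 84.61, 43.91],
    [14.49, 59.15, 87.25],
]
unblurred = [
    [66.73, -6.97, 11.01],
    [0.31, 65.33, 31.8],
    [9.57, 39.74, 61.16],
]

for label, matrix in (("blurred", blurred), ("unblurred", unblurred)):
    lowest_match, highest_mismatch = separation(matrix)
    print(label, "lowest match:", lowest_match, "highest mismatch:", highest_mismatch)
```

For these three pairs, any threshold between the two returned values separates matches from non-matches: between 59.15 and 84.2 with blur, and between 39.74 and 61.16 without. Three pairs are far too few to pick a threshold for a large dataset, so this is only a way to inspect the numbers.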

【Discussion】:

Probably the simplest answer, and probably the best starting point, IMHO.