Identifying and collecting potential duplicates in a multidimensional array (answers)

【Question title】: Identifying and collecting potential duplicates in multidimensional array
【Posted】: 2021-08-01 06:02:51
【Question】:

I have an array like the one below, and I want to identify the duplicates in it:

$names = array(
    array("Name" => "John Smith",      "ID" => 65), 
    array("Name" => "Richard Johnson", "ID" => 96), 
    array("Name" => "John Smith",      "ID" => 1105),
    ...
)

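There are many similar questions, but most of them only deal with returning true/false when a duplicate exists, or with removing the duplicates outright. I apologize if this has already been asked; I looked but could not find a question that fits my case.

I just want to identify pairs of arrays that contain the same "Name" value but possibly different "ID" values. I know this may report the same duplicate more than once, but I think I can sort that out myself. I don't want to remove the duplicated values, only identify them.

Ideally it would return an array like the following (or something similar):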

$dupes = array(
    array(
        array("Name" => "John Smith", "ID" => 65),
        array("Name" => "John Smith", "ID" => 1105)
    )
);

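I could then process that into a more refined, user-friendly array.

I was thinking of using a recursive in_array function, or possibly a second working array. Any ideas?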

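【Comments】:

Have you tried a simple foreach loop first? That should always be your go-to solution; you can refactor to use built-in functions afterwards.
@El_Vanja I'm trying one now, but the array has 16k+ entries, and my only idea is to compare every name against every other name, but that would be something like 16k^16k operations
That is definitely not a good approach, but you don't need to compare them all against each other. Grouping can be applied to achieve what you need (see the answers).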
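
As a rough illustration of the grouping idea from the last comment, the sketch below counts how often each name occurs in a single pass instead of comparing every pair. It assumes the names are plain strings (which `array_count_values` requires) and reuses the sample data from the question; `$counts` and `$repeatedNames` are names made up for this example.

```php
<?php
// Sample data in the same shape as the question's $names array.
$names = [
    ['Name' => 'John Smith',      'ID' => 65],
    ['Name' => 'Richard Johnson', 'ID' => 96],
    ['Name' => 'John Smith',      'ID' => 1105],
];

// Pull out the "Name" column and count the occurrences of each value.
$counts = array_count_values(array_column($names, 'Name'));

// Keep only the names that occur more than once.
$repeatedNames = array_keys(array_filter($counts, function ($count) {
    return $count > 1;
}));

print_r($repeatedNames);
```

This only tells you which names repeat; collecting the matching records still needs a grouping loop like the ones in the answers.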

Tags: php arrays duplicates

【Solution 1】:

Why not just loop over the array and build a new array keyed by name? You can test the following here: http://phptester.net/

$names = array(
    array("Name" => 'John Smith', "ID" => 65), 
    array("Name" => 'Richard Johnson', "ID" => 96), 
    array("Name" => 'John Smith', "ID" => 1105)
);

$users = [];
foreach($names as $usersArray){
    
    $users[$usersArray['Name']]['ids'][] = $usersArray['ID'];
    
}

print_r($users);

Or, more simply:

$names = array(
    array("Name" => 'John Smith', "ID" => 65), 
    array("Name" => 'Richard Johnson', "ID" => 96), 
    array("Name" => 'John Smith', "ID" => 1105)
);

$users = [];
foreach($names as $usersArray){
    
    $users[$usersArray['Name']][] = $usersArray['ID'];
    
}

print_r($users);

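【Discussion】:

All that is left to do after this is to filter the result by the number of IDs under each name (this is implied, but it is better to be explicit for any future beginner readers).
That works well too! It also doesn't cause any memory problems and seems to run very fast. Thanks!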
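
The filtering step mentioned in the first comment could look roughly like the sketch below. It is a variation on the answer rather than the answerer's own code: it groups the complete records (not just the IDs) so that the result has the `$dupes` shape requested in the question. The variable names `$groups` and `$dupes` are chosen for this example.

```php
<?php
$names = [
    ['Name' => 'John Smith',      'ID' => 65],
    ['Name' => 'Richard Johnson', 'ID' => 96],
    ['Name' => 'John Smith',      'ID' => 1105],
];

// Group the complete records by name in one pass.
$groups = [];
foreach ($names as $record) {
    $groups[$record['Name']][] = $record;
}

// Keep only the groups holding more than one record, then drop the
// name keys so the result is a plain list of duplicate groups.
$dupes = array_values(array_filter($groups, function ($group) {
    return count($group) > 1;
}));

print_r($dupes);
```

If the name is useful as a lookup key later on, leaving out the `array_values` call keeps the groups keyed by name.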

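【Solution 2】:

For a case like this you generally want to define some logic that creates a hash from the values in a record to determine equality. Once that is defined, you can use a simple loop and an associative array to keep track of which records have duplicates.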

<?php
/**
 * Define an algorithm for equality between records.
 *
 * @param $record
 * @return string
 */
function generateHashForUserRecord($record)
{
    return sha1($record['Name']);
}

$names = [
    ['Name' => 'John Smith', 'ID' => 65],
    ['Name' => 'Richard Johnson', 'ID' => 96],
    ['Name' => 'John Smith', 'ID' => 1105]
];

// This map will be populated with all records, keyed by hash
$hashBuffer = [];

// Buffer for hashes that are associated with more than one record
$duplicateHashes = [];

// This will be populated with the duplicate records
$duplicateRecords = [];

// Iterate through all of the records
foreach($names as $currRecord)
{
    // Generate a hash for the record
    $currHash = generateHashForUserRecord($currRecord);

    // If the hash is not in the hashtable yet, create an array to hold entries with this hash
    if(!array_key_exists($currHash, $hashBuffer))
    {
        $hashBuffer[$currHash] = [];
    }
    else // If this hash is already in the buffer, we have a duplicate - add it to the $duplicateHashes array
    {
        // Key by the hash itself so that each duplicated hash is recorded exactly once
        $duplicateHashes[$currHash] = $currHash;
    }

    // Add the record to the hash buffer
    $hashBuffer[$currHash][] = $currRecord;
}

foreach($duplicateHashes as $currDuplicateHash)
{
    $duplicateRecords = array_merge($duplicateRecords, $hashBuffer[$currDuplicateHash]);
}

print_r($duplicateRecords);

That is a lot of ugly procedural code, so it may be a good idea to wrap it in some kind of helper class.

<?php

$names = [
    ['Name' => 'John Smith', 'ID' => 65],
    ['Name' => 'Richard Johnson', 'ID' => 96],
    ['Name' => 'John Smith', 'ID' => 1105]
];

$duplicateRecords = UserRecordHelper::getDuplicateRecords($names);

print_r($duplicateRecords);

class UserRecordHelper
{
    public static function getDuplicateRecords($records)
    {
        // This map will be populated with all records, keyed by hash
        $hashBuffer = [];

        // Buffer for hashes that are associated with more than one record
        $duplicateHashes = [];

        // This will be populated with the duplicate records
        $duplicateRecords = [];


        // Iterate through all of the records
        foreach ($records as $currRecord)
        {
            // Generate a hash for the record
            $currHash = self::generateHashForUserRecord($currRecord);

            // If the hash is not in the hashtable yet, create an array to hold entries with this hash
            if (!array_key_exists($currHash, $hashBuffer))
            {
                $hashBuffer[$currHash] = [];
            }
            else // If this hash is already in the buffer, we have a duplicate - add it to the $duplicateHashes array
            {
                // Key by the hash itself so that each duplicated hash is recorded exactly once
                $duplicateHashes[$currHash] = $currHash;
            }

            // Add the record to the hash buffer
            $hashBuffer[$currHash][] = $currRecord;
        }

        foreach ($duplicateHashes as $currDuplicateHash)
        {
            $duplicateRecords = array_merge($duplicateRecords, $hashBuffer[$currDuplicateHash]);
        }

        return $duplicateRecords;
    }

    public static function generateHashForUserRecord($record)
    {
        return sha1($record['Name']);
    }
}

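【Discussion】:

I don't understand the purpose of the hash. Names can be compared directly.
@El_Vanja It does violate the YAGNI principle, but on the other hand... you might end up needing it.
Yes, in this case you could do that, but usually you'll have more logic: more fields, and so on. You would probably at least want to upper- or lowercase all the characters before hashing. You could even return the bare name string from the function. The point is that you want a single, well-defined place that determines equality between records. A hash that determines equality within a collection is the usual way of doing this.
This seems to work well! The only problem I ran into was that it tried to allocate too much memory, but that is reasonable when applying it to 16,000 rows at once. I'll just do the comparison in large chunks. Thanks!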
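
As a sketch of the points raised in the last two comments, the example below uses a single function that returns a normalized name string as the equality key, and it buffers only the IDs instead of whole records, which should keep less data in memory. The function `normalizedNameKey` and the variable names are hypothetical, the third sample record is altered to show the normalization, and `strtolower` is assumed to be enough because the sample names are ASCII.

```php
<?php
/**
 * Hypothetical helper: the single place that decides when two records are "equal".
 * It trims the name, collapses runs of whitespace, and lowercases the result.
 */
function normalizedNameKey(array $record): string
{
    $name = preg_replace('/\s+/', ' ', trim($record['Name']));
    return strtolower($name);
}

$names = [
    ['Name' => 'John Smith',      'ID' => 65],
    ['Name' => 'Richard Johnson', 'ID' => 96],
    ['Name' => ' john  smith',    'ID' => 1105], // same person, messy input
];

// Buffer only the IDs under each key rather than the complete records.
$idsByKey = [];
foreach ($names as $record) {
    $idsByKey[normalizedNameKey($record)][] = $record['ID'];
}

// Keys with more than one ID are the potential duplicates.
$duplicateIds = array_filter($idsByKey, function ($ids) {
    return count($ids) > 1;
});

print_r($duplicateIds);
```

Names containing non-ASCII characters would probably call for `mb_strtolower` (from the mbstring extension) instead.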