Increase performance of PowerShell function removing duplicates from CSV: answers

【Question title】: Increase performance of PowerShell function removing duplicates from CSV
【Posted】: 2019-11-09 20:45:26
【Question description】:

I need to use PowerShell to solve a problem involving a large dataset stored in a CSV file. I need to read the CSV into memory and remove all duplicates from it.

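Beyond the constraints of using PowerShell and working in memory, the main issue is that I have to evaluate specific columns to identify duplicates rather than the whole row.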

In addition, I need to keep the oldest entry among the duplicates, based on the column that holds the first-observed date.

I have tried a few different things, such as Sort-Object with -Unique.

The datasets in the CSV typically contain 1-5 million rows, and the columns look similar to:

"LastObserved","FirstObserved","ComputerName","ComputerID","Virtual","ComputerSerialID"

function Request-Dedupe ($data) {
    try {
        Write-Log -Message "Cycling through data to remove duplicates"

        $dedupe_data = @()
        $i = 0
        $n = 0
        foreach ($obj in $data |Sort-Object -Property FirstObserved) {
            if ($obj.ComputerSerialID -notin $dedupe_data.ComputerSerialID -and $obj.ComputerID -notin $dedupe_data.ComputerID) {
                $dedupe_data += $obj
                if ($current_data.ComputerID -contains $obj.ComputerID) {
                   $dedupe_data[$n].LastObserved = $current_time
                }
                $n ++
            }
            Write-Progress -Activity "Cycling through data to remove duplicates and correlate first observed time" -Status "$i items processed" -PercentComplete ([Double]$i / $data.count*100)
            $i ++
        }

        Write-Log -Message "Dedupe Complete"
        return $dedupe_data
    } catch {
        Write-Log -Level Error "Unable to sort and dedupe data"
    }
}
$current_time = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ss")
$current_data = Import-Csv .\UniqueSystems.csv
$test = Request-Dedupe $current_data

My goal is to speed up the above, possibly by leveraging C#.

The expected output is the CSV data with all duplicates removed, keeping the record with the earliest "FirstObserved" date for each duplicate found.

【Question discussion】:

Tags: performance function powershell csv large-data

【Solution 1】:

To improve performance, you should avoid appending to an array and avoid doing lookups in an array. Both are slow operations.
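
To make the cost concrete: `+=` on an array allocates a new array and copies every element on each append, and `-notin` against an array scans it linearly. The sketch below shows one possible alternative, assuming a generic `List` for accumulation and a `HashSet` for membership tests. The sample IDs are made up, and the case-insensitive comparer is an assumption chosen to mirror PowerShell's default string comparison.

```powershell
# Hypothetical sample IDs, including duplicates.
$ids = 'A1', 'B2', 'A1', 'C3', 'B2'

# A List grows in place; array += would reallocate and copy on every append.
$kept = [System.Collections.Generic.List[string]]::new()

# A HashSet answers "already seen?" in roughly constant time,
# whereas -notin on an array scans every element.
$seen = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)

foreach ($id in $ids) {
    # Add() returns $true only when the value was not already in the set.
    if ($seen.Add($id)) {
        $kept.Add($id)
    }
}

$kept   # A1, B2, C3
```

The `::new()` syntax requires PowerShell 5 or later; on older versions `New-Object` would be needed instead.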

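If I understand your question correctly, you want to keep one record per unique combination of "ComputerID" and "ComputerSerialID", namely the one with the earliest "FirstObserved" value. This can be done with a hashtable like this: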

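# Map each ComputerID/ComputerSerialID pair to the record to keep.
$unique = @{}
Import-Csv .\UniqueSystems.csv | ForEach-Object {
    $key = '{0}/{1}' -f $_.ComputerID, $_.ComputerSerialID
    # Store the record if the key is new or this record was observed earlier.
    # FirstObserved is compared as a string, which orders correctly only for
    # sortable formats such as yyyy-MM-ddTHH:mm:ss.
    if (-not $unique.Contains($key) -or $unique[$key].FirstObserved -gt $_.FirstObserved) {
        $unique[$key] = $_
    }
}
# Emit the deduplicated records.
$unique.Values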
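
Note that the hashtable above treats two records as duplicates only when both "ComputerID" and "ComputerSerialID" match. The function in the question instead drops a record when either value has already been kept. If that behavior is what is wanted, a variant along the following lines may work. It is a sketch, not a tested drop-in replacement: the inline sample rows are hypothetical stand-ins for UniqueSystems.csv, and it assumes "FirstObserved" can be cast to `[datetime]`.

```powershell
# Hypothetical sample rows standing in for UniqueSystems.csv.
$rows = @'
"LastObserved","FirstObserved","ComputerName","ComputerID","Virtual","ComputerSerialID"
"2019-11-01T00:00:00","2019-03-01T00:00:00","pc1","1","False","S1"
"2019-11-01T00:00:00","2019-01-01T00:00:00","pc1","1","False","S1"
"2019-11-01T00:00:00","2019-02-01T00:00:00","pc2","2","False","S1"
"2019-11-01T00:00:00","2019-04-01T00:00:00","pc3","3","True","S3"
'@ | ConvertFrom-Csv

# Case-insensitive sets of the IDs and serials of records already kept.
$seenIds     = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)
$seenSerials = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)
$result      = [System.Collections.Generic.List[object]]::new()

# Sort oldest first so the first record kept for an ID or serial is the earliest one.
foreach ($row in $rows | Sort-Object { [datetime]$_.FirstObserved }) {
    # Keep the row only if neither its ID nor its serial has been kept before.
    if ((-not $seenIds.Contains($row.ComputerID)) -and (-not $seenSerials.Contains($row.ComputerSerialID))) {
        [void]$seenIds.Add($row.ComputerID)
        [void]$seenSerials.Add($row.ComputerSerialID)
        $result.Add($row)
    }
}

$result   # two records: ComputerID 1 (first observed 2019-01-01) and ComputerID 3
```

With the same sample rows, the hashtable approach would keep three records, because the keys 1/S1 and 2/S1 differ even though the serial is shared.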

【Discussion】: