PowerShell: improving LDIF file to CSV conversion

【Question title】: PowerShell: improving LDIF file to CSV conversion
【Posted】: 2020-03-24 12:44:51
【Question】:

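I have the following code that converts an LDIF file (more than 100,000 lines) into a CSV file (more than 4,000 lines), but I'm not sure I'm happy with how long it takes. Then again, I don't know how long it should really take; maybe this is a normal time on my laptop (7th-gen Core i5, 16 GB RAM, SSD)?

Is there room for improvement? (Especially in the parsing, if possible, which takes 30 seconds.)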

# Reducing & editing data to process:
# -----------------------------------
$original = Get-Content $IN_ldif_file
$reduced = (($original | select-string -pattern '^cust[A-Z]','^$' -CaseSensitive).Line) -replace ':: ', ': ' -replace '^cust',''
"Writing reduced LDIF file..." # < 1 sec
(Measure-Command { Set-Content $reducedLDIF -Value $reduced -Encoding UTF8 }).TotalSeconds

# Parsing the relevant data:
# --------------------------
$inData = New-Object -TypeName System.IO.StreamReader -ArgumentList $reducedLDIF
$a = @{}                # initialize the temporary hash
$lineNum = $rcdNum = 0  # initialize the counters
"Parsing reduced LDIF file..." # 27-36 sec
(Measure-Command { 
    # Begin reading and processing the input file:
    $results = while (-not $inData.EndOfStream)
    {
        $line = $inData.ReadLine()
        Write-Verbose "$("{0:D4}" -f ++$lineNum)|$("{0:D4}|" -f $rcdNum)$line"

        if (($line -match "^\s*$") -or $inData.EndOfStream )
        {
            # blank line or end of stream - dump the hash as an object and reinit the hash
            [PSCustomObject]$a
            $a = @{}
            $rcdNum++
        } else {
            # build up hash table for the object (split on the first ": " only, so values keep any later ": ")
            $key, $value = $line -split ": ", 2
            $a[$key] = $value
        }
    }
    $inData.Close()
}).TotalSeconds

# Populating & writing the CSV file:
# ----------------------------------
"Populating the CSV data..." # 7-11 sec
(Measure-Command { 
    $out = $results |
        select  "Attribute01",
                "Attribute02",
                "Attribute03",
                <# etc... #>
                @{n="Attribute39"; E={$_."Attribute20"}}, # Attribute39 (not in LDIF) takes value of Attribute20
                "Attribute40"
}).TotalSeconds

"Writing CSV file..." # < 1 sec
(Measure-Command { $out | Export-CSV $OUT_csv_file -NoTypeInformation }).TotalSeconds

Note: I don't actually need to export the "$reduced" data to a file (i.e. "$reducedLDIF"), but the parsing snippet I found seemed to require a file.

Thanks!

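【Question comments】:

Select-String can read the file itself, without running Get-Content first. Just use its -Path parameter and pass in the file name.
Why match ^$ in Select-String when you want to remove the blank lines later?
Curious: is the LDIF file the actual target of the underlying work, or are you ultimately trying to get a report of specific pieces of data from AD or a different LDAP database? If so, there are almost certainly easier ways to get that information from the database than parsing and converting an LDIF file.
Thanks, AdminOfThings. Using Select-String directly does simplify the code, but I don't see any change in performance. I need to match on ^$ because I want the LDIF content to keep at least one blank line between entries (otherwise the parsing gets very complicated).
@thepip3r: My goal is to extract data from an OpenLDAP server on Linux and format it into a specific CSV layout. The current code does that fine; I just feel it is still slow, or perhaps my expectations are unrealistic given the amount of data and my hardware.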
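
The first comment's suggestion (letting Select-String read the file through -Path) could look roughly like the sketch below. It is only an illustration, not the asker's code. The temp file, the sample entries and the attribute names custName and custNote are invented so the snippet runs on its own, and, as the asker reports, no speed-up should be expected from this change alone.

```powershell
# Hypothetical sample LDIF, written to a temp file so the snippet is self-contained.
$sampleLdif = Join-Path ([System.IO.Path]::GetTempPath()) 'sample.ldif'
Set-Content -Path $sampleLdif -Encoding UTF8 -Value @(
    'dn: uid=1,ou=people,dc=example,dc=com'
    'custName: Alice'
    'custNote:: SGVsbG8='   # "::" marks a base64 value in LDIF; it is not decoded here
    ''
    'dn: uid=2,ou=people,dc=example,dc=com'
    'custName: Bob'
    ''
)

# Select-String reads the file itself via -Path, so Get-Content is not needed.
# The '^$' pattern keeps the blank separator lines, so entry boundaries survive.
$lines   = (Select-String -Path $sampleLdif -Pattern '^cust[A-Z]', '^$' -CaseSensitive).Line
$reduced = $lines -replace ':: ', ': ' -replace '^cust', ''

$reduced
Remove-Item $sampleLdif
```

The two -replace steps are the same ones the question applies; only the way the file is read differs.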

Tags: performance powershell csv ldif

【Solution 1】:

So I found a way to cut the parsing time nearly in half by reusing the data that is already in memory in the $reduced variable:

$a = @{}                # initialize the temporary hash
$lineNum = $rcdNum = 0  # initialize the counters
"Parsing reduced LDIF file..."
(Measure-Command { 
    $results = ForEach ($line in $reduced) {
        Write-Verbose "$("{0:D6}" -f ++$lineNum)|$("{0:D4}|" -f $rcdNum)$line"
        if ($line -match "^\s*$")
        {   # blank line - dump the hash as an object and reinit the hash
            [PSCustomObject]$a
            $a = @{}
            $rcdNum++
        }
        else {
            # build up hash table for the object (split on the first ": " only)
            $key, $value = $line -split ": ", 2
            $a[$key] = $value
        }
    }
}).TotalSeconds

This is already more acceptable (about 16 seconds instead of 30).
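
A few further tweaks may be worth measuring. They are untested against the real data, and this is a sketch rather than the answerer's method. It drops the per-line Write-Verbose call (its string is formatted on every line even when verbose output is off), limits -split to two parts so a value containing ": " stays intact, and flushes the last record in case the input does not end with a blank line. The sample $reduced array and its attribute values are made up for illustration.

```powershell
# Hypothetical stand-in for the $reduced array built earlier in the script.
$reduced = @(
    'Attribute01: Alice'
    'Attribute02: note: has a colon'
    ''
    'Attribute01: Bob'
    'Attribute02: plain'
)

$a = @{}   # temporary hash for the record being built
$results = @(
    foreach ($line in $reduced) {
        if ([string]::IsNullOrWhiteSpace($line)) {
            # blank line: emit the current record (if any) and start a new one
            if ($a.Count -gt 0) { [PSCustomObject]$a }
            $a = @{}
        }
        else {
            # split on the first ': ' only, so the value keeps any later ': '
            $key, $value = $line -split ': ', 2
            $a[$key] = $value
        }
    }
    # flush the final record when the input has no trailing blank line
    if ($a.Count -gt 0) { [PSCustomObject]$a }
)

$results | Format-Table -AutoSize
```

Whether any of this helps noticeably depends on the data, so wrapping each variant in Measure-Command, as the original script does, is the way to find out.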

【Comments】: