Answers: Convert a 4 GB fixed-column-width text file with no delimiters and 100+ columns to a trimmed, tab-delimited file

【Question title】: Convert 4 GB fixed column width text file with no delimiters and 100+ columns to a trimmed, tab delimited file
【Posted】: 2019-11-08 20:41:01
【Question description】:

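Every month I receive several very large (~4 GB) fixed-column-width text files that need to be imported into MS SQL Server. To import a file, it must first be converted to a text file with tab-delimited column values, with whitespace trimmed from each column value (some columns have no whitespace). I'd like to use PowerShell to solve this, and I'd like the code to be very, very fast.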

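I have tried many iterations of code, but so far each has been too slow or has not worked. I tried the Microsoft Text Parser (far too slow). I tried regex matching. I'm on a Windows 7 machine with PowerShell 5.1 installed.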

 ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
 10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 

$infile = "C:\Testing\IN_AND_OUT_FILES\srctst.txt"
$outfile = "C:\Testing\IN_AND_OUT_FILES\outtst.txt"

$batch = 1

[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'
[regex]$replace_regex = "`${1}`t`${2}`t`${3}`t`${4}`t`${5}`t`${6}`t`${7}`t`${8}`t`${9}`t`${10}`t`${11}`t`${12}`t`${13}`t`${14}`t`${15}`t`${16}`t`${17}`t`${18}`t`${19}`t`${20}`t`${21}`t`${22}`t`${23}`t`${24}`t`${25}`t`${26}`t`${27}`t`${28}`t`${29}`t`${30}`t`${31}`t`${32}`t`${33}"

Get-Content $infile -ReadCount $batch |

    foreach {

        $_ -replace $match_regex, $replace_regex | Out-File $outfile -Append

    }

Thanks for any help you can provide!

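【Question comments】:

$element = $_.trim() won't produce anything, because you aren't using ForEach-Object. $element = $element.trim() will give better results.
If you want to turn a list into a tab-delimited string, simply use $list -join "`t".
[1] -ReadCount defaults to 1, so you gain nothing that way. [grin] [2] \t is not valid in PoSh for a tab; PowerShell uses a backtick instead of a backslash. It may work in a dotnet [regex] call, though. [3] Have you looked at StreamReader? That is the usual recommendation for fast text file reads/writes. [4] You can use $Matches[1..($Matches.Count -1)] -join "`t" to build your tab-delimited string from the capture groups.
Don't use $Input as a regular variable; it is reserved in PowerShell as an automatic variable.
@Mark The thanks are appreciated, but note that for commenters here to be notified of your follow-up comments you must @-mention them; the catch is that you can only @-mention one user per comment - see meta.stackexchange.com/a/43020/248777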

Tags: c# .net regex powershell

【Solution 1】:

The switch statement with the -File option is the fastest way to process large files in PowerShell^[1]:

& { 
  switch -File $infile -Regex  {
    $match_regex {
       # Trim what each capture group matched and join the results with a tab char.
       $Matches[1..($Matches.Count-1)].Trim() -join "`t"
    }
  }
} | Out-File $outFile # or: Set-Content $outFile (beware encoding issues)

^{For text output, Out-File and Set-Content can be used interchangeably, but in Windows PowerShell they use different character encodings by default (UTF-16LE vs. ANSI); use -Encoding as needed. PowerShell Core consistently uses BOM-less UTF-8.}
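
As a small illustration of the encoding note above, the following sketch writes the same tab-separated line with both cmdlets while naming the encoding explicitly. The temp-file path and the variable name $demoFile are assumptions made for the demo; which encoding SQL Server expects depends on your import settings.

```powershell
# Hypothetical output path, used only for this demo.
$demoFile = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding_demo.txt'

# State the encoding explicitly instead of relying on per-cmdlet defaults.
# In Windows PowerShell, 'utf8' means UTF-8 *with* a BOM.
"a`tb" | Out-File    $demoFile -Encoding utf8
"a`tb" | Set-Content $demoFile -Encoding utf8

$readBack = Get-Content $demoFile
$readBack

Remove-Item $demoFile
```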

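Notes:

To skip the header row or capture it separately, either give it its own regex or, if the header also matches the data-row regex, initialize a line-index variable before the switch statement (e.g., $i = 0) and check and increment that variable in the processing script block (e.g., if ($i++ -eq 0) { ... }).
.Trim() is implicitly called on each string in the array returned by $Matches[1..($Matches.Count-1)]; this feature is called member-access enumeration.
The switch statement is wrapped in & { ... } (a script block ({ ... }) invoked with the call operator (&)) because compound statements such as switch, while, foreach (...), ... are not directly supported as pipeline input - see this GitHub issue.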
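
The header-handling note can be sketched as follows. This is a minimal illustration rather than the answer's own code: the sample file, the three-column regex, and the names $sample, $rx, $header and $rows are all made up for the demo. Because the result is assigned to a variable instead of being piped, the & { ... } wrapper is not needed here.

```powershell
# Hypothetical sample: one header line and two data lines, 3 fixed-width columns.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'fixedwidth_demo.txt'
Set-Content -Path $sample -Value @(
  'ID   NAME      CITY '
  '1    MINNIE    PARIS'
  '2    MICKEY    ROME '
)

# Column widths 5, 10, 5. The header matches this regex too, so lines are counted.
$rx = '^(.{5})(.{10})(.{5})$'

$i = 0            # line index, initialized before the switch statement
$header = $null
$rows = switch -File $sample -Regex {
  $rx {
    # Trim every capture group and join the results with tabs.
    $fields = $Matches[1..($Matches.Count - 1)].Trim() -join "`t"
    if ($i++ -eq 0) { $header = $fields } else { $fields }
  }
}

$header   # the header, captured separately
$rows     # the two data rows only

Remove-Item $sample
```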

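As for what you tried:

As iRon points out, you shouldn't use $Input as a user variable - it is an automatic variable managed by PowerShell, and, in fact, whatever you assign to it is quietly discarded.

As AdminOfThings points out:

$element = $_.trim() doesn't work, because you are inside a foreach loop, not in a pipeline with the ForEach-Object cmdlet (even though the latter is also aliased to foreach; only ForEach-Object sets $_ to the current input object).
There is no need for a custom function to join the elements of an array with a separator; the -join operator does that directly, as shown above.

Lee_Daily shows how to use -join directly with the $Matches array, as shown above.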
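
The difference between the two foreach forms can be seen in a short sketch; the variable names ($list, $trimmed1, $trimmed2) are illustrative only.

```powershell
$list = ' a ', ' b ', ' c '

# foreach *statement*: the loop variable is the one you name ($element);
# this construct does not set $_.
$trimmed1 = foreach ($element in $list) { $element.Trim() }

# ForEach-Object *cmdlet* in a pipeline: $_ is the current input object.
$trimmed2 = $list | ForEach-Object { $_.Trim() }

# -join takes the place of a hand-written join function.
$joined1 = $trimmed1 -join "`t"
$joined2 = $trimmed2 -join "`t"
$joined1
$joined2
```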

Some asides:

Join-Str($matches)

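You should use Join-Str $matches instead:

In PowerShell, functions are called like shell commands - foo arg1 arg2 - not like C# methods - foo(arg1, arg2); see Get-Help about_Parsing.
If you use , to separate arguments, you construct an array that the function sees as a single argument.
To prevent accidental use of method syntax, use Set-StrictMode -Version 2 or higher, but note its other effects.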
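
A small sketch makes the array pitfall visible. Show-ArgCount is a hypothetical function written only for this illustration.

```powershell
# Hypothetical function that reports how its arguments were bound.
function Show-ArgCount {
  param($First, $Second)
  "First has $(@($First).Count) element(s); Second is null: $($null -eq $Second)"
}

$shellStyle  = Show-ArgCount 1 2     # two separate arguments
$methodStyle = Show-ArgCount(1, 2)   # ONE argument: the array 1, 2

$shellStyle    # First has 1 element(s); Second is null: False
$methodStyle   # First has 2 element(s); Second is null: True
```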

| Out-Null

An almost always faster way to suppress output is to use $null = ... instead.
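
For illustration, here are the suppression forms side by side, using ArrayList.Add() as an example of a method whose return value would otherwise become output. The [void] cast is an additional option not mentioned in the answer; no timings are claimed here.

```powershell
$list = [System.Collections.ArrayList]::new()

# ArrayList.Add() returns the index of the added element.
$null = $list.Add('a')        # assignment to $null discards the return value
$list.Add('b') | Out-Null     # same effect, but via the pipeline
[void] $list.Add('c')         # casting to [void] also discards it

$list.Count   # 3
```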

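^{[1] Mark (the OP) reports a dramatic speed-up compared to the Get-Content + ForEach-Object approach in the question (the switch solution takes 7.7 minutes for a 4 GB file).

While the switch solution is probably fast enough in most cases, this answer shows a solution that may be faster for high iteration counts; this answer contrasts it with the switch solution and shows benchmarks with varying iteration counts.

Beyond that, a compiled solution, written in C# for example, is the only way to improve performance further.}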

【Discussion】:

【Solution 2】:

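Here is a high-level version of my working code. Note that using System.IO.StreamReader was essential for bringing the processing time down to an acceptable level. Thanks for all the help that got me here.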

Function Get-Regx-Data-Format() {
    Param ([String] $filename)

    if ($filename -eq 'FILE NAME') {
        [regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
    }
    return $match_regex
}

Foreach ($file in $cutoff_files) {

  $starttime_for_file = (Get-Date)
  $source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
  $source_path = $source_dir + $source_file

  $parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
  $parse_file_path = $parse_target_dir + $parse_file

  $error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
  $error_file_path = $error_target_dir + $error_file

  [regex]$match_data_regex = Get-Regx-Data-Format $file

  Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
  Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue

  [long]$matched_cnt = 0
  [long]$unmatched_cnt = 0
  [long]$loop_counter = 0
  [boolean]$has_header_row=$true
  [int]$field_cnt=0
  [int]$previous_field_cnt=0
  [int]$array_length=0

  $parse_minutes = Measure-Command {
    try {
        $stream_log = [System.IO.StreamReader]::new($source_path)
        $stream_in = [System.IO.StreamReader]::new($source_path)
        $stream_out = [System.IO.StreamWriter]::new($parse_file_path)
        $stream_err = [System.IO.StreamWriter]::new($error_file_path)

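        # Compare against $null explicitly: an empty line ('') is falsy and
        # would otherwise end the loop before the end of the file.
        while ($null -ne ($line = $stream_in.ReadLine())) {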

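          if ($line -match $match_data_regex) {
              # index of the last capture group ($Matches[0] is the whole match)
              $array_length = $Matches.Count - 1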

              #if matched and it's the header, parse and write to the beg of output file
              if (($loop_counter -eq 0) -and $has_header_row) {
                  $stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))

              } else {
                  $previous_field_cnt = $field_cnt

                  #add year month to line start, trim and join every captured field w/tabs
                  $stream_out.WriteLine("$proc_yyyymm`t" + `
                         ($Matches[1..($array_length)].Trim() -join "`t"))

                  $matched_cnt++
                  $field_cnt=$Matches.Count

                  if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
                    write-host "`nError on line $($loop_counter + 1). `
                                The field count does not match the previous correctly `
                                formatted (non-error) row."
                  }

              }
          } else {
              if (($loop_counter -eq 0) -and $has_header_row) {
                #if the header, write to the beginning of the output file
                  $stream_out.WriteLine($line)
              } else {
                $stream_err.WriteLine($line)
                $unmatched_cnt++
              }
          }
          $loop_counter++
       }
    } finally {
        $stream_in.Dispose()
        $stream_out.Dispose()
        $stream_err.Dispose()
        $stream_log.Dispose()
    }
  } | Select-Object -Property TotalMinutes

  write-host "`n$file_list_idx. File $file parsing results....`nMatched Count = 
  $matched_cnt  UnMatched Count = $unmatched_cnt  Parse Minutes = $parse_minutes`n"

  $file_list_idx++

  $endtime_for_file = (Get-Date)
  write-host "`nEnded processing file at $endtime_for_file"

  $TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
  $Hrs_for_file = $TimeDiff_for_file.Hours
  $Mins_for_file = $TimeDiff_for_file.Minutes
  $Secs_for_file = $TimeDiff_for_file.Seconds 
  write-host "`nElapsed Time for file $file processing: 
  $Hrs_for_file`:$Mins_for_file`:$Secs_for_file"

}

$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds 
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"
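
The core StreamReader/StreamWriter read loop can be reduced to the following minimal sketch. It is an illustration under stated assumptions, not the poster's production code: the temp-file paths, the two-column regex, and the names $inPath, $outPath and $rx are made up for the demo. Full paths are used because .NET resolves relative paths against the process's working directory, which can differ from PowerShell's current location.

```powershell
$tmp     = [System.IO.Path]::GetTempPath()
$inPath  = Join-Path $tmp 'sr_demo_in.txt'    # hypothetical input file
$outPath = Join-Path $tmp 'sr_demo_out.txt'   # hypothetical output file

# Two data lines (widths 5 and 10) with an empty line in between.
Set-Content -Path $inPath -Value '1    MINNIE    ', '', '2    MICKEY    '

$rx = [regex] '^(.{5})(.{10})$'
$reader = $writer = $null
try {
  $reader = [System.IO.StreamReader]::new($inPath)
  $writer = [System.IO.StreamWriter]::new($outPath)
  # ReadLine() returns $null at end of file; an empty line is '' and must
  # not stop the loop, hence the explicit $null comparison.
  while ($null -ne ($line = $reader.ReadLine())) {
    if ($line -match $rx) {
      $writer.WriteLine(($Matches[1..($Matches.Count - 1)].Trim() -join "`t"))
    }
  }
} finally {
  # Dispose only what was actually created.
  if ($reader) { $reader.Dispose() }
  if ($writer) { $writer.Dispose() }
}

$result = Get-Content $outPath
$result

Remove-Item $inPath, $outPath
```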

【Discussion】:

@mklement0 I thought you might want to look over my solution. System.IO.StreamReader reduced the processing time by about 75%.