【Question title】: Extract text from a large file using PowerShell
【Posted】: 2018-11-29 15:05:39
【Question】:

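We have an application that generates many large log files. I want to parse them with PowerShell and get the output either as CSV or as text delimited by "|". I tried Select-String but could not get the expected result. The log format and the expected result are posted below.

Log file data:

How can I achieve the above result using PowerShell?

Thanks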

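【Comments】:

  • Your desired output format has a single Start date header but two columns, time and date? In general, I would use one RegEx to split the log into records and a second, rather complex one with (named) capture groups to grab the data.
  • Thanks, I have corrected the question to include the time.
  • Nothing built into PowerShell will read the file, magically work out which blocks of lines are relevant, and reformat them for you. You can certainly use Get-Content to read the file, along with all the other available cmdlets, but this is just an Extract-Transform-Load problem. If your files are big (gigabytes), it may be more efficient to solve this in your favorite programming language.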
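
For files in the gigabyte range, one option that stays within PowerShell is to process the log one line at a time instead of loading it whole. The following is only a minimal sketch under assumptions: the function name `Read-LogRecord` and the paths `.\app.log` and `.\app.csv` are hypothetical, and the field labels are taken from the log sample posted in this thread. It uses `switch -Regex -File`, which walks the file line by line, and emits one object per `Record Id` block.

```powershell
function Read-LogRecord {
    param([Parameter(Mandatory)][string]$Path)

    $record = [ordered]@{}
    switch -Regex -File $Path {
        '^Record Id\s+=>' {
            # A new record starts: emit the previous one, if any fields were collected.
            if ($record.Count) { [PSCustomObject]$record }
            $record = [ordered]@{}
            continue
        }
        '^Start Time\s+=>\s*(\S+)\s+Start Date\s+=>\s*(\S+)' {
            $record['Start Time'] = $Matches[1]
            $record['Start Date'] = $Matches[2]
            continue
        }
        '^(Submitter Id|Message Text|Src File|Dest File)\s+=>\s*(.*\S)' {
            # $Matches[1] is the label, $Matches[2] the value without trailing blanks.
            $record[$Matches[1]] = $Matches[2]
            continue
        }
    }
    # Emit the last record of the file.
    if ($record.Count) { [PSCustomObject]$record }
}

$LogFile = '.\app.log'   # hypothetical input path
if (Test-Path $LogFile) {
    Read-LogRecord -Path $LogFile |
        Select-Object 'Submitter Id', 'Start Time', 'Start Date', 'Message Text', 'Src File', 'Dest File' |
        Export-Csv -Path '.\app.csv' -Delimiter '|' -NoTypeInformation
}
```

Because the objects are streamed straight into Export-Csv, memory use should stay roughly constant regardless of file size; the trade-off is that each field needs its own line-level pattern.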

Tags: powershell


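【Solution 1】:

As I mentioned in my comment, you need to separate the records and then match your data with a complex regular expression.

See the RegEx live on regex101 and study the explanation of each element in the upper right corner of that page.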
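
To make the two ideas concrete before the full pattern, here is a small sketch on a made-up, shortened sample (the sample text and the variable names `$text`, `$pattern`, and `$parsed` are illustrative assumptions, not part of the original answer). A lookahead split cuts the text in front of every `Record Id` line without consuming it, and named capture groups then surface as keys of `$Matches`.

```powershell
# Made-up two-record sample, reduced to the fields needed for the demo.
$text = @'
Record Id         => STM
Start Time        => 00:02:51        Start Date     => 11/29/2018
Src File          => File1
Record Id         => STM
Start Time        => 00:02:52        Start Date     => 11/29/2018
Src File          => File2
'@

# Named groups: (?<Name>...) becomes $Matches.Name after a successful -match.
$pattern = 'Start Time +=> (?<StartTime>[0-9:]{8}) +Start Date +=> (?<StartDate>[0-9/]{10})\s+Src File +=> (?<SrcFile>\S+)'

# (?=^Record Id) is zero-width, so the split keeps 'Record Id' in each piece.
$parsed = @(
    $text -split '(?m)(?=^Record Id)' |
        Where-Object { $_.Trim() } |          # drop any empty piece
        ForEach-Object {
            if ($_ -match $pattern) {
                [PSCustomObject]@{
                    StartTime = $Matches.StartTime
                    StartDate = $Matches.StartDate
                    SrcFile   = $Matches.SrcFile
                }
            }
        }
)
$parsed | Format-Table -AutoSize
```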

This script:

## Q:\Test\2018\11\29\SO_53541952.ps1

$LogFile = '.\SO_53541952.log'
$CsvFile = '.\SO_53541952.csv'
$ExcelFile='.\SO_53541952.xlsx'

## see the regex live <https://regex101.com/r/1TWm7i/1>
## single-quoted so PowerShell never tries to expand the '$' anchors
$RE = [RegEx]'(?sm)^Submitter Id +=> (?<SubmitterId>.*?$).*?^Start Time +=> (?<StartTime>[0-9:]{8}) +Start Date +=> (?<StartDate>[0-9\/]{10}).*?^Message Text +=> (?<MessageText>.*?$).*?^Src File +=> (?<SrcFile>.*?$).*?^Dest File +=> (?<DestFile>.*?$)'

## split the raw text in front of every 'Record Id' line, then match each record
$Data = (Get-Content $LogFile -Raw) -split '(?sm)(?=^Record Id)' | ForEach-Object {
    If ($_ -match $RE){
        ## Trim() removes a trailing CR or blanks that '.*?$' picks up in CRLF files
        [PSCustomObject]@{
            'Submitter Id' = $Matches.SubmitterId.Trim()
            'Start Time'   = $Matches.StartTime
            'Start Date'   = $Matches.StartDate
            'Message Text' = $Matches.MessageText.Trim()
            'Src File'     = $Matches.SrcFile.Trim()
            'Dest File'    = $Matches.DestFile.Trim()
        }
    }
}
$Data | Format-Table -Auto
$Data | Export-Csv $CsvFile  -NoTypeInformation -Delimiter '|'

#$Data | Out-Gridview
## with the ImportExcel module you can directly generate an excel file
$Data | Export-Excel $ExcelFile -AutoSize # -Show

yields this sample output on screen (I modified the sample so the records are distinguishable):

> .\SO_53541952.ps1

Submitter Id Start Time Start Date Message Text           Src File Dest File
------------ ---------- ---------- ------------           -------- ---------
STMDA@432... 00:02:51   11/29/2018 Copy step successfu... File1... c\temp...
STMDA@432... 00:02:52   11/29/2018 Copy step successfu... File2... c\temp...
STMDA@432... 00:02:53   11/29/2018 Copy step successfu... File3... c\temp...
STMDA@432... 00:02:54   11/29/2018 Copy step successfu... File4... c\temp...

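With Doug Finke's ImportExcel module installed, you get an .xlsx file directly.

【Comments】:

  • Thanks for your help. It is very interesting how you used a regular expression to solve this. I would like to learn regular expressions; can you point me to where I should start?
  • Look up the theory on regular-expressions.info, then practice on regex101.com or similar sites.

【Solution 2】: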

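As LotPings suggested, you need to break the log file content into separate blocks. Then, using a regular expression, you can capture the required values and store them in objects that can be exported to a CSV file.

Something like this: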

$log = @"
------------------------------------------------------------------------------
Record Id         => STM
Process Name      => STMDA         Stat Log Time  => 00:02:59
Process Number    => 51657           Stat Log Date  => 11/29/2018
Submitter Id      => STMDA@4322
SNode User Id     => de34fc5

Start Time        => 00:02:59        Start Date     => 11/29/2018
Stop Time         => 00:02:59        Stop Date      => 11/29/2018

SNODE             => dfdvrvbsdfgg         
Completion Code   => 0 
Message Id        => ncpa
Message Text      => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y 
FASP=> N
From Node         => P
Src File          => File2
Dest File         => c\temp2
Src CCode         => 0              Dest CCode       => 0       
Src Msgid         => ncpa       Dest Msgid       => ncpa
Bytes Read        => 4000           Bytes Written    => 4010    
Records Read      => 5              Records Written  => 5       
Bytes Sent        => 4010           Bytes Received   => 4010    
RUs Sent          => 0              RUs Received     => 1       
------------------------------------------------------------------------------
Record Id         => STM
Process Name      => STMDA         Stat Log Time  => 00:02:59
Process Number    => 51657           Stat Log Date  => 11/29/2018
Submitter Id      => STMDA@4321
SNode User Id     => de34fc5

Start Time        => 00:02:59        Start Date     => 11/29/2018
Stop Time         => 00:02:59        Stop Date      => 11/29/2018

SNODE             => dfdvrvbsdfgg         
Completion Code   => 0 
Message Id        => ncpa
Message Text      => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y 
FASP=> N
From Node         => P
Src File          => File1
Dest File         => c\temp1
Src CCode         => 0              Dest CCode       => 0       
Src Msgid         => ncpa       Dest Msgid       => ncpa
Bytes Read        => 4000           Bytes Written    => 4010    
Records Read      => 5              Records Written  => 5       
Bytes Sent        => 4010           Bytes Received   => 4010    
RUs Sent          => 0              RUs Received     => 1       
------------------------------------------------------------------------------
Record Id         => STM
Process Name      => STMDA         Stat Log Time  => 00:02:59
Process Number    => 51657           Stat Log Date  => 11/29/2018
Submitter Id      => STMDA@4323
SNode User Id     => de34fc5

Start Time        => 00:02:59        Start Date     => 11/29/2018
Stop Time         => 00:02:59        Stop Date      => 11/29/2018

SNODE             => dfdvrvbsdfgg         
Completion Code   => 0 
Message Id        => ncpa
Message Text      => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y 
FASP=> N
From Node         => P
Src File          => File3
Dest File         => c\temp3
Src CCode         => 0              Dest CCode       => 0       
Src Msgid         => ncpa       Dest Msgid       => ncpa
Bytes Read        => 4000           Bytes Written    => 4010    
Records Read      => 5              Records Written  => 5       
Bytes Sent        => 4010           Bytes Received   => 4010    
RUs Sent          => 0              RUs Received     => 1       
------------------------------------------------------------------------------
Record Id         => STM
Process Name      => STMDA         Stat Log Time  => 00:02:59
Process Number    => 51657           Stat Log Date  => 11/29/2018
Submitter Id      => STMDA@4324
SNode User Id     => de34fc5

Start Time        => 00:02:59        Start Date     => 11/29/2018
Stop Time         => 00:02:59        Stop Date      => 11/29/2018

SNODE             => dfdvrvbsdfgg         
Completion Code   => 0 
Message Id        => ncpa
Message Text      => Copy step successful.
Ckpt=> Y Lkfl=> N Rstr=> N XLat=> Y 
FASP=> N
From Node         => P
Src File          => File4
Dest File         => c\temp4
Src CCode         => 0              Dest CCode       => 0       
Src Msgid         => ncpa       Dest Msgid       => ncpa
Bytes Read        => 4000           Bytes Written    => 4010    
Records Read      => 5              Records Written  => 5       
Bytes Sent        => 4010           Bytes Received   => 4010    
RUs Sent          => 0              RUs Received     => 1       
------------------------------------------------------------------------------
"@

# first break the log into 'Record Id' blocks
# note: [^-]+ ends a block at the first hyphen, so this assumes hyphens occur
# only in the dashed separator lines and never inside a record's values
$regex  = [regex] '(?m)(Record Id[^-]+)'
$blocks = @($regex.Matches($log) | ForEach-Object { $_.Value })

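# next, parse out the required values for each block and create objects to export
# note: inside [\w ,.;-_] the ';-_' part is a character range (';' through '_');
# that range also covers '=', '@' and '\', which is why a value like c\temp2 matches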
$blocks | ForEach-Object {
    if ($_ -match '(?s)Submitter Id\s+=>\s+(?<submitter>[^\s]+).+Start Time\s+=>\s+(?<starttime>[^\s]+)\s+Start Date\s+=>\s+(?<startdate>[^\s]+).+Message Text\s+=>\s+(?<messagetext>[\w ,.;-_]+).+Src File\s+=>\s+(?<sourcefile>[\w ,.;-_]+).+Dest File\s+=>\s+(?<destinationfile>[\w ,.;-_]+)') {
        [PSCustomObject]@{
            'Submitter Id' = $matches['submitter']
            'Start Time'   = $matches['starttime']
            'Start Date'   = $matches['startdate']
            'Message Text' = $matches['messagetext']
            'Src File'     = $matches['sourcefile']
            'Dest File'    = $matches['destinationfile']
        }
    }
} | Export-Csv -Path '<PATH_TO_YOUR_OUTPUT_CSV>' -Delimiter '|' -NoTypeInformation

This produces a CSV file with the following content:

"Submitter Id"|"Start Time"|"Start Date"|"Message Text"|"Src File"|"Dest File"
"STMDA@4322"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File2"|"c\temp2"
"STMDA@4321"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File1"|"c\temp1"
"STMDA@4323"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File3"|"c\temp3"
"STMDA@4324"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File4"|"c\temp4"

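As a quick sanity check, the pipe-delimited output can be read back into objects. This is only a sketch: the two lines are copied from the CSV content above, and the variable names `$csvLines` and `$rows` are illustrative. For a real file, `Import-Csv -Delimiter '|'` with the path of your output file would play the same role as `ConvertFrom-Csv` here.

```powershell
# Header plus first record, copied from the CSV content shown above.
$csvLines = @(
    '"Submitter Id"|"Start Time"|"Start Date"|"Message Text"|"Src File"|"Dest File"'
    '"STMDA@4322"|"00:02:59"|"11/29/2018"|"Copy step successful."|"File2"|"c\temp2"'
)

# The delimiter must match the one used for Export-Csv.
$rows = @($csvLines | ConvertFrom-Csv -Delimiter '|')

# Column names containing spaces need quoting when accessed as properties.
$rows | Where-Object { $_.'Src File' -eq 'File2' } |
    Select-Object 'Submitter Id', 'Dest File'
```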