【Question Title】: Split text file into smaller files based on size (Windows)
【Posted】: 2018-11-09 15:08:07
【Question】:

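Log (.txt) files are sometimes created that are too large to open (5GB+), and I need a way to split them into smaller, readable chunks for use in WordPad. This is on Windows Server 2008 R2.

I need a batch file, PowerShell, or similar solution. Ideally, the limit should be hard-coded so that each text file contains no more than 999 MB and never stops in the middle of a line.

I found a solution similar to what I need at https://gallery.technet.microsoft.com/scriptcenter/PowerShell-Split-large-log-6f2c4da0. It splits by line count and only works some of the time: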

############################################# 
# Split a log/text file into smaller chunks # 
############################################# 

# WARNING: This will take a long while with extremely large files and uses lots of memory to stage the file 

# Set the baseline counters  
# Set the line counter to 0  
$linecount = 0 

# Set the file counter to 1. This is used for the naming of the log files      
$filenumber = 1

# Prompt user for the path  
$sourcefilename = Read-Host "What is the full path and name of the log file to split? (e.g. D:\mylogfiles\mylog.txt)"   

# Prompt user for the destination folder to create the chunk files      
$destinationfolderpath = Read-Host "What is the path where you want to extract the content? (e.g. d:\yourpath\)"    
Write-Host "Please wait while the line count is calculated. This may take a while. No really, it could take a long time." 

# Find the current line count to present to the user before asking the new line count for chunk files  
Get-Content $sourcefilename | Measure-Object | ForEach-Object { $sourcelinecount = $_.Count }   

#Tell the user how large the current file is  
Write-Host "Your current file size is $sourcelinecount lines long"   

# Prompt user for the size of the new chunk files  
$destinationfilesize = Read-Host "How many lines will be in each new split file?"   

# the new size is a string, so we convert to integer and up 
# Set the upper boundary (maximum line count to write to each file)    
$maxsize = [int]$destinationfilesize     
Write-Host File is $sourcefilename - destination is $destinationfolderpath - new file line count will be $destinationfilesize 

# The process reads each line of the source file, writes it to the target log file and increments the line counter. When it reaches 100000 (approximately 50 MB of text data)  
$content = get-content $sourcefilename | % {
Add-Content $destinationfolderpath\splitlog$filenumber.txt "$_"    
$linecount ++   
If ($linecount -eq $maxsize) { 
    $filenumber++ 
    $linecount = 0    }  }   
# Clean up after your pet  
[gc]::collect()   
[gc]::WaitForPendingFinalizers()

However, when I run it, I get many errors in PowerShell similar to:

Add-Content : The process cannot access the file 'C:\Desktop\splitlog1.txt' 
because it is being used by another process...

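So I am asking for help fixing the code above, or for help creating a different or better solution.

【Comments】:

  • To avoid such huge log files in the first place, you might be interested in LogRotateWin...
  • @aschipfl I appreciate the suggestion, but it doesn't really help me.
  • I have been using a script derived from the same post without any problems. Based on the error you are seeing, you probably have the destination file open somewhere else. Are you running "Get-Content split-log1.txt -tail" in another shell?

Tags: windows powershell batch-file split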


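【Solution 1】:

OK, I accepted the challenge. Here is a function that should work for you. It splits text files by line, putting as many complete input lines into each output file as possible without exceeding size bytes.

Note: the output file size limit cannot be strictly enforced.

Example: the input file contains two very long strings, 1 MB each. If you try to split this file into 512 KB chunks, the resulting files will be 1 MB each.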
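
The reason is visible in the function's loop: the size check compares the bytes already written against the limit before each write, and then always writes the whole line. A minimal sketch of that rule, applied to line sizes only (no files involved), might look like this. `Get-SimulatedChunkSize` is a hypothetical helper made up for illustration and is not part of the answer's function.

```powershell
# Hypothetical helper (an assumption for illustration): simulates the rotation
# rule on a list of line sizes in bytes and returns the size of each chunk.
function Get-SimulatedChunkSize
{
    param
    (
        [long[]]$LineSize,
        [long]$MaxFileSize
    )

    $chunks = @()
    [long]$current = 0
    $open = $false   # whether a chunk is currently being filled

    foreach ($size in $LineSize)
    {
        # Rotate only when the open chunk has already reached the limit,
        # then write the whole line, however long it is.
        if ($open -and $current -ge $MaxFileSize)
        {
            $chunks += $current
            $current = 0
        }
        $open = $true
        $current += $size
    }

    if ($open) { $chunks += $current }
    return ,$chunks
}

# Two 1 MB lines with a 512 KB limit: each chunk holds one whole line.
Get-SimulatedChunkSize -LineSize 1MB, 1MB -MaxFileSize 512KB
# → 1048576
# → 1048576
```

With this rule a chunk can overshoot the limit by at most one line, which for ordinary log lines is a small amount.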

Function Split-FileByLine

<#
.Synopsis
    Split text file(s) by lines, put into each output file as many complete lines of input as possible without exceeding size bytes.

.Description
    Split text file(s) by lines, put into each output file as many complete lines of input as possible without exceeding size bytes.
    Note that the output file size limit can't be strictly enforced. Example: the input file contains two very long strings, 1MB each.
    If you try to split this file into 512KB chunks, the resulting files will be 1MB each.

    Split files will have the original file's name, followed by the "_part_" string and a counter. Example:
    Original file: large.log
    Split files: large_part_0.log, large_part_1.log, large_part_2.log, etc.

.Parameter FileName
    Array of strings, mandatory. Filename(s) to split.

.Parameter OutPath
    String. Folder where the split files will be stored. Will be created if it doesn't exist. Defaults to the current location.

.Parameter MaxFileSize
    Long, mandatory. Maximum output file size. When output file reaches this size, new file will be created.
    You can use PowerShell's multipliers: KB, MB, GB, TB, PB

.Parameter Encoding
    String. If not specified, script will use system's current ANSI code page to read the files.
    You can get other valid encodings for your system in PowerShell console like this:

    [System.Text.Encoding]::GetEncodings()

    Example:

    Unicode (UTF-7): utf-7
    Unicode (UTF-8): utf-8
    Western European (Windows): Windows-1252

.Example
    Split-FileByLine -FileName '.\large.log' -OutPath '.\splitted' -MaxFileSize 100MB -Verbose

    Split file "large.log" in current folder, write resulting files in subfolder "splitted", limit output file size to 100Mb, be verbose.

.Example
    Split-FileByLine -FileName '.\large.log' -OutPath '.\splitted' -MaxFileSize 100MB -Encoding 'utf-8'

    Split file "large.log" in current folder, write resulting files in subfolder "splitted", limit output file size to 100Mb, use UTF-8 encoding.

.Example
    Split-FileByLine -FileName '.\large_1.log', '.\large_2.log' -OutPath '.\splitted' -MaxFileSize 999MB

    Split files "large_1.log" and "large_2.log" in current folder, write resulting files in subfolder "splitted", limit output file size to 999MB.

.Example
    '.\large_1.log', '.\large_2.log' | Split-FileByLine -OutPath '.\splitted' -MaxFileSize 999MB

    Pipe files "large_1.log" and "large_2.log" in current folder to the function, write resulting files in subfolder "splitted", limit output file size to 999MB.

#>
function Split-FileByLine
{
    [CmdletBinding()]
    Param
    (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
        [string[]]$FileName,

        [Parameter(ValueFromPipelineByPropertyName = $true)]
        [string]$OutPath = (Get-Location -PSProvider FileSystem).Path,

        [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
        [long]$MaxFileSize,

        [Parameter(ValueFromPipelineByPropertyName = $true)]
        [string]$Encoding = 'Default'
    )

    Begin
    {
        # Scriptblocks for common tasks
        $DisposeInFile = {
            Write-Verbose 'Disposing StreamReader'
            $InFile.Close()
            $InFile.Dispose()
        }

        $DisposeOutFile = {
            Write-Verbose 'Disposing StreamWriter'
            $OutFile.Flush()
            $OutFile.Close()
            $OutFile.Dispose()
        }

        $NewStreamWriter = {
            Write-Verbose 'Creating StreamWriter'
            $OutFileName = Join-Path -Path $OutPath -ChildPath (
                '{0}_part_{1}{2}' -f [System.IO.Path]::GetFileNameWithoutExtension($_), $Counter, [System.IO.Path]::GetExtension($_)
            )

            $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
                $OutFileName,
                $false,
                $FileEncoding
            ) -ErrorAction Stop
            $OutFile.AutoFlush = $true
            Write-Verbose "Writing new file: $OutFileName"
        }
    }

    Process
    {
        if($Encoding -eq 'Default')
        {
            # Set default encoding
            $FileEncoding = [System.Text.Encoding]::Default
        }
        else
        {
            # Try to set user-specified encoding
            try
            {
                $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
            }
            catch
            {
                throw "Not valid encoding: $Encoding"
            }
        }

        Write-Verbose "Input file: $FileName"
        Write-Verbose "Output folder: $OutPath"

        if(!(Test-Path -Path $OutPath -PathType Container)){
            Write-Verbose "Folder doesn't exist, creating: $OutPath"
            $null = New-Item -Path $OutPath -ItemType Directory -ErrorAction Stop
        }

        $FileName | ForEach-Object {
            # Open input file
            $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
                $_,
                $FileEncoding
            ) -ErrorAction Stop
            Write-Verbose "Current file: $_"

            $Counter = 0
            $OutFile = $null

            # Read lines from input file
            while(($line = $InFile.ReadLine()) -ne $null)
            {
                if($OutFile -eq $null)
                {
                    # No output file, create StreamWriter
                    . $NewStreamWriter
                }
                else
                {
                    if($OutFile.BaseStream.Length -ge $MaxFileSize)
                    {
                        # Output file reached size limit, closing
                        Write-Verbose "OutFile length: $($OutFile.BaseStream.Length)"
                        . $DisposeOutFile
                        $Counter++
                        . $NewStreamWriter
                    }
                }

                # Write line to the output file
                $OutFile.WriteLine($line)
            }

            Write-Verbose "Finished processing file: $_"
            # Close open files and cleanup objects
            . $DisposeOutFile
            . $DisposeInFile
        }
    }
}

You can use it in a script like this:

function Split-FileByLine
{
    # function body here
}

$InputFile = 'c:\log\large.log'
$OutputDir = 'c:\log_split'

Split-FileByLine -FileName $InputFile -OutPath $OutputDir -MaxFileSize 999MB
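
A quick sanity check after splitting is to compare the line count of the source with the total across the parts. Byte totals can differ slightly, for example when line endings are normalized or a byte order mark is written at the start of each part, so this sketch compares line counts instead. `Get-LineCount` and `Test-SplitLineCount` are hypothetical helpers made up for illustration, and the paths simply reuse the ones from the usage example.

```powershell
# Hypothetical helper (an assumption for illustration): counts lines with a
# StreamReader, so the file is never loaded into memory all at once.
function Get-LineCount
{
    param([string]$Path)

    # Resolve to a full path: .NET does not follow PowerShell's current location.
    $fullPath = (Resolve-Path -Path $Path).ProviderPath
    $reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $fullPath
    try
    {
        $count = 0
        while ($null -ne $reader.ReadLine()) { $count++ }
        return $count
    }
    finally
    {
        $reader.Close()
    }
}

# Hypothetical helper: $true when the parts together hold as many lines as the source.
function Test-SplitLineCount
{
    param
    (
        [string]$Source,
        [string[]]$Part
    )

    $expected = Get-LineCount -Path $Source
    $actual = 0
    foreach ($p in $Part) { $actual += Get-LineCount -Path $p }
    return ($expected -eq $actual)
}

# Assumed paths, taken from the usage example; skipped if the source is absent.
$InputFile = 'c:\log\large.log'
$OutputDir = 'c:\log_split'
if (Test-Path -Path $InputFile)
{
    $parts = @(Get-ChildItem -Path $OutputDir -Filter 'large_part_*.log' | ForEach-Object { $_.FullName })
    Test-SplitLineCount -Source $InputFile -Part $parts
}
```

Counting lines reads every file once more, so on multi-gigabyte logs this check takes a while.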

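【Comments】:

  • Something seems off... I used it to split a 983,336KB file (200MB max per file) and it gave 4 files (204,801KB/204,801/204,801/164,136)... note that they do not add up to 983. Does that indicate data loss somewhere? If I split the file manually, the sizes do add up to the original.
  • @JavaBeast Weird, I'll check it.
  • @JavaBeast Yes, a bug in the counter caused the first split file to be overwritten. Check the updated version.
  • I know I should avoid comments like "Thanks!" but I can't stop myself. Thank you! The function works perfectly and saved me a lot of time. @beatcracker wins the internet for the day.
  • Very nice script, but not very efficient for very large files. I had a 1G file to split and set the limit to 200MB. However, I aborted the script after it still had not finished the first file after 15 minutes. Too long! I found another script that gave me what I needed in under a minute: stackoverflow.com/questions/1001776/…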
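
On the performance concern in the last comment: the function sets `AutoFlush = $true` and reads `BaseStream.Length` for every line, which may contribute to the slowdown, although that has not been measured here. A minimal sketch of a buffered variant, assuming it is acceptable to estimate the written size with `Encoding.GetByteCount` instead of asking the stream, could look like this. `Split-FileBySizeBuffered` and its parameters are hypothetical and are not the answer's function; the estimate ignores any byte order mark written at the start of each part.

```powershell
# Hypothetical variant (an assumption for illustration): same rotation rule,
# but the size is tracked in a counter and the writer stays buffered.
function Split-FileBySizeBuffered
{
    param
    (
        [string]$Path,       # full path of the file to split
        [string]$OutPath,    # existing folder (full path) for the parts
        [long]$MaxFileSize,
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
    )

    $name = [System.IO.Path]::GetFileNameWithoutExtension($Path)
    $ext = [System.IO.Path]::GetExtension($Path)
    $newLineBytes = $Encoding.GetByteCount([System.Environment]::NewLine)

    $reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $Path, $Encoding
    $writer = $null
    $counter = 0
    [long]$written = 0
    try
    {
        while ($null -ne ($line = $reader.ReadLine()))
        {
            if ($null -ne $writer -and $written -ge $MaxFileSize)
            {
                # Current part reached the limit: close it and start the next one
                $writer.Close()
                $writer = $null
                $counter++
                $written = 0
            }
            if ($null -eq $writer)
            {
                $partName = Join-Path -Path $OutPath -ChildPath ('{0}_part_{1}{2}' -f $name, $counter, $ext)
                $writer = New-Object -TypeName System.IO.StreamWriter -ArgumentList $partName, $false, $Encoding
            }
            $writer.WriteLine($line)
            # Estimated bytes for this line plus its line terminator
            $written += $Encoding.GetByteCount($line) + $newLineBytes
        }
    }
    finally
    {
        if ($null -ne $writer) { $writer.Close() }
        $reader.Close()
    }
}
```

It would be called with full paths, for example `Split-FileBySizeBuffered -Path 'c:\log\large.log' -OutPath 'c:\log_split' -MaxFileSize 999MB`, reusing the paths from the usage example above.
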
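【Solution 2】:

You could try the split tool from CoreUtils for Windows with the --line-bytes parameter:

--line-bytes=size

Put into each output file as many complete lines of the input as possible without exceeding size bytes. Individual lines or records longer than size bytes are broken into multiple files. size has the same format as for the --bytes option. If --separator is specified, then lines determines the number of records.

Example: split --line-bytes=999MB c:\logs\biglog.txt

【Comments】:

  • Thanks, but I cannot add or install any tools on the client workstations. I need a solution in the form of a single script file that I can simply hand to users.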