【Question title】: CSV formatting - strip qualifier from specific fields
【Posted】: 2015-03-11 18:51:03
【Question】:

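Apologies if this has been asked before, but I couldn't find anything similar.

I receive CSV output that uses " as the text qualifier around every field. I'm looking for an elegant way to reformat these files so that only specific (alphanumeric) fields keep the qualifier.

An example of what I receive: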

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"

The output I want looks like this:

"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25

Any suggestions or help are greatly appreciated!

As requested below, here are the first five lines of a sample file:

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","Recruit-Navy,XL#28-75","13.25","13.25"
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","High Peak-Navy,XL#21-18","36.75","36.75"
"TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384","","10/15/14","","1","Patriot-Red,L#26-35","25.50","25.50"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-Red/Gray,S#23-52","19.75","19.75"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-White/Gray,XL#23-56","19.75","19.75"

Note that this is only a sample; not all files are for Tri-Mountain.

【Comments】:

    Tags: csv export-to-csv


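    【Solution 1】:

    Since you didn't specify an OS or language, here is a PowerShell version.

    Because your CSV files are non-standard, I dropped my earlier attempt that used Import-CSV and switched to raw file processing. It should also be noticeably faster.

    The regex used to split the CSV comes from this question: How to split a string by comma ignoring comma in double quotes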
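To make the splitting rule concrete: the regex separates fields only at commas that sit outside a pair of double quotes. The sketch below shows the same idea in bash with a simple in-quotes flag instead of a regex. It is an illustration only; `split_csv_line` is a hypothetical helper, it assumes quotes are always balanced, and it is independent of the StripQuotes.ps1 script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the answer's script): print each field of
# one CSV line on its own line, splitting only on commas outside double quotes.
split_csv_line() {
    local line=$1 field='' ch
    local -i i in_quotes=0
    for ((i = 0; i < ${#line}; i++)); do
        ch=${line:i:1}
        if [[ $ch == '"' ]]; then
            in_quotes=$((1 - in_quotes))   # toggle on every quote character
            field+=$ch                     # keep the quote as part of the field
        elif [[ $ch == ',' ]] && ((in_quotes == 0)); then
            printf '%s\n' "$field"         # a comma outside quotes ends the field
            field=''
        else
            field+=$ch                     # includes commas inside quotes
        fi
    done
    printf '%s\n' "$field"                 # the last field has no trailing comma
}

split_csv_line '"F258273","Recruit-Navy,XL#28-75",3,"13.25"'
# Prints four lines:
#   "F258273"
#   "Recruit-Navy,XL#28-75"
#   3
#   "13.25"
```

The embedded comma in "Recruit-Navy,XL#28-75" stays inside its field because the flag is set when that comma is reached.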

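    Save the PowerShell script below as StripQuotes.ps1. It accepts the following parameters:

    • InPath: the folder to read CSVs from. If not specified, the current directory is used.
    • OutPath: the folder to save processed CSVs to. It is created if it doesn't exist.
    • Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list the other valid encodings for your system from the PowerShell console like this: [System.Text.Encoding]::GetEncodings()
    • Verbose: the script tells you what is happening via Write-Verbose messages.

    Example (run from the PowerShell console).

    Process all CSVs in the folder C:\CSVs_are_here, save the processed CSVs to the folder C:\Processed_CSVs, and be verbose: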

    .\StripQuotes.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Verbose
    

    The StripQuotes.ps1 script:

    Param
    (
        [Parameter(ValueFromPipelineByPropertyName = $true)]
        [ValidateScript({
            if(!(Test-Path -LiteralPath $_ -PathType Container))
            {
                throw "Input folder doesn't exist: $_"
            }
            $true
        })]
        [ValidateNotNullOrEmpty()]
        [string]$InPath = (Get-Location -PSProvider FileSystem).Path,
    
        [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
        [ValidateScript({
            if(!(Test-Path -LiteralPath $_ -PathType Container))
            {
                try
                {
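                    # -ErrorAction Stop makes a failure terminating so the catch block runs
                    New-Item -ItemType Directory -Path $_ -Force -ErrorAction Stop | Out-Null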
                }
                catch
                {
                    throw "Can't create output folder: $_"
                }
            }
            $true
        })]
        [ValidateNotNullOrEmpty()]
        [string]$OutPath,
    
        [Parameter(ValueFromPipelineByPropertyName = $true)]
        [string]$Encoding = 'Default'
    )
    
    
    if($Encoding -eq 'Default')
    {
        # Set default encoding
        $FileEncoding = [System.Text.Encoding]::Default
    }
    else
    {
        # Try to set user-specified encoding
        try
        {
            $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
        }
        catch
        {
            throw "Not valid encoding: $Encoding"
        }
    }
    
    $DQuotes = '"'
    $Separator = ','
    # https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
    $SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
    # Matches a single code point in the category "letter".
    $AlphaNumRegex = '\p{L}'
    
    Write-Verbose "Input folder: $InPath"
    Write-Verbose "Output folder: $OutPath"
    
    # Iterate over each CSV file in the $InPath
    Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
        ForEach-Object {
            Write-Verbose "Current file: $($_.FullName)"
            $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
                $_.FullName,
                $FileEncoding
            ) -ErrorAction Stop
            Write-Verbose 'Created new StreamReader'
    
            $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
                (Join-Path -Path $OutPath -ChildPath $_.Name),
                $false,
                $FileEncoding
            ) -ErrorAction Stop
            Write-Verbose 'Created new StreamWriter'
    
            Write-Verbose 'Processing file...'
            while(($line = $InFile.ReadLine()) -ne $null)
            {
                $tmp = $line -split $SplitRegex |
                            ForEach-Object {
                                # Strip double quotes, if any
                                $item = $_.Trim($DQuotes)
    
                                if($_ -match $AlphaNumRegex)
                                {
                                    # If field has at least one letter - wrap in quotes
                                    $DQuotes + $item + $DQuotes
                                }
                                else
                                {
                                    # Else, pass it as is
                                    $item
                                }
                            }
                # Write line to the new CSV file
                $OutFile.WriteLine($tmp -join $Separator)
            }
    
            Write-Verbose "Finished processing file: $($_.FullName)"
            Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"
    
            # Close open files and cleanup objects
            $OutFile.Flush()
            $OutFile.Close()
            $OutFile.Dispose()
    
            $InFile.Close()
            $InFile.Dispose()
        }
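
The quoting rule used by the script (strip the qualifier, then put it back only when the field contains at least one letter) can be tried on single fields. The sketch below restates that rule in bash rather than PowerShell. `requote_field` is a hypothetical name, and `[[:alpha:]]` is a locale-dependent stand-in for the script's `\p{L}`, so treat it as an approximation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the "quote only fields that contain a letter"
# rule to a single field.
requote_field() {
    local field=$1
    field=${field#\"}      # drop one leading double quote, if present
    field=${field%\"}      # drop one trailing double quote, if present
    if [[ $field == *[[:alpha:]]* ]]; then
        printf '"%s"\n' "$field"   # at least one letter: wrap in quotes
    else
        printf '%s\n' "$field"     # no letters: pass through bare
    fi
}

for f in '"F258273"' '"41016053"' '"10/16/14"' '"13.25"'; do
    requote_field "$f"
done
# Prints, one per line: "F258273"  41016053  10/16/14  13.25
```

Under this rule a value such as Z1023658 from the sample file keeps its quotes, because it contains a letter, while empty fields come out as empty.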
    

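    【Comments】:

    • Thanks for the suggestion! When I tried this I received an error: Import-Csv : Cannot process argument because the value of argument "name" is invalid. Change the value of the "name" argument and run the operation again. At C:\psigenoutput\convertcsv2.ps1:7 char:11 + Import-Csv
    • Hmm, do your CSV files have headers? It looks like they don't, or there are some blank columns at the start. If you update your question with an example, e.g. the first 5 lines of an actual file, I'll be able to fix the script.
    • @Jeff Btw, I could make the script scan a folder for CSV files and save the processed files next to them or in a subfolder, just name it. As it stands, you have to edit the script by hand to process each new file, which is tedious. What is your workflow for handling these files?
    • The raw csv files are generated and placed in a folder under the static names psdet.csv and pshead.csv. I can generate them in any folder; it doesn't matter. I need the script to pick up the files from a folder and, once corrected, place them in a different folder under the same names. I'll paste the first five lines of the file into the original question.
    • I'll test the updated script as soon as I can and let you know how it goes. Thanks!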
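    【Solution 2】:

    This problem highlights the difficulty of separating quotes from comma-delimited fields when the fields themselves contain embedded commas (e.g. "Recruit-Navy,XL#28-75"). From the shell there are many ways to approach it (while read, awk, etc.), but most eventually stumble over the embedded comma.

    One approach that does work is a brute-force, character-by-character parse of the line (below). It is not an elegant solution, but it will get you started. An alternative to a shell program is a compiled language such as C, where character handling is more robust. If you have any questions, leave a comment.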

    #!/bin/bash
    
    declare -a arr
    declare -i ct=0
    
    ## fill array with separated fields (preserving comma in fields)
    #  Note: the following is a single-line (w/continuations for readability)
    arr=( $( line='"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"'; \
    for ((i=0; i < ${#line}; i++)); do \
        if test "${line:i:1}" == ',' ; then \
            if test "${line:i+1:1}" == '"' -o "${line:i-1:1}" == '"' ; then \
                printf " "; \
            else \
                printf "%c" ${line:i:1}; \
            fi; \
        else \
            printf "%c" ${line:i:1}; \
        fi; \
    done; \
    printf "\n" ) )
    
    ## remove quotes from numeric fields (quoted fields whose first character is a digit)
    for i in "${arr[@]}"; do 
        if [[ "${i:0:1}" == '"' ]] && [[ ${i:1:1} == [0123456789] ]]; then
            arr[$ct]="${i//\"/}"
        else
            arr[$ct]="$i"
        fi
        if test "$ct" -eq 0 ; then
            printf "%s" "${arr[ct]}"
        else
            printf ",%s" "${arr[ct]}"
        fi
        ((ct++))
    done
    
    printf "\n"
    
    exit 0
    

    Output

    $ bash sepquoted.sh
    "TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25
    

    Original

    "TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"
    
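
The same character-by-character idea can be written without relying on shell word splitting, which matters for sample rows such as "High Peak-Navy,XL#21-18", where a field contains a space and would not survive the array assignment in the script above. The following is a minimal sketch, not the answer's script. `emit_field`, `requote_line` and `requote_stream` are hypothetical names, the letter test follows the rule from the PowerShell answer, and keeping the quotes on any field that still contains a comma is an added assumption so that a value like "1,000" does not change the column count. Doubled quotes inside a field are not handled.

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are illustrative.

# Print one unquoted field, adding quotes back when it contains a letter
# or (an added assumption) an embedded comma.
emit_field() {
    local field=$1
    if [[ $field == *[[:alpha:]]* || $field == *,* ]]; then
        printf '"%s"' "$field"
    else
        printf '%s' "$field"
    fi
}

# Rewrite one CSV line in a single left-to-right pass.
requote_line() {
    local line=$1 field='' sep='' ch
    local -i i in_quotes=0
    for ((i = 0; i < ${#line}; i++)); do
        ch=${line:i:1}
        if [[ $ch == '"' ]]; then
            in_quotes=$((1 - in_quotes))   # quotes only toggle state; they are not stored
        elif [[ $ch == ',' ]] && ((in_quotes == 0)); then
            printf '%s' "$sep"             # separator before every field but the first
            emit_field "$field"
            sep=','
            field=''
        else
            field+=$ch                     # includes spaces and commas inside quotes
        fi
    done
    printf '%s' "$sep"
    emit_field "$field"                    # the last field has no trailing comma
    printf '\n'
}

# Filter mode: rewrite every line read from standard input.
requote_stream() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        requote_line "$line"
    done
}
```

Used as a filter, something like `requote_stream < psdet.csv > corrected/psdet.csv` would process one file, where `corrected` is a placeholder folder name. Files with Windows line endings would need the trailing carriage return handled separately.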

    【Comments】:
