Remove Rows From CSV Where A Specific Column Matches An Input File: Answers

【Question Title】: Remove Rows From CSV Where A Specific Column Matches An Input File
【Posted】: 2012-07-13 12:52:15
【Question】:

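I have a CSV with multiple columns and rows [File1.csv].

I have another CSV file (with a single column) that lists specific words [File2.csv].

I want to remove a row from File1 if any of its columns matches any of the words listed in File2.

I originally used this: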

 grep -v -F -f File2.csv File1.csv > File3.csv

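This worked to a degree. The problem I ran into was with columns that contain more than one word (e.g. word1, word2, word3). File2 contained word2, but the row was not removed.

I tried spacing the words apart so they look like this: (word1 , word2 , word3), but the original command still did not work.

How can I remove a row that contains a word from File2 when the same column may also contain other words?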

【Question Comments】:

Tags: csv grep row

【Solution 1】:

One way, using awk.

Contents of script.awk:

BEGIN {
    ## Split each line on a double quote, optionally surrounded by spaces.
    FS = "[ ]*\"[ ]*"
}

## File with words, save them in a hash.
FNR == NR {
    words[ $2 ] = 1;
    next;
}

## File with multiple columns.
FNR < NR {
    ## Print the line as-is and skip the check if the eighth field has no
    ## interesting value ("N/A") or this is the first line of the file (header).
    if ( $8 == "N/A" || FNR == 1 ) {
        print $0
        next
    }

    ## Split the field of interest on commas. Traverse the pieces looking for
    ## a word saved from the first file. Print the line only if none is found.

    ## Change due to an error pointed out in comments.
    ##--> split( $8, array, /[ ]*,[ ]*/ )
    ##--> for ( i = 1; i <= length( array ); i++ ) {
    len = split( $8, array, /[ ]*,[ ]*/ )
    for ( i = 1; i <= len; i++ ) {
    ## END change.

        if ( array[ i ] in words ) {
            found = 1
            break
        }
    }
    if ( ! found ) {
        print $0
    }
    found = 0
}

Assuming File1.csv and File2.csv have the content provided in the comments on Thor's answer (I suggest adding that information to the question), run the script like this:

awk -f script.awk File2.csv File1.csv

With the following output:

"DNSName","IP","OS","CVE","Name","Risk"
"ex.example.com","1.2.3.4","Linux","N/A","HTTP 1.1 Protocol Detected","Information"
"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"
"ex.example.com","1.2.3.4","Linux","CVE-2012-2141","Net-SNMP Denial of Service (Zero-Day) - RHEL5","Medium"
"ex.example.com","1.2.3.4","Linux","N/A","Web Application index.php?s=-badrow Detected","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","Apache HTTPD Server Version Out Of Date","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","PHP Unsupported Version Detected","High"
"ex.example.com","1.2.3.4","Linux","N/A","HBSS Common Management Agent - UNIX/Linux","High"
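The original input files are not reproduced in this thread, so the following is only a minimal sketch for trying the approach end to end. The sample rows, the CVE identifiers, and the names `workdir`, `parts`, and `result` are made up for illustration; the awk program is a condensed form of the same idea as script.awk, not a replacement for it.

```shell
# Hypothetical sample data; the real File1.csv and File2.csv are not shown here.
workdir=$(mktemp -d)

cat > "$workdir/File2.csv" <<'EOF'
"CVE-2012-0001"
EOF

cat > "$workdir/File1.csv" <<'EOF'
"DNSName","IP","OS","CVE","Name","Risk"
"a.example.com","1.2.3.4","Linux","N/A","Row without CVE","Information"
"a.example.com","1.2.3.4","Linux","CVE-2012-0001","Single match","High"
"a.example.com","1.2.3.4","Linux","CVE-2011-0002, CVE-2012-0001","Match in a list","High"
"a.example.com","1.2.3.4","Linux","CVE-2011-0002","No match","Medium"
EOF

result=$(awk '
    # Fields are separated by double quotes, so the CVE column is field 8.
    BEGIN { FS = "[ ]*\"[ ]*" }
    # First file: remember every word to filter on.
    FNR == NR { words[$2] = 1; next }
    # Second file: always keep the header and rows without a CVE.
    FNR == 1 || $8 == "N/A" { print; next }
    {
        found = 0
        # split() returns the number of pieces, so length(array) is not needed.
        n = split($8, parts, /[ ]*,[ ]*/)
        for (i = 1; i <= n; i++)
            if (parts[i] in words) { found = 1; break }
        if (!found) print
    }
' "$workdir/File2.csv" "$workdir/File1.csv")

printf '%s\n' "$result"
rm -rf "$workdir"
```

With this sample data, the header, the "Row without CVE" row, and the "No match" row are kept, while both rows that mention CVE-2012-0001 are dropped. Storing the return value of `split` mirrors the change noted in the script: some awk implementations reject `length()` on an array.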

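【Comments】:

@eloscurosecreto: Are you running it directly from the command line? I mean, pasting it in without using a file.
I created a .awk file, and that was the problem. I just ran your code directly from the CLI and it completed without any issues. However, I checked the output against the removal file and found multiple instances of the words in rows that should have been removed. This approach seems to remove only rows where the column contains a single word (no ", word2, word3").
@eloscurosecreto: I edited my answer based on the samples provided in the comments on Thor's answer.
Thanks for helping me with this, Birei. I ran into a problem on line 24 of the script: I got awk: script.awk: line 24: illegal reference to array array. I actually tried to fix the problem before responding, but I am completely lost with awk.
@ethanpil: You can get both files from eloscurosecreto's comments on Thor's answer. The eighth field is correct because I changed the Field Separator variable, so fields are separated at each double-quote character rather than at each comma.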
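The field-separator point in the last comment can be checked directly. A minimal sketch, using one line from the output above as input; the variable names `line`, `by_quote`, and `by_comma` are only for illustration:

```shell
line='"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"'

# Splitting on double quotes: field 1 is empty, field 2 is the first value,
# field 3 is the comma between two quoted values, and so on. The fourth
# CSV column therefore lands in field 8, without its quotes.
by_quote=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ ]*\"[ ]*" } { print $8 }')

# Splitting on commas: the fourth CSV column is field 4, quotes included.
by_comma=$(printf '%s\n' "$line" | awk -F, '{ print $4 }')

printf 'split on quotes, field 8: %s\n' "$by_quote"   # CVE-2011-3048
printf 'split on commas, field 4: %s\n' "$by_comma"   # "CVE-2011-3048"
```

Splitting on quotes also keeps a multi-value cell such as "word1, word2" in one field, which a plain comma separator would break apart.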

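【Solution 2】:

You could split any line in File2.csv that contains multiple patterns into separate lines.

The command below uses tr to turn a line containing word1,word2 into separate lines before they are used as patterns. The <() construct temporarily acts as a file/fifo (tested in bash):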

grep -v -F -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv
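To see what each stage does, here is a small sketch with made-up data (the rows and the names `patterns_file` and `kept` are hypothetical). It writes the patterns to a temporary file instead of using <(), so it also runs in shells without process substitution; the grep options are the same as in the command above.

```shell
# Stage 1: tr turns every comma into a newline, giving one pattern per line.
patterns_file=$(mktemp)
printf 'word1,word2\nword3\n' | tr ',' '\n' > "$patterns_file"
cat "$patterns_file"
# word1
# word2
# word3

# Stage 2: -F treats each pattern as a fixed string, -f reads the patterns
# from the file, and -v keeps only the lines that match none of them.
kept=$(printf '%s\n' \
    'row1,alpha' \
    'row2,word2 and more' \
    'row3,beta word3' \
    'row4,gamma' \
    'row5,password1' | grep -v -F -f "$patterns_file")

printf '%s\n' "$kept"
# row1,alpha
# row4,gamma

rm -f "$patterns_file"
```

Note that row5 is removed as well: fixed-string patterns match anywhere in the line, so word1 also matches inside password1. If the patterns or the data are quoted, as in the sample output shown for Solution 1, the quotes are part of the match too.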

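【Comments】:

So I tried your approach, and I still get the same result as with grep -v -F -f File2.csv File1.csv > File3.csv
Then you need to show us exact samples of File1.csv and File2.csv. The above works for what you have provided so far.
Here are links to the files: - File1.csv - Files2.csv I hope this helps. Thanks!
No item in File2.csv matches anything in File1.csv.
Sorry, try these links: File1.csv & File2.csv.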