Removing rows in a dataset matching a value from a separate dataset

[Question title]: Removing rows in a dataset matching a value from a separate dataset
[Posted]: 2019-09-28 16:38:38
[Question description]:

I am having some trouble matching strings.

Say I have the following table:

broken
vector
unidentified
synthetic
artificial

I have a second dataset that looks like this:

org1    Fish
org2    Amphibian
org3    vector
org4    synthetic species
org5    Mammal

I want to remove from the second table every row that matches a string in the first table, so that the output looks like this:

org1    Fish
org2    Amphibian
org5    Mammal

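I was thinking of using grep -v in bash, but I am not sure how to make it loop over all the strings in table 1.

I am also trying to solve it in Perl, but for some reason it returns all of my values instead of only the ones that match. Any idea why?

My script looks like this: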

#!/bin/perl -w

($br_str, $dataset) = @ARGV;
open($fh, "<", $br_str) || die "Could not open file $br_str/n $!";

while (<$fh>) {
        $str = $_;
        push @strings, $str;
        next;
    }

open($fh2, "<", $dataset) || die "Could not open file $dataset $!/n";

while (<$fh2>) {
    chomp;
    @tmp = split /\t/, $_;
    $groups = $tmp[1];
    foreach $str(@strings){
        if ($str ne $groups){
            @working_lines = @tmp;
            next;
        }
    }
        print "@working_lines\n";
}

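[Question comments]:

Hi, I added chomp to my script and got the same result. It seems to read the first set fine, so I am not sure what the problem is.
See this post for another way to solve a similar problem.

Tags: loops perl match string-matching

[Solution 1]:

chomp your input and use a hash for your first table: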

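use warnings;
use strict;

my ( $br_str, $dataset ) = @ARGV;

# Load the strings to exclude as hash keys so each lookup is a single test.
open( my $fh, "<", $br_str ) || die "Could not open file $br_str: $!\n";

my %strings;
while (<$fh>) {
    chomp;    # strip the newline so the keys can match the dataset fields
    $strings{$_}++;
}
close $fh;

# Print each dataset row whose second field is not one of the excluded strings.
open( my $fh2, "<", $dataset ) || die "Could not open file $dataset: $!\n";
while (<$fh2>) {
    chomp;
    my @tmp    = split /\s+/, $_;
    my $groups = $tmp[1];
    print "$_\n" unless exists $strings{$groups};
}
close $fh2;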

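Note that I used \s+ instead of \t, just to make my copy/paste easier. A side effect: with \s+, "synthetic species" splits into separate fields, so $tmp[1] is just "synthetic" and the row is dropped. With \t the field would be "synthetic species", which is not a key in the hash, so the row would be kept.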
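
For the grep -v route mentioned in the question, here is a minimal sketch. It assumes a grep that supports the standard -v, -F and -f options and that mktemp is available. The file names strings.txt and dataset.txt are hypothetical, and the sample data is copied from the question. The -f option reads one pattern per line from a file, so no explicit loop over the strings is needed.

```shell
#!/bin/sh
# Hypothetical file names; the sample data is copied from the question.
workdir=$(mktemp -d)
printf '%s\n' broken vector unidentified synthetic artificial > "$workdir/strings.txt"
printf 'org1\tFish\norg2\tAmphibian\norg3\tvector\norg4\tsynthetic species\norg5\tMammal\n' > "$workdir/dataset.txt"

# -v: print only the lines that do NOT match
# -F: treat every pattern as a fixed string, not a regular expression
# -f: read the patterns from a file, one per line
result=$(grep -v -F -f "$workdir/strings.txt" "$workdir/dataset.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

Be aware that this matches the strings anywhere on the line, including the first column, and that an empty line in the pattern file matches every row. If only the second column should be tested, the hash lookup above is the safer choice.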

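[Comments]:

But are we sure they are whole words? What if they need to exclude synthetic-species for synthetic? I think a regex is safer than a lookup. If they are all whole words, then this is cool :)
Hi zdim, in my case I want it exactly like that, so that whenever it finds synthetic it removes the whole line, regardless of the rest of the string. So the script works like a charm :).
@Aletia Great then :) It should be mentioned in the question, or at least in the answer, that the text can only contain the keywords as whole words.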
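
The difference raised in these comments can be seen in a small sketch, assuming standard awk and grep plus mktemp. The file names and the "synthetic-species" row are made up for illustration. An exact lookup on the tab-separated second field keeps "synthetic-species", while a substring match removes it.

```shell
#!/bin/sh
# Hypothetical file names and data, chosen to show the two behaviours.
workdir=$(mktemp -d)
printf '%s\n' vector synthetic > "$workdir/strings.txt"
printf 'org3\tvector\norg4\tsynthetic-species\norg5\tMammal\n' > "$workdir/dataset.txt"

# Exact lookup: load the first file into an array, then keep only the rows
# whose tab-separated second field is not one of its keys.
exact=$(awk -F'\t' 'NR == FNR { drop[$0]; next } !($2 in drop)' \
    "$workdir/strings.txt" "$workdir/dataset.txt")

# Substring match: drop every row that contains any of the strings.
substring=$(grep -v -F -f "$workdir/strings.txt" "$workdir/dataset.txt")

printf 'exact lookup:\n%s\n\nsubstring match:\n%s\n' "$exact" "$substring"

rm -r "$workdir"
```

Which one is right depends on the data: the asker wanted the whole line removed whenever the keyword appears, which is the substring behaviour.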