Select data in a text file from a separate list - perl or unix: answers

【Question title】: Select data in a text file from a separate list - perl or unix
【Posted】: 2013-02-24 02:32:26
【Question description】:

I have a huge tab-delimited file that looks like this:

contig04733 contig00012 77
contig00546 contig01344 12
contig08943 contig00001 14
contig00765 contig03125 88
etc.

I have a separate tab-delimited file that contains only a subset of these contig pairs, like this:

contig04733 contig00012
contig08943 contig00001
etc.

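I want to extract the lines of the first file that correspond to the pairs listed in the second file into a new file. In this particular dataset, I think the order within each pair should be the same in both files. But I would also like to know what to do if, say, I have: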

file1 contig08943 contig00001 14

but in file2 it is

contig00001 contig08943

and I still want that combination. Is it possible to script that too?

My code is below.

use strict;
use warnings;

#open the contig pairs list
open (PAIRS, "$ARGV[0]") or die "Error opening the input file with contig pairs";

#hash to store contig IDs - I think?!
my $pairs;

#read through the pairs list and read into memory?
while(<PAIRS>){
    chomp $_; #get rid of ending whitepace
    $pairs->{$_} = 1;
}
close(PAIRS);

#open data file
open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n";
while(<DATA>){
    chomp $_;
    my ($contigs, $identity) = split("\t", $_);
    if (defined $pairs->{$contigs}) {
        print STDOUT "$_\n";
    }
}
close(DATA);

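【Question discussion】:

Also, if possible, I would like to transform column 3 of the first file: divide the number by 100 and then take its square root. E.g. the first line of the new file in this example would be contig04733 contig00012 0.877
Show us your code. What have you tried?
@GregBacon - I tried this: use strict; use warnings; #open the contig pairs list open (PAIRS, "$ARGV[0]") or die "Error opening the input file with contig pairs"; #hash to store contig IDs - I think?! my $pairs; #read through the pairs list and read into memory? while(<PAIRS>){ chomp $_; #get rid of ending whitepace $pairs->{$_} = 1; } close(PAIRS); #open data file open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n"; while(<DATA>){ chomp $_; my ($contigs, $identity) = split("\t", $_); if (defined $pairs->{$contigs}) { print STDOUT "$_\n"; } } close(DATA);
@GregBacon But this would require me to remove the gap/tab between the first two columns and then separate them again, I think... Sorry, I'm completely new to perl, so I'm just trying to stitch together examples from the web!

Tags: perl unix text-processing csv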

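【Solution 1】:

Stitch together the code below, without the running commentary, to get a working program. We start with typical front matter that instructs perl to give you helpful warnings if you make common mistakes.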

#! /usr/bin/env perl

use strict;
use warnings;

It is always a nice touch to show your user how to invoke your program correctly when necessary.

die "Usage: $0 master subset\n" unless @ARGV == 2;

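With read_subset, the program reads the second file named on the command line. Because your question indicates that you do not care about the order within a pair, that is,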

contig00001 contig08943

is equivalent to

contig08943 contig00001

the code increments both $subset{$p1}{$p2} and $subset{$p2}{$p1}.

sub read_subset {
  my($path) = @_;

  my %subset;
  open my $fh, "<", $path or die "$0: open $path: $!";
  while (<$fh>) {
    chomp;
    my($p1,$p2) = split /\t/;
    ++$subset{$p1}{$p2};
    ++$subset{$p2}{$p1};
  }

  %subset;
}

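It is very common in Perl programs to use a hash to mark events that your program has observed. In fact, many examples in the Perl FAQ use a hash named %seen, as in "I have seen this one."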
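As a minimal illustration of the %seen idiom (the sample IDs below are made up for the example and are not from the asker's data), a hash can record which values have already been observed so that repeats are skipped:

```perl
use strict;
use warnings;

# Hypothetical sample IDs, used only for illustration.
my @ids = qw(contig00001 contig08943 contig00001 contig04733 contig08943);

my %seen;
# $seen{$_}++ yields the count *before* incrementing, so the test
# is true only the first time each ID is encountered.
my @unique = grep { !$seen{$_}++ } @ids;

print "$_\n" for @unique;
```

The %subset hash in this answer plays the same role, except that it is filled from the subset file first and only consulted while reading the master file.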

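Removing the second command-line argument with pop leaves only the master file, which lets the program read all input lines easily with while (<>) { ... }. With %subset populated, the code splits each line into fields and skips any line whose pair is not marked as seen. Everything that passes this filter is printed on the standard output.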

my %subset = read_subset pop @ARGV;
while (<>) {
  my($f1,$f2) = split /\t/;
  next unless $subset{$f1}{$f2};
  print;
}

For example:

$ cat file1
contig04733 contig00012 77
contig00546 contig01344 12
contig08943 contig00001 14
contig00765 contig03125 88

$ cat file2
contig04733 contig00012
contig00001 contig08943

$ perl extract-subset file1 file2
contig04733 contig00012 77
contig08943 contig00001 14

To create a new file containing the selected subset, redirect the standard output as in

$ perl extract-subset file1 file2 >my-subset
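The follow-up request in the question comments (divide column 3 by 100, then take the square root) could be layered on the same filter. The sketch below is one possible way, not part of the original answer: transform_line is a hypothetical helper, the %subset literal stands in for what read_subset would return, and the three-decimal format is an assumption taken from the asker's 0.877 example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the line with column 3 replaced by
# sqrt(value / 100), or undef when the pair is not in the subset.
sub transform_line {
    my ($line, $subset) = @_;
    chomp $line;
    my ($f1, $f2, $value) = split /\t/, $line;
    # Check the outer key first so the lookup does not autovivify entries.
    return unless defined $value && $subset->{$f1} && $subset->{$f1}{$f2};
    # Three decimals is an assumption based on the example in the question.
    return join "\t", $f1, $f2, sprintf("%.3f", sqrt($value / 100));
}

# Stand-in for the hash that read_subset builds (both orders are stored).
my %subset = (
    contig04733 => { contig00012 => 1 },
    contig00012 => { contig04733 => 1 },
    contig00001 => { contig08943 => 1 },
    contig08943 => { contig00001 => 1 },
);

my @lines = (
    "contig04733\tcontig00012\t77\n",
    "contig00546\tcontig01344\t12\n",
    "contig08943\tcontig00001\t14\n",
);

for my $line (@lines) {
    my $out = transform_line($line, \%subset);
    print "$out\n" if defined $out;
}
```

With the sample lines above, this prints the contig04733 pair with 0.877 and the contig08943 pair with 0.374, and skips the pair that is not in the subset.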

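【Discussion】:

Thank you very much - especially for writing it so that I can learn from it!

【Solution 2】:

Try using a hash of hashes keyed on the two contig IDs (after splitting each line).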

use strict;
use warnings;

# usage: perl script.pl data_file pairs_file
die "Usage: $0 data_file pairs_file\n" unless @ARGV == 2;

# open the three-column data file (contig, contig, identity)
open(my $data_fh, "<", $ARGV[0]) or die "Error opening the data file $ARGV[0]: $!\n";

# read the data file into memory as a hash of hashes
my %all_contigs;
while (<$data_fh>) {
    chomp; # remove the trailing newline
    my @parts = split /\t/; # e.g. ('contig04733', 'contig00012', 77)
    next unless @parts >= 3; # skip blank or malformed lines
    # store the entire row, keyed on both contig IDs
    $all_contigs{$parts[0]}{$parts[1]} = $_;
}
close($data_fh);

# open the two-column contig pairs list
open(my $pairs_fh, "<", $ARGV[1]) or die "Error opening the contig pairs file $ARGV[1]: $!\n";
while (<$pairs_fh>) {
    chomp;
    my ($first, $second) = split /\t/;
    next unless defined $first and defined $second;
    # see if we find a row, one way or the other
    my $found_row = $all_contigs{$first}{$second}
        || $all_contigs{$second}{$first};
    next unless $found_row;
    # split the stored row and transform column 3
    my @parts = split /\t/, $found_row;
    my $value = sqrt($parts[2] / 100);
    # same column order as in the data file
    my $modified_row = join "\t", $parts[0], $parts[1], $value;
    # or, to keep the column order found in the pairs list:
    # my $modified_row = join "\t", $first, $second, $value;

    # print $found_row instead to output the unmodified row
    print STDOUT $modified_row, "\n";
}
close($pairs_fh);

【Discussion】: