【Question title】: Closest value different files, with different number of lines and other conditions ( bash awk other)
【Posted】: 2016-02-10 00:28:52
【Question description】:

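I have to revive an old question, with a modification for long files.

I have the ages of two stars in two files (File1 and File2). The age of the star is in column $1, and the remaining columns up to $13 hold information that I need to print at the end.

I am trying to find the ages at which the two stars have the same age, or the closest one. Since the files are large (~25000 lines), I do not want to search the whole array, for speed reasons. The files can also differ a lot in number of lines (by ~10000 in some cases).

I am not sure this is the best way to solve the problem, but for lack of a better one, this is my idea. (If you have a faster and more efficient method, please share it.)

All values have a precision of 12 decimals. For now I am only concerned with the first column (where the age is).

And I need different loops.

Let's use this value from file 1: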

2.326062371284e+05

First, the routine should search file2 for all the matches that contain

2.3260e+05

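(This loop will probably search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261, that would save some time.)

If it finds just one match, the output should be that value.

Usually it will find several lines, maybe even up to 1000. In that case, it should search again for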

2.32606e+05

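among the lines found before. (I think this is a nested loop.) The number of matches will then drop to ~200.

At this point, the routine should search for the smallest difference, within a certain tolerance X, between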

2.326062371284e+05

and all of these 200 lines.
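The narrowing steps above could also be collapsed into a single binary search, assuming the ages in file2 are sorted in ascending order (as they are in the sample data). This is a different technique from the prefix matching described in the question, not the asker's method, and the names `nearest` and `file2_ages` are hypothetical. A minimal sketch:

```python
from bisect import bisect_left


def nearest(sorted_ages, target):
    """Return the value in sorted_ages closest to target.

    Assumes sorted_ages is sorted ascending. Ties go to the smaller value.
    """
    if not sorted_ages:
        raise ValueError("sorted_ages must not be empty")
    # Index of the first element >= target.
    pos = bisect_left(sorted_ages, target)
    # Only the elements just below and just above the target can be closest.
    candidates = sorted_ages[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda age: abs(age - target))


# Ages from the sample File2 in the question.
file2_ages = [2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
              2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
              3.054865681626e+05]

closest = nearest(file2_ages, 2.326062371284e+05)
print(f"{closest:.12e}")  # prints 2.354895663228e+05
```

Each lookup costs O(log n) comparisons, so no early-stop logic or nested loop is needed even for ~25000 lines.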

So, given these files

File1

1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1

File2

2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2

Output File3 (tolerance 3000)

2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2

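Important condition:

The output should not contain repeated lines (for a fixed age of star 1 there cannot be several different ages of star 2, only the closest one).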
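One possible reading of this condition is that each file2 age may appear at most once in the output, paired with the file1 age closest to it. The sketch below follows that interpretation, which is an assumption; `drop_repeats` and `pairs` are hypothetical names, and the input is taken to be (file1 age, nearest file2 age) pairs that have already been computed.

```python
def drop_repeats(pairs, tolerance):
    """Keep at most one (age1, age2) pair per age2: the closest one.

    pairs: iterable of (age1, age2), where age2 is the file2 age nearest
    to age1. Pairs whose difference exceeds the tolerance are dropped.
    """
    best = {}
    for age1, age2 in pairs:
        diff = abs(age1 - age2)
        if diff > tolerance:
            continue
        # Replace the stored pair only if this one is strictly closer.
        if age2 not in best or diff < abs(best[age2][0] - age2):
            best[age2] = (age1, age2)
    return sorted(best.values())


# Nearest-neighbour pairs for the sample File1 and File2 in the question.
pairs = [
    (1.833800650355e+05, 2.210729203510e+05),
    (1.959443501406e+05, 2.210729203510e+05),
    (2.085086352458e+05, 2.210729203510e+05),
    (2.210729203510e+05, 2.210729203510e+05),
    (2.326062371284e+05, 2.354895663228e+05),
    (2.441395539059e+05, 2.499062122946e+05),
    (2.556728706833e+05, 2.499062122946e+05),
]

for age1, age2 in drop_repeats(pairs, tolerance=3000):
    print(f"{age1:.12e} {age2:.12e}")
```

With a tolerance of 3000 this prints the two age pairs shown in File3; the remaining columns would be carried along with each pair in the same way.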

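How would you solve this?

Thanks a lot!

PS: I have completely changed the question, since it was pointed out to me that my reasoning had some errors. Thanks!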

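【Question comments】:

Is this a maximum tolerance? (I.e., if we find a row whose difference is under X, do we stop there even if it is not the closest?)
I think there should be a tolerance, but I don't know how to define it. If the difference between the best closest values is too big, it should not find an answer. (What is big? 10 years. The time column is in years.)
When you say the nearest 100 rows (up and down), does that mean 100 rows before plus 100 after, or 50 before plus 50 after? If the current row is 10 rows from the start, does that mean the 10 rows before and 50 (or 100) rows after, or the 10 rows before and 90 rows after, or something else? Basically, edit your question to explain exactly what that statement means, with an example. Also, to make it easy for us to test, edit your question to reduce your files to 10 lines or fewer and show the expected output for that sample input if your window were 4 rows instead of 100.
If you do not have the NR column in your input, don't include it in your example. We can count. Ditto for the first 2 lines. Make your files testable: we don't want to edit your files and guess which lines and/or columns we need to remove to recreate your real input format. Just post your real input and output format.
It is generally better to ask a new question as a new question.

Tags: bash awk sed gawk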

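【Solution 1】:

This is not an awk solution, but other tools can be great too, so here is an answer using R.

New answer with different data, this time not reading from files, to build a self-contained example: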

# Sample data for the code; use fread to read from a file and setnames to name the columns accordingly
library(data.table) # install.packages('data.table') if it's not present
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))

setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column

# get the result
result=data[ # To get all the data from file 1 and file 2 at the end
         data2[ 
           data, # Search for each star of list 1
           .SD, # return columns of file 2
           roll='nearest',by=.EACHI, # Join on each line (left join) and find nearest value
          .SDcols=c('age','name','dens')]
       ][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # filter duplicates in first file and on difference
# Write the results to a file (change the separator as you wish):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")

Code:

# A nice package to have, install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")

stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")

# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"

# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')

# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]

# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))

# Final filter on difference
result[abs(age.1 - age) < 3e3]

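So the interesting part is the first "join" on the two lists of star ages, which searches, for each age in stars1, the nearest one in stars2.

This gives (after renaming the columns):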

> result
        age    age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2

Now that we have the nearest for each, filter those that are close enough (here, an absolute difference below 3000):

> result[abs(age.1 - age) < 3e3]
        age    age.1
1: 221072.9 221072.9
2: 232606.2 235489.6
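For readers who do not use R, the pairing produced by the rolling join followed by the tolerance filter can be approximated with a linear sweep over both sorted lists. This is a sketch of the idea only, not how data.table implements `roll="nearest"`, and tie handling may differ; `nearest_join`, `stars1` and `stars2` here are hypothetical Python names that mirror the R variables.

```python
def nearest_join(left, right):
    """Pair each value in sorted `left` with the closest value in sorted `right`.

    Both lists must be sorted ascending. Ties go to the larger right value.
    """
    if not right:
        return []
    result = []
    j = 0
    for value in left:
        # Move right while the next candidate is at least as close;
        # j never moves backwards because both lists are sorted.
        while j + 1 < len(right) and abs(right[j + 1] - value) <= abs(right[j] - value):
            j += 1
        result.append((value, right[j]))
    return result


stars1 = [1.833800650355e+05, 1.959443501406e+05, 2.085086352458e+05,
          2.210729203510e+05, 2.326062371284e+05, 2.441395539059e+05,
          2.556728706833e+05]
stars2 = [2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
          2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
          3.054865681626e+05]

joined = nearest_join(stars1, stars2)
# Same final filter as in the R code: absolute difference below 3000.
close = [(a, b) for a, b in joined if abs(b - a) < 3e3]
for a, b in close:
    print(f"{a:.1f} {b:.1f}")
```

The sweep visits each element of both lists at most once, so it runs in O(n + m) time once the inputs are sorted.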

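【Comments】:

Thanks @Tensibai, I am installing R and trying to run the code.
Pay attention to the comments, especially the first one about installing the package :) I'm not sure what you want to do in the end, but I feel R could be an interesting tool if you wish to plot the stars according to their relative distance or do some statistical analysis on the data.
I had planned to learn R this year... but sadly I have not found the time. I just installed it and tried to install the package, and I get this: Warning message: package 'data.table' is not available (for R version 3.2.3). Guess what version I have (3.2.3).
I am on Ubuntu 14.04. Fixed: I changed the mirror and it works. The code runs and works with the example. Now I will try it with the real files.
Could you modify it to save the results to a file? @Tensibai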

【Solution 2】:

Perl to the rescue. This should be very fast, as it does a binary search within the given range.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use List::Util qw{ max min };
use constant { SIZE      => 100,
               TOLERANCE => 3000,
           };


my @times2;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
    chomp;
    push @times2, $_;
}

my $num = 0;
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
    chomp $time;

    my $from = max(0, $num - SIZE);
    my $to   = min($#times2, $num + SIZE);
    my $between;
    while (1) {
        $between = int(($from + $to) / 2);

        if ($time < $times2[$between] && $to != $between) {
            $to = $between;

        } elsif ($time > $times2[$between] && $from != $between) {
            $from = $between;

        } else {
            last
        }
    }
    $num++;
    if ($from != $to) {
        my $f = $time - $times2[$from];
        my $t = $times2[$to] - $time;
        $between = ($f > $t) ? $to : $from;
    }
    say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}
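The Perl script restricts each binary search to a window of SIZE lines around the current line number of file1. A rough Python sketch of that idea follows; it is not a line-by-line port, `nearest_in_window` is a hypothetical name, and it assumes the list is non-empty, sorted ascending, and that `center` is a valid index.

```python
from bisect import bisect_left


def nearest_in_window(sorted_ages, target, center, size):
    """Return the index of the value closest to target, looking only at
    indices in [center - size, center + size].

    Assumes sorted_ages is sorted ascending and 0 <= center < len(sorted_ages).
    """
    lo = max(0, center - size)
    hi = min(len(sorted_ages), center + size + 1)
    # Binary search limited to the half-open slice [lo, hi).
    pos = bisect_left(sorted_ages, target, lo, hi)
    candidates = [i for i in (pos - 1, pos) if lo <= i < hi]
    return min(candidates, key=lambda i: abs(sorted_ages[i] - target))


ages2 = [2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
         2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
         3.054865681626e+05]

# A wide window covers the whole list and finds the true nearest value.
wide = nearest_in_window(ages2, 2.326062371284e+05, center=4, size=100)
# A window that is too narrow can miss it and settle for the window edge.
narrow = nearest_in_window(ages2, 2.326062371284e+05, center=4, size=1)
print(wide, narrow)  # prints 1 3
```

The window keeps each search short, but it has to be at least as wide as the largest offset between matching line numbers in the two files, otherwise the closest value can fall outside it.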

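【Comments】:

I ran it, but I get this: Missing right curly or square bracket at closest.pl line 37, at end of line. syntax error at closest.pl line 37, at EOF. Execution of closest.pl aborted due to compilation errors. After adding the right curly at line 37, I get this: Global symbol "$nr" requires explicit package name at closest.pl line 38. Global symbol "$time" requires explicit package name at closest.pl line 38. Global symbol "$between" requires explicit package name at closest.pl line 38. Execution of closest.pl aborted due to compilation errors.
It works for me. What Perl version do you have? perl -v. Have you noticed that there is a scrollbar on the right?
With or without the right curly? Version: (v5.18.2)
@Nikko: As posted here, with nothing added (you probably did not copy the closing curly bracket and then added it manually in the wrong place).
I have updated the question. In principle it works, but I am expecting just two lines of output. Could you fix it?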