Variable check and summary: answers

[Question title]: Variable check and summary out
[Posted]: 2015-04-01 00:24:14
[Question description]:

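Problem/Question

I'm trying to run a simple check on a list of variables in a dataset (revenue, costs, profits and vcost): for each variable, take the largest and second-largest values, check whether their sum is greater than 90% of that variable's total, and flag the variable if it is. I also want to check that the largest value is no more than 60% of the total.

I got some help from "Macro that outputs table with testing results of SAS table", but now I'm trying to answer a more basic question. It doesn't seem hard, but I can't figure out how to set up the basic table at the end.

I know all the variable names.

Here is a sample dataset I created: https://www.dropbox.com/s/x575w5d551uu47p/dataset%20%281%29.csv?dl=0

Desired output

I want to turn this basic table:

into another table like this:

Reproducible example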

/* Create some dummy data with four variables to assess */
data have;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

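[Question discussion]:

How many records do you have in the real scenario?
@NEOmen What do you mean by records? If you mean observations, thousands. If you mean variables, a few dozen.
By records I mean observations... the individual rows in a dataset are usually called records.
@NEOmen Sorry, I only know them as obs.
This still isn't a good question. There's no clear question in it; it still comes down to "here's a spec, please do my work for me". Again, unless you're paying $100 an hour, that isn't appropriate. While there are evidently a few people here willing to work for free, in general we need a specific question: not "desired output" but "how do I do X", something concrete, focused, and based on code you've written, or something that could be answered with a particular function or a generally useful technique rather than code for your specific job.

Tags: sas

[Solution 1]:

Based on your comment on the previous answer, it looks like top_2_total is the sum of the 2 largest values. For that you need a few extra steps. I'm using proc transpose and a data step to get what the previous answer already achieves. I've coded a PROC SUMMARY to get the top 2 maximum values and reuse that dataset to build the final answer. Let me know if this helps.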

data have;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

proc transpose data=have out=want prefix=top_;
    var revenue--vcost;
run;

data want;
set want end=eof;
    array top(*) top_3-top_1;
    call sortn(of top[*]);
    total=sum(of top[*]);
run;
/* Getting the maximum 2 total values using PROC SUMMARY*/
proc summary data=want nway;
    output out=total_top_2_rec(drop=_:) idgroup(max(total) out[2](total)=);
run;

data want;
/* Loop to get the values from previous step and generate TOP_2_TOTAL variable */
if _n_=1 then set total_top_2_rec;
    top_2_total=sum(total_1,total_2);

set want;
    if sum(top_1,top_2) > 0.9  * top_2_total then Flag90=1; else Flag90=0;
    if top_1 > top_2_total * 0.6 then Flag60=1; else Flag60=0;

drop total_1 total_2;
run;

proc print data=want;run;
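
The sorting step above works because call sortn sorts its arguments in ascending order, and the array lists the variables in reverse, so the largest value ends up in top_1. A minimal sketch of that trick on a single row, with made-up values (the dataset name demo and the numbers are assumptions for illustration; the variables are listed explicitly here instead of using the reversed range):

```sas
/* Illustrative values only; "demo" is a hypothetical dataset name */
data demo;
    top_1 = 0.2;
    top_2 = 0.9;
    top_3 = 0.5;
    /* Variables are listed in reverse, so top[1] refers to top_3 */
    array top(*) top_3 top_2 top_1;
    /* Ascending sort: smallest value goes to top_3, largest to top_1 */
    call sortn(of top[*]);
    /* Writes top_1=0.9 top_2=0.5 top_3=0.2 to the log */
    put top_1= top_2= top_3=;
run;
```

After this step top_1 and top_2 hold the two largest values in the row, which is what the flag checks need.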

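Edit: I've added logic before my PROC TRANSPOSE where you can list the variables to include in the calculation; the code handles the rest. After that, whoever runs the code doesn't need to make any manual changes. The variables should be entered as a space-separated list.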

data have;
infile 'C:\dataset (1).csv' missover dsd dlm=',' firstobs=2;
input firm v1 v2 v3;
run;

/* add/remove columns here to consider variable */
%let variable_to_consider=v1 
                          v2 
                          v3
                          ;

%let variable_to_consider=%cmpres(&variable_to_consider);
proc sql noprint;
  select count(*) into : obs_count from have;
quit;
%let obs_count=&obs_count;

proc transpose data=have out=want prefix=top_;
    var &variable_to_consider; 
run;

data want;
set want end=eof;
    array top(*) top_&obs_count.-top_1;
    x=dim(top);
    call sortn(of top[*]);
    total=sum(of top[*]);

keep total top_1 top_2 _name_;
run;

/* Getting the maximum 2 total values using PROC SUMMARY*/
proc summary data=want nway;
    output out=total_top_2_rec(drop=_:) idgroup(max(total) out[2](total)=);
run;

data want;
/* Loop to get the values from previous step and generate TOP_2_TOTAL variable */
if _n_=1 then set total_top_2_rec;
    top_2_total=sum(total_1,total_2);

set want;
    if sum(top_1,top_2) > 0.9  * top_2_total then Flag90=1; else Flag90=0;
    if top_1 > top_2_total * 0.6 then Flag60=1; else Flag60=0;

drop total_1 total_2;
run;

proc print data=want;run;

EDIT 2014-04-05: As mentioned, I've updated the logic and fixed the problem. The updated code is below.

data have1;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

data have2;
infile 'dataset (1).csv' missover dsd dlm=',' firstobs=2;
input firm v1 v2 v3;
run;
/* add/remove columns here to consider variable */

%macro mymacro(input_dataset= ,output_dataset=, variable_to_consider=);

%let variable_to_consider=%cmpres(&variable_to_consider);
proc sql noprint;
  select count(*) into : obs_count from &input_dataset;
quit;
%let obs_count=&obs_count;

proc transpose data=&input_dataset out=&output_dataset prefix=top_;
    var &variable_to_consider; 
run;

data &output_dataset;
set &output_dataset end=eof;
    array top(*) top_&obs_count.-top_1;
    x=dim(top);
    call sortn(of top[*]);
    total=sum(of top[*]);

top_2_total=sum(top_1, top_2);
    if sum(top_1,top_2) > 0.9  * total then Flag90=1; else Flag90=0;
    if top_1 > total * 0.6 then Flag60=1; else Flag60=0;

keep _name_ total top_1 top_2 top_2_total Flag60 Flag90;

run;
%mend mymacro;

%mymacro(input_dataset=have1, output_dataset=want1 ,variable_to_consider=revenue costs profits vcost)
%mymacro(input_dataset=have2, output_dataset=want2 ,variable_to_consider=v1 v2 v3 )


proc print data=want1;run;
proc print data=want2;run;

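[Discussion]:

Thanks. This works, but I only need the top 2 printed. When I add other variables and observations from the dataset, it prints all of them and no longer ranks them, so top1 and top2 aren't the top observations for each variable.
Thanks, but this doesn't work when I apply it to a larger dataset. I've added a sample dataset to the question.
My bad, I mixed things up. Multitasking can really mess things up sometimes. I've spotted the mistake and fixed it. Try again!
Actually, I was just playing with this and found that top_2_total is exactly the same, so something is off. Running your first example shows this.
Could you check the log after running the code? I'm guessing you may have entered a variable in the %let statement that isn't in the dataset. That's probably the only reason top_2_total would be the same across two consecutive runs of the code on different data; proc print would then be printing the older results.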

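[Solution 2]:

The hardest part here is extracting the top 2 values of each variable. That is simple in most SQL implementations, but in SAS I don't think proc sql supports a select top n syntax.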
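
For what it's worth, one workaround is the OUTOBS= option on the proc sql statement, which caps the number of rows a query returns. A minimal sketch for a single variable, assuming the have dataset from the question (the output table name top2_revenue is made up for illustration); it would still need one query per variable, so it doesn't remove the need for the approaches below:

```sas
/* OUTOBS=2 limits the query result to its first 2 rows.
   SAS writes a warning to the log that the statement ended early. */
proc sql outobs=2;
    /* top2_revenue is a hypothetical table name */
    create table top2_revenue as
    select revenue
    from have
    order by revenue desc;
quit;
```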

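I can think of a few possible approaches:

Sort the dataset in descending order by each variable of interest in turn, retrieve the values from the first 2 observations, transpose, and append them all together. This is very inefficient because of the multiple sorts, and it is no simpler than the other approaches.
Write a (fairly complex) data step to extract the top 2 values of each variable.
Get proc univariate to extract the highest values for you, then transpose the output dataset into the right format.

Data step approach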

data top2;
  array v{4} revenue costs profits vcost;
  array top1{4} (4*0);
  array top2{4} (4*0);
  set have end = eof;
  do i = 1 to 4;
    if v[i] > top1[i] then do;
      top2[i] = top1[i];
      top1[i] = v[i];
    end;
    if top2[i] < v[i] < top1[i] then top2[i] = v[i];
  end;
  length varname $8;
  if eof then do i = 1 to 4;
    varname = vname(v[i]);
    top_1    = top1[i];
    top_2    = top2[i];
    top_2_total = top_1 + top_2;
    output;
  end;
  keep varname top_:;
run;

Proc univariate approach

ods _all_ close;
ods output extremeobs = extremeobs(keep = varname high);
proc univariate data = have(drop = firm);
run;
ods listing;

data top2_b;
    set extremeobs;
    by varname notsorted;
    if first.varname then do;
        i = 0;
        call missing(top_2);
    end;
    i + 1;
    retain top_2;
    if i = 4 then top_2 = high;
    if i = 5 then do;
        top_1 = high;
        top_2_total = top_1 + top_2;
        output;
    end;
    drop i high;
run;

Once you have this, you can merge it with your existing simple table from proc means / proc summary and compute any further measures of interest.
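
A rough sketch of that merge, assuming the top2 dataset from the data step approach and the four variables from the question. The dataset names totals, totals_long, top2_sorted and checked are made up for illustration, and the flag definitions follow the 90% / 60% rules described in the question:

```sas
/* Column totals for the variables of interest */
proc summary data=have;
    var revenue costs profits vcost;
    output out=totals(drop=_type_ _freq_) sum=;
run;

/* Reshape to one row per variable: varname + total */
proc transpose data=totals
               out=totals_long(rename=(col1=total))
               name=varname;
run;

/* Both inputs must be sorted by the merge key */
proc sort data=top2 out=top2_sorted;
    by varname;
run;
proc sort data=totals_long;
    by varname;
run;

data checked;
    /* Common length for the key, since the two inputs define it differently */
    length varname $32;
    merge top2_sorted totals_long;
    by varname;
    /* A comparison evaluates to 1 when true and 0 when false */
    Flag90 = (top_2_total > 0.9 * total);
    Flag60 = (top_1 > 0.6 * total);
run;

proc print data=checked; run;
```

SAS may still note in the log that varname has different lengths in the two inputs; the names are short enough here that nothing is truncated.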

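[Discussion]:

Thanks. This is very close, but how would I add the checks? I tried adding sum = Total; Flag90 = top_2_total > Total* 0.9; Flag60 = top_1 > Total * 0.6; but it doesn't print those in the summary table. How would I add the checks in the data step approach? Thanks for the great answer.
Any other help? Thanks
I've gotten this far, but I still need to get the total and check the calculation against the sum of all the variables... data top2; set top2; if sum(top_1 + top_2) > 0.9 * top_2_total then Flag90=1; else Flag90=0; if top_1 > top_2_total * 0.6 then Flag60=1; else Flag60=0; run;
Based on your earlier comment, your variable flag90 will always equal 0, because top_2_total = top_1+top_2 and you're checking sum(top_1 + top_2) > 0.9 * top_2_total, which will always be false. Could you recheck the logic for flag90?
@sushil Yes, I know, that's what I'm trying to fix... I want top_2_total to equal the sum of the listed variable... top_2_total should be the sum variable so that the 90% rule applies, but I can't figure it out... Thanks

[Solution 3]:

In the last step, flag1 and flag2 hold a positive integer when the numerator is greater than or equal to the denominator, and zero when the numerator is smaller than the denominator.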

data have(drop=firm);
    do firm = 1 to 4;
        VarName = 'Variable';
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

Proc Transpose data=have out=transout
name=Variable
prefix=Var_;
run;

options Mprint;

%Macro calcflag(Varlist);
proc sql;
create table outtable as
select Variable,
sum(&Varlist) as Sum_var,
Largest(1,&Varlist) as Top_1,
Largest(2,&Varlist) as Top_2,
sum(Largest(1,&Varlist),Largest(2,&Varlist)) as Top_2_total,
floor(sum(Largest(1,&Varlist),Largest(2,&Varlist))/(sum(&Varlist)*0.9)) as flag1,
floor(Largest(1,&Varlist)/(sum(&Varlist)*0.6)) as flag2 
from transout;
quit;
%mend;

%calcflag(%str(Var_1,Var_2,Var_3,Var_4));
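
To see how the floor-of-a-ratio flags in the query above behave, here is a small worked example with made-up numbers (the dataset name flag_demo and all values are assumptions for illustration). It also shows the same checks written as plain comparisons, which some readers may find easier to follow:

```sas
/* Made-up values for a single variable, for illustration only */
data flag_demo;
    sum_var = 2.0;
    top_1   = 1.3;
    top_2   = 0.4;
    top_2_total = sum(top_1, top_2);                     /* 1.7 */

    /* Ratio form, as in the query above */
    flag1 = floor(top_2_total / (sum_var * 0.9));        /* floor(1.7/1.8) = 0 */
    flag2 = floor(top_1 / (sum_var * 0.6));              /* floor(1.3/1.2) = 1 */

    /* Comparison form: 1 when the inequality holds, 0 otherwise */
    flag90 = (top_2_total >= sum_var * 0.9);             /* 0 */
    flag60 = (top_1 >= sum_var * 0.6);                   /* 1 */

    put flag1= flag2= flag90= flag60=;
run;
```

For non-negative data the ratios cannot reach 2, so the floor form only ever yields 0 or 1. Note that both forms flag on "greater than or equal to", whereas the question asks for strictly "greater than"; with the comparison form that is a matter of swapping >= for >.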

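[Discussion]:

Thanks. This works, but I don't need all the variables printed, just sum, top1, top2, flag90 and flag60 for each variable. So no var1-var4.
I've edited the select list to remove Var_1 through Var_4. Hope this works. As I said before, if you have a huge dataset, the two steps can be condensed into one. This is the simplest approach I could think of that doesn't involve any sorting, merging, and so on. Good luck!
Thanks. I do have a large dataset. How would I condense the two steps?
Done; I also had the 0.6 and 0.9 multipliers wrong before. I hope I've understood it this time!
You need to do two things: 1. change the dataset name in the sql from "transout" to whatever yours is called, and 2. change the macro call to: %calcflag(%str(Total,Mean,Median,StDev));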