SAS - grouping pairs: answers

【Question title】: SAS - grouping pairs
【Posted】: 2015-11-27 08:20:33
【Question description】:

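I have two variables, ID1 and ID2. Both are the same type of identifier. When two IDs appear in the same row of data, they belong to the same group. I want to create a group identifier for each ID. For example, I have

then I want

because 1, 2, 4, 5 and 6 are paired with each other in some combination in the original data, so they share a group. 3 and 7 are only paired with each other, so they form a new group. I want to do this for about 20,000 rows. Every ID that appears in ID1 also appears in ID2 (more specifically, if an observation has ID1=1 and ID2=2, then there is another observation with ID1=2 and ID2=1).

I tried merging them back and forth, but that didn't work. I also tried call symput, creating a macro variable for each ID's group and updating it as I went through the rows, but I couldn't get that to work either.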
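For orientation, what is being asked for amounts to finding the connected components of a graph whose nodes are the IDs and whose edges are the rows. The following is a minimal sketch of that idea in Python rather than SAS, using a union-find structure; the function name `group_ids` and the numbering of groups by smallest id are assumptions made for illustration, not something taken from the question.

```python
def group_ids(pairs):
    """Assign a group number to every id; ids linked by any chain of pairs share a group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root, halving the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # the smaller root wins

    # Number the groups 1, 2, ... in order of their smallest id
    roots = {}
    groups = {}
    for x in sorted(parent):
        groups[x] = roots.setdefault(find(x), len(roots) + 1)
    return groups


# The ten rows used as the original test case in the answers
pairs = [(1, 4), (1, 5), (2, 5), (2, 6), (3, 7),
         (4, 1), (5, 1), (5, 2), (6, 2), (7, 3)]
groups = group_ids(pairs)
```

On these rows, ids 1, 2, 4, 5 and 6 come out in group 1 and ids 3 and 7 in group 2. The result does not depend on the order of the rows or on whether the reversed copy of each pair is present.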

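【Question discussion】:

Can you run PROC BOM? I forget the package name right now.
I'd expect a hash-based approach to be feasible too for 20k rows.
I can't make sense of how proc bom works; could you give an example of how it would help me?
@Reeza It's in SAS/OR.
I've posted a hash-based answer - please confirm whether it does what you expect.

Tags: sas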

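【Solution 1】:

Taking Haikuo Bian's answer as a starting point, I've developed a slightly more complex algorithm that seems to work for all the test cases I've tried so far. It could probably be optimised further, but it processes 20000 rows in under a second on my PC while using only a few MB of memory. The input dataset doesn't need to be sorted in any particular order, but as written it assumes that every row is present at least once with id1 < id2.

Test cases: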

/* Original test case */
data have;
input id1 id2;
cards;
1     4
1     5
2     5
2     6
3     7
4     1
5     1
5     2
6     2
7     3
;
run;

/* Revised test case - all in one group with connecting row right at the end */
data have; 
input ID1 ID2; 
/*Make sure each row has id1 < id2*/
if id1 > id2 then do;
t_id2 = id2;
id2   = id1;
id1   = t_id2;
end;
drop t_id2;
cards; 
2 5 
4 8 
2 4 
2 6 
3 7 
4 1 
9 1 
3 2 
6 2 
7 3
;
run;

/*Full scale test case*/
data have;
    call streaminit(1); /*Seed the random number stream once, before the loop*/
    do _N_ = 1 to 20000;
        id1 = int(rand('uniform')*100000);
        id2 = int(rand('uniform')*100000);
        if id1 < id2 then output;
        t_id2 = id2;
        id2   = id1;
        id1   = t_id2;
        if id1 < id2 then output;
    end;
    drop t_id2; 
run;

Code:

option fullstimer;

data _null_;
    length id group 8;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('id');        
    rc = h.definedata('group');
    rc = h.definedone();

    array ids(2) id1 id2;
    array groups(2) group1 group2;

    /*Initial group guesses (greedy algorithm)*/
    do until (eof);
        set have(where = (id1 < id2)) end = eof;
        match = 0;
        call missing(min_group);
        do i = 1 to 2;
            rc = h.find(key:ids[i]);
            match + (rc=0);
            if rc = 0 then min_group = min(group,min_group);
        end;
        /*If neither id was in a previously matched group, create a new one*/
        if not(match) then do;
            max_group + 1;
            group = max_group;
        end;
        /*Otherwise, assign both to the matched group with the lowest number*/
        else group = min_group;
        do i = 1 to 2;
            id = ids[i];
            rc = h.replace();
        end;
    end;

    /*We now need to work through the whole dataset multiple times
      to deal with ids that were wrongly assigned to a separate group
      at the end of the initial pass, so load the table into a 
      hash object + iterator*/
    declare hash h2(dataset:'have(where = (id1 < id2))');
    rc = h2.definekey('id1','id2');
    rc = h2.definedata('id1','id2');
    rc = h2.definedone();
    declare hiter hi2('h2');

    change_count = 1;
    do while(change_count > 0);
        change_count = 0;
        rc = hi2.first();
        do while(rc = 0);
            /*Get the current group of each id from 
              the hash we made earlier*/
            do i = 1 to 2;
                rc = h.find(key:ids[i]);
                groups[i] = group;
            end;
            /*If we find a row where the two ids have different groups, 
              move the id in the higher group to the lower group*/
            if groups[1] < groups[2] then do;
                id = ids[2];
                group = groups[1];
                rc = h.replace();
                change_count + 1;           
            end;
            else if groups[2] < groups[1] then do;
                id = ids[1];
                group = groups[2];
                rc = h.replace();       
                change_count + 1;           
            end;
            rc = hi2.next();
        end;
        pass + 1;
        put pass= change_count=; /*For information only :)*/
    end;    

    rc = h.output(dataset:'want');

run;

/*Renumber the groups sequentially*/
proc sort data = want;
    by group id;
run;

data want;
    set want;
    by group;
    if first.group then new_group + 1;
    drop group;
    rename new_group = group;
run;

/*Summarise by # of ids per group*/
proc sql;
    select a.group, count(id) as FREQ 
        from want a
        group by a.group
        order by freq desc;
quit;

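Interestingly, the suggested optimisation of not checking the group of id2 during the initial pass when id1 has already matched actually slows things down a little in this extended algorithm, because more work then has to be done in the subsequent passes whenever id2 is in a lower-numbered group. E.g. output from a trial run I did earlier:

With the "optimisation":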

 pass=0 change_count=4696
 pass=1 change_count=204
 pass=2 change_count=23
 pass=3 change_count=9
 pass=4 change_count=2
 pass=5 change_count=1
 pass=6 change_count=0

 NOTE: DATA statement used (Total process time):
       real time           0.19 seconds
       user cpu time       0.17 seconds
       system cpu time     0.04 seconds
       memory              9088.76k
       OS Memory           35192.00k

Without:

 pass=0 change_count=4637
 pass=1 change_count=182
 pass=2 change_count=23
 pass=3 change_count=9
 pass=4 change_count=2
 pass=5 change_count=1
 pass=6 change_count=0

 NOTE: DATA statement used (Total process time):
       real time           0.18 seconds
       user cpu time       0.16 seconds
       system cpu time     0.04 seconds
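
To make the two phases easier to follow, here is a rough Python rendering of the same idea: a greedy first pass, then repeated passes that move the id in the higher-numbered group into the lower-numbered one until a pass makes no changes. It is only a sketch, not a translation of the DATA step: the name `propagate_groups` is made up, a plain dict stands in for the hash object, and the rows are visited in list order rather than hash-iterator order.

```python
def propagate_groups(pairs):
    """Greedy first pass, then repeated relabelling passes until nothing changes."""
    group = {}       # id -> current group number (stands in for the hash object)
    max_group = 0

    # Initial greedy pass: reuse the lowest group seen on the row, else open a new one
    for a, b in pairs:
        seen = [group[x] for x in (a, b) if x in group]
        if seen:
            g = min(seen)
        else:
            max_group += 1
            g = max_group
        group[a] = group[b] = g

    # Repeated passes: pull the id in the higher group down into the lower group
    passes = 0
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            if group[a] < group[b]:
                group[b] = group[a]
                changed = True
            elif group[b] < group[a]:
                group[a] = group[b]
                changed = True
        passes += 1
    return group, passes


# Revised test case from above, with each pair already ordered so that id1 < id2
pairs = [(2, 5), (4, 8), (2, 4), (2, 6), (3, 7),
         (1, 4), (1, 9), (2, 3), (2, 6), (3, 7)]
group, passes = propagate_groups(pairs)
```

After the greedy pass, id 8 is still stranded in its own group, because the row linking 4 to the main group arrives after the row 4 8; the first relabelling pass fixes that, and all nine ids end up in group 1.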

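【Discussion】:

I've fixed a bug in this answer that was causing ID groups that should have been separate to be treated as one big group.
Does this solution only work if the reverse of each observation is also present?
It seems to work fine, and it matches my python solution.
It is set up to ignore rows where id1 >= id2. If you have a pair with id1
1) Thanks! This is really useful and I've learned a lot from it! 2) You could just create dummyid1=min(id1,id2) and dummyid2=max(id1,id2).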
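
The min/max suggestion in the last comment can be sketched as follows, again in Python rather than SAS. The function name `normalise_pairs` is hypothetical, and dropping rows where both ids are equal is an assumption that mirrors the answer's filter, which ignores rows with id1 >= id2.

```python
def normalise_pairs(pairs):
    """Order each pair as (smaller, larger); drop self-pairs and duplicates."""
    seen = set()
    result = []
    for a, b in pairs:
        if a == b:
            continue  # a row pairing an id with itself links nothing
        pair = (min(a, b), max(a, b))
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


raw = [(2, 5), (4, 8), (4, 1), (9, 1), (5, 2), (7, 7)]
clean = normalise_pairs(raw)
```

With the rows normalised this way, the input no longer needs to contain the reversed copy of every pair for a filter on id1 < id2 to keep one copy of each link.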

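【Solution 2】:

As one commenter mentioned, a hash does seem to be a workable approach. In the code below, 'id' and 'group' are maintained in a hash table, and a new 'group' is only added when no 'id' match is found for the entire row. Note that 'do over' is an undocumented feature; it can easily be replaced with a bit more coding.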

data have;
    input ID1   ID2;
    cards;
1     4
1     5
2     5
2     6
3     7
4     1
5     1
5     2
6     2
7     3
;

data _null_;
    if _n_=1 then
        do;
            declare hash h(ordered: 'a');
            h.definekey('id');
            h.definedata('id','group');
            h.definedone();
            call missing(id,group);
        end;

    set have end=last;
    array ids id1 id2;
    do over ids;
        rc=sum(rc,h.find(key:ids)=0);

        /*you can choose to 'leave' the loop here when first h.find(key:ids)=0 is met, for the sake of better efficiency*/
    end;

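    /*h.find() overwrites group with the matched value, so count new groups
      in a separate retained variable to avoid reusing an existing number*/
    if not rc > 0 then
        do;
            max_group+1;
            group=max_group;
        end;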

    do over ids;
        id=ids;
        h.replace();
    end;
    if last then rc=h.output(dataset:'want');
run;

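【Discussion】:

This is more or less the answer I would have posted. However, note that it depends on the sort order of the input dataset - e.g. swapping obs 1 and 3 results in incorrect output. Also, you could optimise a bit further by only reading in obs with id1 < id2, as the asker says that each pair is present twice.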
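Another example - sorting the input dataset (have) by descending ID1 ID2 produces 3 groups. I think the hash concept is workable, but it needs further passes to merge groups that look separate at first, until a linking observation is found.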
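I basically agree with your first comment; this kind of data manipulation is always messy, so there has to be some sort of business rule, or at least a pre-sort on one of the IDs, and then it will be fine. As for your second comment, I'd agree if there were only two IDs; however, noting the asker's follow-up comment that there may be "more than 2", I don't think this optimisation is worth the extra coding effort.
Thanks! So I need to sort by ID1 and then ID2 first for this to work?
When I change the data to: data have; input ID1 ID2; cards; 2 5 4 8 2 4 2 6 3 7 4 1 9 1 3 2 6 2 7 3 ; it no longer works. That is, 5 ends up in group 1 even though they are all in the same group. It never gets updated. I suppose my example doesn't follow the rule that the reverse of every observation is also in the data. Would that make a difference?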
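
The failure described in the last comment can be reproduced with a small Python sketch of a single-pass greedy assignment. This is an approximation of the approach in this answer, not the DATA step itself: `single_pass_groups` is a made-up name, a dict stands in for the hash, and new groups are numbered with a separate counter.

```python
def single_pass_groups(pairs):
    """One pass only: a row joins the group of an id already seen, else opens a new group."""
    group = {}
    next_group = 0
    for a, b in pairs:
        found = [group[x] for x in (a, b) if x in group]
        if found:
            g = found[-1]  # as with the hash lookup, the last id found decides the group
        else:
            next_group += 1
            g = next_group
        group[a] = group[b] = g
    return group


# The rows from the comment above, in the order given
pairs = [(2, 5), (4, 8), (2, 4), (2, 6), (3, 7),
         (4, 1), (9, 1), (3, 2), (6, 2), (7, 3)]
group = single_pass_groups(pairs)
```

Here 2 and 5 start out together in group 1, then the row 2 4 moves 2 into group 2 and nothing ever revisits 5, so 5 is left behind in group 1 while every other id ends in group 2. Only the ids on the current row are relabelled, which is why a single pass depends on row order.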

【Solution 3】:

Please try the code below.

data have;
input ID1 ID2;
datalines;
1     4
1     5
2     5
2     6
3     7
4     1
5     1
5     2
6     2
7     3
;
run;

* Finding repeating in ID1;

proc sort data=have;by id1;run;

data want_1;

    set have;
    by id1;

    attrib flagrepeat length=8.;

    if not (first.id1 and last.id1) then flagrepeat=1;
    else flagrepeat=0;
run;

* Finding repeating in ID2;

proc sort data=want_1;by id2;run;

data want_2;
    set want_1;
    by id2;

    if not (first.id2 and last.id2) then flagrepeat=1;

run;

proc sort data=want_2 nodupkey;by id1 ;run;

data want(drop= ID2 flagrepeat rename=(ID1=ID));
    set want_2;
    attrib Group length=8.;

    if(flagrepeat eq 1) then Group=1;
    else Group=2;
run;

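Hope this answer helps.

【Discussion】:

Thanks for the reply. Unfortunately I don't think this will work, because there will be more than 2 groups in the final data. Am I right that your solution is for 2 groups?
@DVL, yes, this code only works for two groups. Could you provide a scenario with more groups?
For example, there could be any number of groups: 1 2 / 2 1 / 3 4 / 4 3 / 5 6 / 6 5. Then group 1 is 1,2, group 2 is 3,4 and group 3 is 5,6.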
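
To illustrate that the number of groups need not be fixed in advance, here is a short Python sketch that walks the pairs as a graph with a breadth-first search and numbers each connected set of ids as it is discovered. The function name `components` and the choice to number groups by smallest starting id are assumptions for the example.

```python
from collections import defaultdict, deque


def components(pairs):
    """Label every id with a group number; linked ids get the same number."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    group = {}
    next_group = 0
    for start in sorted(neighbours):
        if start in group:
            continue  # already reached from an earlier id
        next_group += 1
        group[start] = next_group
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for other in neighbours[current]:
                if other not in group:
                    group[other] = next_group
                    queue.append(other)
    return group


# The six rows from the comment above
pairs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
group = components(pairs)
```

For these rows the result is three groups: 1 and 2 in group 1, 3 and 4 in group 2, 5 and 6 in group 3.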