SAS --- Merge Data Based on Subject ID and Date (Within and Across Files): Answers

【Question title】: SAS --- Merge Data Based On subject ID and Date (Within and Across Files)
【Posted】: 2019-12-15 16:45:37
【Question description】:

Here are two sample SAS datasets (both contain fake data).

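I don't distinguish between filtered and master here, but creating a filter is easy (any random filter that restricts the dataset to a smaller one works as an example; it even works if you don't distinguish them at all: just copy the dataset and name one copy master and the other filtered).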

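I have two datasets like these for studying interfacility transfers, and my datasets are very large. I can do everything in SAS, but it is very, very slow :((( I will show my code here, but I am looking for ways to improve the run time.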

  master_inpatient
   ID   admsn_dt    thru_dt      prvdr_num       
    341   2013-04-01  2013-04-02    G
    230   2013-06-01  2013-06-03    I
    232   2013-07-31  2013-07-31    F
    124   2013-04-29  2013-04-29    C
    232   2013-07-31  2013-08-20    Q

  filtered_inpatient
   ID   admsn_dt    thru_dt      prvdr_num       
    341   2013-04-01  2013-04-02    G
    232   2013-07-31  2013-07-31    F
    232   2013-07-31  2013-08-20    Q

   master_outpatient
   ID     thru_dt     prvdr_num
    348   2013-09-23   Z
    124   2013-04-29   A
    331   2013-06-14   G
    439   2013-02-01   B
    331   2013-06-14   D

   filtered_outpatient
   ID     thru_dt     prvdr_num
    124   2013-04-29   A
    331   2013-06-14   G
    439   2013-02-01   B
    331   2013-06-14   D

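I have two master datasets, an inpatient dataset and an outpatient dataset, and two filtered datasets, which apply a diagnosis filter to the master datasets (for example, keeping only patients diagnosed with tuberculosis), so they are shorter than the master datasets.

ID is the patient ID, admsn_dt is the date you were admitted, and thru_dt is the date you were discharged or transferred. Outpatient has only thru_dt, because in an outpatient setting you are not admitted to receive care.

Consider the four types of transfer that can occur between the INPATIENT and OUTPATIENT datasets: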

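From the outpatient setting (ER) to the inpatient setting,
from the inpatient setting to the outpatient setting (ER),
from the outpatient setting (ER) to the outpatient setting (ER), and
from the inpatient setting to the inpatient setting.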

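I want the filtered datasets (filtered_inpatient or filtered_outpatient) to be the source and the master datasets (master_inpatient and master_outpatient) to be the destination, because the patient has to meet the diagnosis criteria at the source, and after that we only care about where he/she was transferred (the patient does not need to have that diagnosis at the destination).

In summary, the four transfer types are:

If outpatient → inpatient: filtered_outpatient(ID, thru_dt) → master_inpatient(ID, admsn_dt)
If outpatient → outpatient: filtered_outpatient(ID, thru_dt) → master_outpatient(ID, thru_dt)
If inpatient → inpatient: filtered_inpatient(ID, thru_dt) → master_inpatient(ID, admsn_dt)
If inpatient → outpatient: filtered_inpatient(ID, thru_dt) → master_outpatient(ID, thru_dt)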

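What I want to do is create a dataset of same-day/next-day transfers based on these conditions:

For each person, the prvdr_num (provider number) differs and the date difference is at most 1 day (0 or 1).
Compute transtype, which indicates the type of transfer. For example, inpatient to outpatient is inpout. The others are inpinp, outout, and outinp.

The final dataset should look like this: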

   df3
   ID   fromdate     todate     from_prvdr  to_prvdr    d     transtype
    124   2013-04-29   2013-04-29  C           A          0      inpout
    232   2013-07-31   2013-07-31  F           Q          0      inpinp
    331   2013-06-14   2013-06-14  G           D          0      outout

Another thing: when matching within a file, you are likely to get results like this:

ID   fromdate     todate       from_prvdr    to_prvdr
1    3/30/2011    3/31/2011    43291         48329
1    3/31/2011    3/30/2011    48329         43291

OR 

ID   fromdate     todate       from_prvdr    to_prvdr
1    3/31/2011    3/31/2011    43291         48329
1    3/31/2011    3/31/2011    48329         43291

(In this latter case I can just exclude duplicate by date later in R, but I need to get rid of the first case)

Here is what I tried (and it works).

/* This is an example of outpatient --> inpatient. */
/* All variables in the master datasets have an i prefix. */

proc sort data=etl.master_inpatient;
    by iID iadmsn_dt;
run;

proc sort data=etl.filtered_outpatient;
    by ID thru_dt;
run;

data fnl.matchdate_inpinp;
    set etl.master_inpatient end=eof;
    /* For each master row, read every filtered row by direct access */
    do p = 1 to num;
        set etl.filtered_outpatient nobs=num point=p;
        if iID = ID then do;
            d = abs(iadmsn_dt - thru_dt);
            put iID= ID= iadmsn_dt= thru_dt= d=;
            /* Keep same-day and next-day matches only */
            if d <= 1 then output;
        end;
        else continue;
    end;
    put '===========================';
    if eof then stop;
run;

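The code runs without errors, but I have to do this separately for each of the four transfer types and then merge the results in R. It took more than two days to run one year of data, and I really want something more efficient because I have 8 years of data.

Question: what techniques can speed up the process of creating the transfer dataset?

Also, as I said, when matching within a file we are likely to get some duplicate results (as described above), and I would really like that to be resolved.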

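【Comments】:

How many rows and columns are in the huge datasets? Are there any indexes on the datasets?

Tags: loops date join merge sas

【Solution 1】:

So, for each record in the master inpatient file, you are looping through the entire filtered outpatient dataset. That is essentially a Cartesian product, which is why it takes so much time. I can see several places where efficiency can be improved: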

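You can change set etl.filtered_outpatient nobs = num point = p; to set etl.filtered_outpatient (where = (ID = iID and (iadmsn_dt - 1 <= thru_dt <= iadmsn_dt + 1))) nobs = num; and remove the if condition below it. You are then still looping, but only over the records whose ID matches and whose date is within 1 day, which gains some efficiency. If filtered_outpatient can be indexed on ID and thru_dt, it becomes very fast.
You can use a PROC SQL Cartesian product, which will be somewhat faster (this is based on my experience).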

proc sql;
    create table matchdate_inpinp as
    select a.*, b.* /* Or what columns you need to keep */
    from master_inpatient a, filtered_outpatient b
    where b.ID = a.iID and (a.iadmsn_dt - 1 <= b.thru_dt <= a.iadmsn_dt + 1);
quit;

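Again, an index will make this very fast.

You can split the work into parts and submit them in parallel. A simple approach is to split it into 10 parts based on the first digit of the ID (assuming the IDs are evenly distributed; otherwise the parts will take different amounts of time to run).

So you can write a macro like the following: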
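
As a minimal sketch of the indexing suggestion, a composite index on ID and thru_dt can be built with PROC DATASETS. The etl library name is taken from the question's code, and the index name id_thru is a placeholder chosen here. Whether SAS actually uses the index depends on the query and the data, so treat this as a starting point rather than a guaranteed speed-up.

```sas
/* Sketch: build a composite index on the lookup columns.             */
/* Assumes filtered_outpatient lives in the etl library, as in the    */
/* question; the index name id_thru is hypothetical.                  */
proc datasets library=etl nolist;
    modify filtered_outpatient;
    index create id_thru = (ID thru_dt);
quit;

/* Ask SAS to write a note to the log when an index is used. */
options msglevel=i;
```

Create the index after any in-place PROC SORT of the dataset, since PROC SORT will not re-sort an indexed dataset in place unless FORCE is specified, and FORCE drops the index.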

%macro process(part);

proc sql;
    create table matchdate_inpinp_part&part as
    select a.*, b.* /* Or what columns you need to keep */
    from master_inpatient a, filtered_outpatient b
    where b.ID = a.iID and
          (a.iadmsn_dt - 1 <= b.thru_dt <= a.iadmsn_dt + 1)
          and substr(b.ID, 1, 1) = "&part";
          /* If ID is numeric you may need a type conversion */
quit;

%mend process;

%process(0); /* Call multiple times with different values */

This spreads the work across 10 parts and should speed things up considerably.
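
On the mirrored rows the question mentions for within-file matching, one possible approach is to keep the sign of the date difference instead of taking its absolute value, so a pair only matches in the source-to-destination direction. The sketch below does this for the inpatient → inpatient case and also applies the "different provider" condition and the df3 column layout. It assumes the unprefixed column names from the sample tables and numeric SAS date values; the output name work.inpinp is hypothetical.

```sas
/* Sketch: inpatient -> inpatient transfers in the df3 layout.         */
/* Assumes unprefixed column names as in the sample tables and numeric */
/* SAS dates; the output table work.inpinp is a hypothetical name.     */
proc sql;
    create table work.inpinp as
    select a.ID,
           a.thru_dt   as fromdate format=yymmdd10.,
           b.admsn_dt  as todate   format=yymmdd10.,
           a.prvdr_num as from_prvdr,
           b.prvdr_num as to_prvdr,
           b.admsn_dt - a.thru_dt as d,   /* signed, not abs() */
           'inpinp'    as transtype
    from filtered_inpatient as a
         inner join master_inpatient as b
         on a.ID = b.ID
    where a.prvdr_num ne b.prvdr_num
      and b.admsn_dt - a.thru_dt between 0 and 1;
quit;
```

Because d must be 0 or 1, a reversed row whose destination date precedes its source date is dropped. Same-day pairs still appear in both directions and need a separate tie-break, which the question says can be handled later.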

【Comments】: