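Efficiently joining/merging based on matching part of a string

【Question title】: Efficiently joining/merging based on matching part of a string
【Posted】: 2013-09-09 23:40:02
【Question】:

I am trying to join two tables based on whether a string from the first table is contained in part of a long string in the second table. I am using PROC SQL in SAS, but could also use a data step instead of a SQL query.

This code works fine on smaller datasets, but it quickly gets bogged down because it has to make a huge number of comparisons. A simple equality check would be fine, but having to use the index() function makes it hard.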

proc sql noprint;
  create table matched as
  select A.*, B.* 
  from  search_notes as B,
        names as A
  where index(B.notes,A.first) or 
        index(B.notes,A.last)
  order by A.name, B.id;
quit;

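B.notes is a 2000-character block of text (sometimes completely filled), and I am looking for any result that contains the first or last name from A.

I don't think I gain any speed by doing it in two steps, since it already has to compare every row of A with every row of B (so checking both the first and the last name isn't the bottleneck).

When I run it, I get "NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized." in my log. Running it with A = 4,000 observations and B = 100,000 observations takes 30 minutes to produce about 1,000 matches.

Is there any way to optimize this?

【Comments】:

To make it more SQL-like, I tried adding % before and after A.First and A.Last and then using where B.notes LIKE A.first, which produced the same note and the same long run time. I was hoping that using a SQL feature rather than a SAS function would let it be optimized, but I guess not.
Are you trying to do a left join, an inner join, or a true Cartesian product? Do you only want to join data from B onto A when B.NOTES contains either field?
Yes, I only want the result set where the fields from A are contained in B.notes (accepting that there may be multiple results on either side, since multiple things can match).
OK. Let me play around with it. This is a tricky one. It is a problem in any SQL processor, not just SAS.
If it makes things easier, I can ensure that A.first and A.last are each unique in A (each is unique on its own, so the combination is unique too).

Tags: sql sas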

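【Solution 1】:

A Cartesian product may be the best you can do with your data, but you could try the following. What I am doing is using CALL EXECUTE() in a data step to write the matching logic into a second, generated data step. That means you only pass through each table once. However, the generated data step will contain 4000 IF/THEN clauses. Doing this takes the run time on my sample data from 55 seconds down to 40 seconds. If that ratio holds, your 30 minutes would drop to roughly 22 minutes.

I'll leave the question open. Maybe someone can come up with a better approach.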

%let n=50;
data B;
format notes $&n..;
choose = "ABCDEFGHIJLKMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
do j=1 to 9000000;
    notes = "";
    do i=1 to floor(5 + ranuni(123)*(&n-5));
        r = floor(ranuni(123)*62+1);
        notes = catt(notes,substr(choose,r,1));

    end;
    output;
    drop r choose i;
end;
run;

data a;
choose = "ABCDEFGHIJLKMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
format first last $2.;
do i=1 to 62 by 2;
    first = strip(substr(choose,i,1));
    first = catt(first,first);
    last =  strip(substr(choose,i+1,1));
    last = catt(last,last);
    output;
end;
drop choose ;
run;

proc sql noprint;
  create table matched as
  select A.*, B.* 
  from  B as B,
        A as A
  where index(B.notes,A.first) or 
        index(B.notes,A.last)
  order by B.notes, a.i;
quit;

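options nosource;
/* Generate one data step (matched2) containing an IF block per row of a */
data _null_;
set a end=l;
/* First row of a: open the generated step */
if _n_ = 1 then do;
    call execute("data matched2; set B;");
    call execute("format First Last $2. i best.;");
end;

format outStr $200.;
/* Test this row's first/last name against the current notes value */
outStr = "if index(notes,'" || first || "') or index(notes,'" || last || "') then do;";
call execute(outStr);

/* On a hit, write the matching name and its key to the output row */
outStr = "first = '" || first || "';";
call execute(outStr);
outStr = "last = '" || last || "';";
call execute(outStr);
outStr = "i = " || i || ";";
call execute(outStr);
call execute("output; end;");

/* Last row of a: close the generated step; it runs after this step ends */
if l then do;
    call execute("run;");
end;
run;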

proc sort data=matched2;
by notes i;
run;
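
To make the CALL EXECUTE approach more concrete, here is a sketch of the step that the data _null_ above should queue up, shown for only the first two rows of the sample table a (i=1 with AA/BB and i=3 with CC/DD). The real generated step has one IF block per row of a, and the spacing of the i assignment will differ because i is converted from numeric to character during concatenation.

```sas
/* Sketch of the generated code; assumes the sample table B created above */
data matched2; set B;
format First Last $2. i best.;
/* block generated from row i=1 of a */
if index(notes,'AA') or index(notes,'BB') then do;
first = 'AA';
last = 'BB';
i = 1;
output; end;
/* block generated from row i=3 of a */
if index(notes,'CC') or index(notes,'DD') then do;
first = 'CC';
last = 'DD';
i = 3;
output; end;
run;
```

Because the blocks are independent IFs rather than an IF/ELSE chain, a notes value that matches several names is written out once per match, which mirrors what the SQL join returns.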

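【Comments】:

【Solution 2】:

This doesn't sound like a great fit for PROC SQL. If I understand correctly, you want to compare every row in search_notes with every row in names (hence the Cartesian product). A more traditional data step program may be easier to understand, and possibly more efficient: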

data matched;
   set search_notes;
   do _i_=1 to nobs;
      set names point=_i_ nobs=nobs;
      if index(notes,first) 
      or index(notes,last) then output;
      end;
   drop _i_;
run;
proc sort data=matched;
   /* BY lists are space-separated; keys follow the question's ORDER BY */
   by name id;
run;

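【Comments】:

This is probably no better than the SQL solution, and may even be worse. It is equivalent to what I did, except that I flattened the inner loop to reduce overhead.

【Solution 3】:

This is a partial answer that made it run 4-5 times faster, but it is not ideal (it helped in my case, but won't necessarily work for the general case of optimizing a Cartesian product join).

I originally had 4 separate index() calls like the ones in my example (my simplified example has 2, for A.first and A.last).

I was able to refactor all 4 index() calls (plus a 5th I was about to add) into a regular expression that solves the same problem. It doesn't return the identical result set, but I think it actually returns better results than the 5 separate index() calls, because you can specify word boundaries.

In the data step where I clean up the names for matching, I create the following pattern: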

pattern = cats('/\b(',substr(upcase(first_name),1,1),'|',upcase(first_name),').?\s?',upcase(last_name),'\b/');

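This should create a regex along the lines of /\b(F|FIRST).?\s?LAST\b/, which will match things like F.Last, First Last, flast@email.com and so on (there are combinations it doesn't pick up, but I was only concerned with the combinations I observed in my data). Using '\b' also rules out cases where FLAST merely coincides with the start or end of a word (for example "Edward Lo" being matched to "Eloquent"), which I found hard to avoid with index().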
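
As a quick way to check that behaviour, here is a minimal sketch that builds the pattern for one hypothetical name (the "Edward Lo" from the remark above) and tries it against a few made-up strings. The test strings are upper-cased because the pattern is built with upcase() and carries no i modifier, so as written it is case-sensitive. The strip() around the pattern is only a precaution against trailing blanks.

```sas
data _null_;
   length first_name last_name $20 pattern $100 text $40;
   /* hypothetical name, not taken from the original data */
   first_name = 'Edward';
   last_name  = 'Lo';
   /* same expression as the pattern assignment above */
   pattern = cats('/\b(',substr(upcase(first_name),1,1),'|',upcase(first_name),').?\s?',upcase(last_name),'\b/');
   put pattern=;
   re = prxparse(strip(pattern));
   /* made-up test strings; hit is the match position, 0 means no match */
   do text = 'E.LO', 'EDWARD LO', 'ELO@EMAIL.COM', 'ELOQUENT';
      hit = prxmatch(re, text);
      put text= hit=;
   end;
run;
```

If the pattern behaves as described, the log should show a nonzero hit for the first three strings and hit=0 for ELOQUENT, since there is no word boundary between LO and the Q that follows it.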

I then perform my SQL join like this:

proc sql noprint;
create table matched as
  select  B.*, 
          prxparse(B.pattern) as prxm, 
          A.* 
  from  search_text as A,
        search_names as B
  where prxmatch(calculated prxm,A.notes)
  order by A.id;
quit;

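Being able to compile the regex once for each name in B and then run it against every piece of text in A seems to be a lot faster than several index() calls (I'm not sure how a regex compares with a single index() call).

Running it with A = 250,000 obs and B = 4,000 obs, the index() approach took about 90 minutes of CPU time, whereas doing the same thing with prxmatch() took only 20 minutes of CPU time.

【Comments】: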