使用“库（聚合）”计算两个序列中匹配元素的复杂性答案

【问题标题】：Complexity of counting matching elements in two sequences using `library(aggregate)`使用“库（聚合）”计算两个序列中匹配元素的复杂性
【发布时间】：2021-04-28 14:13:42
【问题描述】：

我们想计算恰好代表 DNA 序列的两个（可能很长）字符串之间的对应关系。这些序列是字符列表，其中字符取自a,c,t,g,'_'，其中“_”是一个“不知道”占位符，它从不对应于任何东西，甚至它本身。在这种情况下，我们使用library(aggregate)（感谢 CapelliC 的想法）：

match(Seq1,Seq2,Count) :-
   aggregate_all(count,
      (
         nth1(Pos,Seq1,X),
         nth1(Pos,Seq2,X),
         memberchk(X,[a,c,g,t])
      ),
      N).

这种方法可以与一种“直接”的方法进行比较，在这种方法中，人们将设置一个（尾递归）递归，该递归只是串联地遍历两个序列并成对比较元素，同时计数。

由于序列可能非常大，因此算法复杂性变得有些重要。

人们会期望，n = length(sequence) 并且两个序列的长度相同：

直截了当的方法：复杂度为 O(n)
聚合方法：复杂度为 O(n²)

上述算法的（时间和空间）复杂度是多少？为什么？

测试代码

作为对上述内容的补充，一个基于 SWI-Prolog 的 plunit 测试代码块：

:- begin_tests(atcg).

wrap_match(String1,String2,Count) :-
   atom_chars(String1,Seq1),
   atom_chars(String2,Seq2),
   fit(Seq1,Seq1,0,Count).

test("string 1 empty",nondet) :-
   wrap_match("atcg","",Count), 
   assertion(Count == 0).
      
test("string 2 empty") :-
   wrap_match("","atcg",Count), 
   assertion(Count == 0).

test("both strings empty") :-
   wrap_match("","",Count), 
   assertion(Count == 0).

test("both strings match, 1 char only") :-
   wrap_match("a","a",Count),   
   assertion(Count == 1).

test("both strings match") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),   
   assertion(MatchCount == 12).

test("both strings match with underscores") :-
   wrap_match("_TC_ATCG_TCG","_TC_ATCG_TCG",Count),   
   assertion(MatchCount == 9).

test("various mismatches 1") :-
   wrap_match("atcgatcgatcg","atcgatcgatcg",Count),   
   assertion(MatchCount == 8).

test("various mismatches with underscores") :-
   wrap_match("at_ga_cg__cg","atcgatcgatcg",Count),   
   assertion(Count == 8).
   
:- end_tests(atcg).

所以：

?- run_tests.
% PL-Unit: atcg ........ done
% All 8 tests passed
true.

【问题讨论】：

重要的是要注意，在引发此线程的问题中，序列可能具有不同的长度。因此我的双 nth1/2 使用...

标签： prolog time-complexity

【解决方案1】：

经验信息

在使用下面的代码进行一些手动数据收集（需要自动化的东西）之后，它会输出经过的时间和向控制台做出的推断次数：

gimme_random_sequence(Length,Seq) :-
   length(Seq,Length),
   maplist(
      [E]>>(random_between(0,3,Ix),nth0(Ix,[a,t,c,g],E)),
      Seq).
           
how_fast(Length) :-
   gimme_random_sequence(Length,Seq1),
   gimme_random_sequence(Length,Seq2),
   time(match(Seq1,Seq2,_)).

...以及 LibreOffice Calc 中的一些图形摸索（我的 ggplot 技能生疏了），我们有经验数据表明该算法的成本是

O((长度(序列))²)。

Count,Inferences,Seconds,milliseconds,megainferences
1000,171179,0.039,39,0.171179
2000,675661,0.097,97,0.675661
3000,1513436,0.186,186,1.513436
4000,2684639,0.327,327,2.684639
5000,4189172,0.502,502,4.189172
6000,6027056,0.722,722,6.027056
7000,8198103,1.002,1002,8.198103
8000,10702603,1.304,1304,10.702603
9000,13540531,1.677,1677,13.540531
10000,16711607,2.062,2062,16.711607
11000,20216119,2.449,2449,20.216119
20000,66756619,8.091,8091,66.756619
30000,150134731,17.907,17907,150.134731
40000,266846773,32.012,32012,266.846773
50000,416891749,52.942,52942,416.891749
60000,600269907,74.103,74103,600.269907

【讨论】：

您是否加载了 library(apply_macros) 以消除元调用和 lambda 表达式开销？
请始终在log-log scale 处绘制此类图，以便准确。 :)
@PauloMoura 啊...不。但这将是一个恒定的乘数。得到 time/1 后我将不得不重试以输出机器可读的数据。
@WillNess 好建议。我想拟合一个多项式（有一个 LibreOffice 插件），但我还没有准备好完成那个教程。
所以结论是，它是二次的； aggregate not 不会自动为我们将两个 sequential nth1s 融合（连接在一起）做自己的解释，我们必须自己做，最终得到member/4，即两个member/2s 在两个输入列表上协同工作，“并行”处理它们。

【解决方案2】：

永远不要在 Prolog 中使用避免回溯的函数式编程习惯用法，例如 maplist/4。这里pair_member/4 和match3/3 应该快一点。

match2(Seq1, Seq2, Count) :-
   (   maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
   ->  aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
   ;   Count = 0 ).

pair_member(X, Y, [X|_], [Y|_]).
pair_member(X, Y, [_|L], [_|R]) :-  
    pair_member(X, Y, L, R).

match3(Seq1, Seq2, Count) :-
   aggregate_all(count,
         (pair_member(X, X, Seq1, Seq2), X \= '_'), Count).

gimme_random_sequence(Length, Seq) :-
   length(Seq, Length),
   maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).

test(N) :-
   gimme_random_sequence(N, S1),
   gimme_random_sequence(N, S2),
   time(match2(S1, S2, Count)),
   time(match3(S1, S2, Count)).

哇！它的速度提高了 10 倍！感谢 SWI-Prolog 的天才
编译pair_member/4中的尾递归：

/* SWI-Prolog 8.3.21, MacBook Air 2019 */
?- set_prolog_flag(double_quotes, chars).
true.

?- X = "abc".
X = [a, b, c].

?- match2("_TC_ATCG_TCG","_TC_ATCG_TCG",Count).
Count = 9.

?- match3("_TC_ATCG_TCG","_TC_ATCG_TCG",Count).
Count = 9.

?- test(100000).
% 1,575,520 inferences, 0.186 CPU in 0.190 seconds (98% CPU, 8465031 Lips)
% 175,519 inferences, 0.018 CPU in 0.019 seconds (98% CPU, 9577595 Lips)
true.

编辑 29.04.2021：
哦，具有讽刺意味的是，分岔回溯仍然具有挑战性。
修复library(apply_macros) 的误用后，我得到：

?- test(100000).
% 374,146 inferences, 0.019 CPU in 0.019 seconds (99% CPU, 19379778 Lips)
% 174,145 inferences, 0.014 CPU in 0.014 seconds (99% CPU, 12400840 Lips)
true.

本地成员/2 是否有助于良好的地图列表解决方案性能？
但我应该做一个更好的测量，更长的时间。

开源：

序列匹配问题
https://gist.github.com/jburse/9fd22e8c3e8de6148fbd341817538ef6#file-sequence-pl

【讨论】：

“永远不要在 Prolog 中使用避免回溯的函数式编程习惯用法”。贵宾犬的内核难道不是：“避免构建中间结构，而是使用流处理，其中回溯可以帮助您将较小的相关“数据窗口”保留在堆栈上，同时在第一个分支中为您提供结果并继续往下走另一个分支中的流。”
误用library(apply_macros)的意思是：1）加载library(apply_macros)：use_module(library(apply_macros))，然后2）再次输入代码进行编译。
让我们continue this discussion in chat.

【解决方案3】：

我认为观察到复杂性 O(n²) 不是由于 聚合方法 本身，而是由于事实上，子目标nth1(Pos,Seq1,X), nth1(Pos,Seq2,X) 表现为“嵌套循环”（在序列的大小 n 中）。

因此，应该有可能创建另一种算法，即使使用聚合，也可以具有复杂度O(n)，只要“嵌套循环”被消除。

比较算法

% Original algorithm: Complexity O(n²)

match1(Seq1, Seq2, Count) :-
   aggregate_all(count,
      (  nth1(Pos, Seq1, X),
         nth1(Pos, Seq2, X),
         memberchk(X, [a,c,g,t]) ),
      Count).

% Proposed algorithm: Complexity O(n)

match2(Seq1, Seq2, Count) :-
   (   maplist([X,Y,X-Y]>>true, Seq1, Seq2, Seq3)
   ->  aggregate_all(count, (member(X-X, Seq3), X\='_'), Count)
   ;   Count = 0 ).

gimme_random_sequence(Length, Seq) :-
   length(Seq, Length),
   maplist([E]>>(random_between(0,3,Ix), nth0(Ix, [a,t,c,g], E)), Seq).

test(N) :-
   gimme_random_sequence(N, S1),
   gimme_random_sequence(N, S2),
   time(match1(S1, S2, Count)),
   time(match2(S1, S2, Count)).

简单的实证结果

?- test(10000).
% 16,714,057 inferences, 1.156 CPU in 1.156 seconds (100% CPU, 14455401 Lips)
% 39,858 inferences, 0.000 CPU in 0.000 seconds (?% CPU, Infinite Lips)
true.

?- test(20000).
% 66,761,535 inferences, 4.594 CPU in 4.593 seconds (100% CPU, 14533123 Lips)
% 79,826 inferences, 0.016 CPU in 0.016 seconds (100% CPU, 5108864 Lips)
true.

?- test(40000).
% 266,856,213 inferences, 19.734 CPU in 19.841 seconds (99% CPU, 13522405 Lips)
% 159,398 inferences, 0.016 CPU in 0.015 seconds (104% CPU, 10201472 Lips)
true.

?- test(80000).
% 1,067,046,835 inferences, 77.203 CPU in 77.493 seconds (100% CPU, 13821291 Lips)
% 320,226 inferences, 0.047 CPU in 0.047 seconds (100% CPU, 6831488 Lips)
true.

2021 年 4 月 30 日编辑： nth1(I,S,X), nth1(I,S,X) 真的可以用作嵌套循环吗？

要查看此问题的答案是否为是，请考虑以下 nth/3 的简单实现，它使用全局标志计算找到每个解决方案所需的轮数： p>

nth(Index, List, Item) :-
   (   var(Index)
   ->  nth_nondet(1, Index, List, Item)
   ;   integer(Index)
   ->  nth_det(Index, List, Item)
   ).

nth_det(1, [Item|_], Item) :- !.
nth_det(Index, [_|Rest], Item) :-
   flag(rounds, Rounds, Rounds+1),
   Index1 is Index - 1,
   nth_det(Index1, Rest, Item).

nth_nondet(Index, Index, [Item|_], Item).
nth_nondet(Acc, Index, [_|Rest], Item) :-
   flag(rounds, Rounds, Rounds+1),
   Acc1 is Acc + 1,
   nth_nondet(Acc1, Index, Rest, Item).

要获得回合数，可以问：

?- flag(rounds,_,0), nth(5,[a,b,c,d,e],X), flag(rounds,Rounds,Rounds).
X = e,
Rounds = 4.

现在，使用这个谓词，我们可以创建一个谓词来计算目标nth(I,L,X), nth(I,L,X) 的轮数，用于不同长度的列表：

count_rounds :-
    forall(between(1, 10, N),
           ( Length is 10*N,
             count_rounds(Length, Rounds),
             writeln(rounds(Length) = Rounds)
           )).

count_rounds(Length, _) :-
    numlist(1, Length, List),
    flag(rounds, _, 0),
    nth(Index, List, Item),
    nth(Index, List, Item),
    fail.
count_rounds(_, Rounds) :-
    flag(rounds, Rounds, Rounds).

经验结果：

?- count_rounds.
rounds(10) = 55
rounds(20) = 210
rounds(30) = 465
rounds(40) = 820
rounds(50) = 1275
rounds(60) = 1830
rounds(70) = 2485
rounds(80) = 3240
rounds(90) = 4095
rounds(100) = 5050

正如我们所见，目标nth(I,L,X), nth(I,L,X) 计算了一个 n 阶方阵的一半（包括它的对角线）。因此，长度为 n 的列表的轮数为 rounds(n) = (n² + n)/2.因此，这个目标的时间复杂度是O(n²)。

备注库谓词nth1/3 的实现比谓词nth/3 的实现要高效一些，本实验考虑。尽管如此，目标nth1(I,S,X), nth1(I,S,X)的时间复杂度仍然是O(n²)。

【讨论】：

【解决方案4】：

这是@MostowskiCollapse 回答的后续，我已将 Gertjan van Noord 为 member/2 提供的相同优化应用到 pair_member/4，但我已将其重命名为 member/4。

member(X, Y, [XH|XT], [YH|YT]) :-
  member_(XT, YT, X, Y, XH, YH).

member_(_, _, X,Y, X,Y).
member_([XH|XT],[YH|YT], X,Y, _,_) :-
  member_(XT,YT, X,Y, XH,YH).

match4(Seq1, Seq2, Count) :-
  aggregate_all(count,
      (member(X, X, Seq1, Seq2), X \= '_'), Count).

test(N) :-
  gimme_random_sequence(N, S1),
  gimme_random_sequence(N, S2),
  %time(match2(S1, S2, Count)),
  time(match3(S1, S2, Count)),
  time(match4(S1, S2, Count)).

...

我得到长度为 1.000.000 的列表

% 1,751,758 inferences, 0.835 CPU in 0.835 seconds (100% CPU, 2098841 Lips)
% 1,751,757 inferences, 0.637 CPU in 0.637 seconds (100% CPU, 2751198 Lips)

也就是大约25%的增益...

【讨论】：

我想知道这种加速与 SWI-Prolog 的 Prolog VM 有何关系。是子句索引还是其他？毕竟 member_/6 有更多参数，但 SWI-Prologs 尾递归似乎可以处理它。
IIRC 的收益是由于避免了列表头的双重解包。所以我会说它与 Prolog VM 有关。