Paper Reading: [2019 TSE] A Theoretical and Empirical Analysis of Program Spectra Diagnosability

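Contents

Preface
1 Basic information

1.1 About Rui Abreu
1.2 About the first author, Alexandre Perez

2 What the paper says
3 Adjusting strategy in time (a brand-new way of reading papers)
4 Q&A

4.1 Questions
4.2 Answers

4.2.1 Answer to question 1
4.2.2 Answer to question 2
4.2.3 Answer to question 3

5 Strengths worth mentioning
6 Weaknesses
Summary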

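Preface

This morning I noticed that this paper kept showing up in the "new citations" alerts arriving in my Gmail, which surprised me. On a closer look, it turned out to be accepted by TSE, a top journal. So I decided it deserved a careful read, and here are my notes on it: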
A Theoretical and Empirical Analysis of Program Spectra Diagnosability

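1 Basic information

First, where to download the paper: well, it has not been made publicly available yet, and there is no open-access URL. So there are two options: 1) download it through the university library; 2) Sci-Hub should also work.

The author list surprised me; unexpected, and yet entirely predictable: a big name again. The authors are:
Alexandre Perez, Member, IEEE;
Rui Abreu, Senior Member, IEEE;
Arie van Deursen, Member, IEEE.

Rui Abreu is a very well-known name. I believe I wrote a dedicated introduction to him before, so this is a refresher. (If I can't remember the earlier one, so be it; I'll just learn it again.)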

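1.1 About Rui Abreu

1) Academic record: his Google Scholar profile is at https://scholar.google.com/citations?user=x25BFgEAAAAJ&hl=zh-CN&oi=ao

I have read almost all of the papers I boxed in red on a screenshot of that page. His total citation count is 3591, which shows his influence in academia.

The familiar fault localization tool GZoltar was also developed by him.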

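2) Homepage: his personal homepage is at http://www.ruimaranhao.com/

My first impression after browsing it: very geeky.
My second impression: he is fond of Twitter (not excessively, but he tweets or retweets every day).
What he follows on Twitter also seems to be mostly news about computing and software engineering. Impressive.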
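3) Research area: spectrum-based fault localization (SFL).
Also: Software Engineering, Debugging and Testing, Machine Learning, Green Computing, Security

In my next post I will go through all of his SFL papers, especially "A qualitative reasoning approach to spectrum-based fault localization", which combines SFL with additional information. Stepping outside SFL itself to improve fault localization is the right idea: only by stepping outside can its limitations be overcome.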

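1.2 About the first author, Alexandre Perez

His Google Scholar profile: https://scholar.google.com/citations?user=eARG7zYAAAAJ&hl=zh-CN&oi=sra

Honestly, he is very impressive: someone who built up slowly and then broke through strongly in the later stage, which is worth writing down.

First, he started his PhD in 2012 and spent six years on it.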

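At first he really did not publish much, but after 2016 the papers at good venues started to pile up:
A test-suite diagnosability metric for spectrum-based fault localization approaches (ICSE 2017)
Prevalence of single-fault fixes and its impact on fault localization (ICST 2017; this is only a CCF rank-C conference, but it seems hard to get into and carries real weight)
Leveraging Qualitative Reasoning to Improve SFL (IJCAI 2018)
A qualitative reasoning approach to spectrum-based fault localization (ICSE 2018)
A Theoretical and Empirical Analysis of Program Spectra Diagnosability (TSE 2019)

That is a hard record to match, and he is first author on all of them.
It seems the academic road is about perseverance: keep at it, then comes the breakthrough and the insight, and things get better and better.

These papers are all well written, and all seem worth reading.

The post after next will study this researcher's papers.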

2 What the paper says

The problem raised:

Current metrics for assessing the adequacy of a test-suite plainly focus on the number of components (be it lines, branches, paths) covered by the suite, but do not explicitly check how the tests actually exercise these components and whether they provide enough information so that spectrum-based fault localization techniques can perform accurate fault isolation.

How they solve it:

We propose a metric, called DDU, aimed at complementing adequacy measurements by quantifying a test-suite’s diagnosability, i.e., the effectiveness of applying spectrum-based fault localization to pinpoint faults in the code in the event of test failures.

Further explanation of how they solve it:

Our aim is to increase the value generated by creating thorough test-suites, so they are not only regarded as error detection mechanisms but also as effective diagnostic aids that help widely-used fault-localization techniques to accurately pinpoint the location of bugs in the system.

How well the metric works:

We have performed a topology-based simulation of thousands of spectra and have found that DDU can effectively establish an upper bound on the effort to diagnose faults. Furthermore, our empirical experiments using the Defects4J dataset show that optimizing a test suite with respect to DDU yields a 34% gain in spectrum-based fault localization report accuracy when compared to the standard branch-coverage metric.

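In short, the paper says the following.
Current metrics for assessing the adequacy of a test suite usually just look at how much of the program's components (e.g. lines, branches, paths) are covered. They do not explicitly check how the tests actually exercise those components, or whether the tests provide enough information for SFL techniques to locate the fault.

So we propose a metric called DDU. It aims to complement adequacy measurements by quantifying a test suite's diagnosability, that is, how effectively SFL can pinpoint faults in the code when tests fail.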
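To make "SFL on a program spectrum" concrete, here is a minimal sketch. It assumes the Ochiai similarity coefficient, which is one commonly used SFL ranking formula; the abstract does not say which formula the paper relies on, and the function name and toy spectrum are hypothetical.

```python
import math

def ochiai_scores(spectrum, failed):
    """Rank components by suspiciousness with the Ochiai coefficient.

    spectrum[t][c] is 1 if test t executed component c, else 0.
    failed[t] is True if test t failed.
    """
    total_failed = sum(failed)
    scores = []
    for c in range(len(spectrum[0])):
        # ef: failing tests that executed c; ep: passing tests that executed c
        ef = sum(1 for row, f in zip(spectrum, failed) if row[c] and f)
        ep = sum(1 for row, f in zip(spectrum, failed) if row[c] and not f)
        denom = math.sqrt(total_failed * (ef + ep))
        scores.append(ef / denom if denom else 0.0)
    return scores

# Toy spectrum (assumed data): 3 tests x 3 components, only test 0 fails.
spectrum = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
failed = [True, False, False]
scores = ochiai_scores(spectrum, failed)
ranking = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
print(ranking)  # component 1 is executed only by the failing test, so it ranks first
```

The point relevant to this paper: the ranking is only as good as the spectrum. If two components are always executed together, no formula can tell them apart, which is exactly the kind of information a pure coverage metric does not capture.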

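(After reading the DDU sentence above I was honestly lost, with no idea what it is or what it is for. The next passage is the finishing touch that cleared up my confusion at once. A top journal really is a top journal; writing like this lets even an outsider like me follow.) Our aim is to make thorough test suites more valuable, so that they are not only used to expose errors but also help current fault localization techniques locate faults. (Put this way, I think it is genuinely meaningful.)

In the evaluation, we ran a topology-based simulation of thousands of program spectra and found that DDU can effectively establish an upper bound on the effort needed to diagnose faults. Furthermore, our empirical experiments on Defects4J show that, compared with the standard branch-coverage metric, test suites optimized for DDU yield a 34% gain in SFL report accuracy.

I really admire how top-journal papers are written.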

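3 Adjusting strategy in time (a brand-new way of reading papers)

Actually, by this point (that is, after analyzing the abstract word by word), I already understand the core of this paper.

Now I only need to raise a few questions or doubts that interest me and then selectively look for the answers in the paper. Reading with a purpose like this is more efficient.

I think this is worth keeping as my paper-reading routine. Based on my experience so far, it feels like the most efficient way to read papers. If I stick with it and extend it to studying and experiments, the same logic should apply.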

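4 Q&A

4.1 Questions

Some questions I am interested in:

1) Which existing metrics do the authors criticize? (I am curious.)
2) How is the simulation carried out? (Why simulate instead of running experiments? Defects4J exists, so why is simulation still needed?)
3) How big an advance is this metric, really? My take is that it merely adds attention to program understanding into SFL, yet many fault localization techniques already do that; slicing, for instance, has some grasp of semantics. So how much does this contribute? I look forward to reading further. (My current understanding is surely wrong. I vaguely sense that the two have different emphases, because this metric says nothing about semantics and is instead tied entirely to SFL as a concrete technique. The idea feels unusual, like taking a side road to improve SFL. It also deepens my impression that SFL has hit a bottleneck: it is an imprecise statistical method with inherent limitations, and improving it is already hard. In short, the question is worth asking.)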

4.2 Answers

4.2.1 Answer to question 1

This paper discusses the importance of measuring diagnosability of software, i.e., the ability of a program and its test suite to effectively and accurately locate faults when errors arise. It proposes DDU, a new metric for evaluating the diagnosability of a test-suite when applying spectrum-based fault localization approaches.

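The angle of the writing is really clever: it starts from measuring software diagnosability.
What does diagnosability mean? It is the ability of a program and its test suite to locate faults effectively and accurately when errors arise.

I half suspected that "software diagnosability" was a concept the authors coined themselves. I tried searching for "diagnosability" to see whether there is related work, and it turns out there is.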

Answer:

Current test quality metrics quantitatively describe how close a test-suite is to thoroughly exercising a system according to an adequacy criterion. Such criteria describe what characteristics of a program must be exercised. Examples of current metrics include branch and path coverage [1], modified decision/condition coverage [2], and mutation coverage [3]

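These test quality metrics (adequacy metrics) are what the authors criticize.

However, the authors go further and also criticize the existing diagnosability metrics. Impressive.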

The metric the authors propose is itself an integration of existing ones.
It seems integration really is the mainstream: draw on everything, and complementarity wins.
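As a rough illustration of what such an integrated metric can look like, below is a sketch of DDU as the product of three spectrum properties: normalized density, diversity and uniqueness. This post does not spell out the formulas, so the definitions here are my recollection of the ICSE 2017 paper and should be treated as assumptions to check against it; all function names are hypothetical, and the spectrum is assumed to be a non-empty 0/1 matrix with tests as rows and components as columns.

```python
from collections import Counter

def normalized_density(matrix):
    # Assumed form: density is the fraction of 1-cells; 0.5 is treated as ideal.
    cells = sum(len(row) for row in matrix)
    rho = sum(sum(row) for row in matrix) / cells
    return 1 - abs(1 - 2 * rho)

def diversity(matrix):
    # Assumed form: Gini-Simpson index over identical test rows.
    n_tests = len(matrix)
    if n_tests < 2:
        return 0.0
    counts = Counter(tuple(row) for row in matrix)
    same = sum(n * (n - 1) for n in counts.values())
    return 1 - same / (n_tests * (n_tests - 1))

def uniqueness(matrix):
    # Assumed form: distinct component columns (ambiguity groups) over all columns.
    columns = list(zip(*matrix))
    return len(set(columns)) / len(columns)

def ddu(matrix):
    return normalized_density(matrix) * diversity(matrix) * uniqueness(matrix)

# Toy spectra (assumed data).
balanced = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
all_ones = [[1, 1], [1, 1]]                 # every test covers everything
duplicated = [[1, 1, 0, 0], [1, 1, 0, 0]]   # two identical tests

print(ddu(balanced), ddu(all_ones), ddu(duplicated))
```

Note that `all_ones` has 100% coverage yet scores zero under these assumed definitions, because every test exercises every component and therefore tells SFL nothing about where a fault is. That is the gap between adequacy and diagnosability that the paper is about.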

4.2.2 Answer to question 2

Answer: here I will just briefly cover the simulation and the Defects4J experiments.

First, the simulation is the theoretical evaluation.

To measure the effectiveness of the proposed metric, we perform theoretical and empirical evaluations. The theoretical evaluation simulates a vast breadth of software systems and test suite compositions so that the range of DDU values can be effectively generated and analyzed in a holistic manner

The Defects4J experiments, on the other hand, are the empirical evaluation.

We also empirically evaluate DDU by generating test suites for real-world faulty software projects. Test generation, facilitated by the EVOSUITE tool, is guided to optimize test suites regarding a specific metric, and oracles are generated from correct project versions.

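4.2.3 Answer to question 3

It is late at night: 00:47:39 on February 9, 2019.
So I will keep this short.
The answer can be found in the introduction.

There certainly are contributions:
1) it complements the current adequacy metrics;
2) it integrates the current diagnosability metrics.

The rest needs further study before I understand it.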

5 Strengths worth mentioning

1) The paper extends the authors' ICSE 2017 work:

This paper extends our previous work [10] by (1) providing a generalization to the information-theoretic reasoning behind targeting a certain optimal spectrum density value, (2) providing a large-scale evaluation of DDU through a topology-based program spectra simulation — so that we are able to generate and analyze a vast breadth of qualitatively distinct faulty spectra —, (3) expanding our evaluation by comparing the diagnostic effectiveness of DDU versus mutation coverage, and (4) expanding our discussion on the implications of using the DDU metric for assessing diagnosability.

2) The writing is excellent and worth learning from.

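6 Weaknesses

In principle there must be weaknesses, but I do not yet understand some of the paper's concepts and domain knowledge well enough. For example, software diagnosability turns out to be related to SFL, and I did not even know what software diagnosability was before.

Without a full understanding, I naturally cannot come up with many weak points.

I just feel that:
1) In the end, it is still SFL combined with test generation to improve fault localization accuracy.
And what I like least is that, at first glance, it does not seem practical at all.
Because: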

We also empirically evaluate DDU by generating test suites for real-world faulty software projects. Test generation, facilitated by the EVOSUITE tool, is guided to optimize test suites regarding a specific metric, and oracles are generated from correct project versions.

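I have no objection to generating the tests with EvoSuite. But "oracles are generated from correct project versions" means the whole test oracle is generated by reference to a correct program. In practice, where would a correct reference program come from?

This puzzles me. It may be that I do not know test generation tools like EvoSuite well enough.

(I also think I just do not understand it well enough. I keep wondering what EvoSuite is actually good for apart from exposing crashes. It seems I need to read more papers, which is rather embarrassing.)

2) The metric is not an original invention but a three-in-one combination.
It feels like a grand synthesis, but I wonder whether this trend shows that fault localization and test generation have hit a bottleneck and can only inch forward by combining existing ideas.

Where the road leads from here still needs exploring.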

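Summary

I read this paper from 9 p.m. on February 8 until 00:48:58 on February 9, 2019.
It was not easy.

But I am quite pleased to have found this way of reading papers.
I can keep it up from now on, and I think it should pay off.

I really should not stop blogging either. It is a commitment to myself, and writing on a public platform may be of some use to fellow researchers while also pushing me to keep going.

Setting up my own GitHub blog would not be impossible, but uploading images is too much trouble and I am not used to it. A CSDN blog fits my needs best: images can be copied and pasted directly, which is really convenient.

As for a GitHub blog, maybe I will set one up someday. (I tried the Hexo + GitHub setup before, but gave up after a few posts, and I no longer know where they went.)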