How to optimize performance when repeatedly looping over a big list of objects: answers

【Question title】: How to optimize performance when repeatedly looping over a big list of objects
【Posted】: 2016-07-10 13:40:22
【Question description】:

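I have a simple file in which each line contains two integer values (a source integer and a target integer). Each line represents a relation between the two values. The file is not sorted, and the real file contains about 4 million lines. After sorting, it might look like this: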

sourceId;targetId   
1;5    
2;3   
4;7  
7;4  
8;7  
9;5

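My goal is to create new objects, each holding a unique identifier and the list of all unique integers that are related to one another. The expected output for this example is the following three objects: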

0, [1, 5, 9]  
1, [2, 3]  
2, [4, 7, 8]

So groupId 0 contains one set of related values (1, 5 and 9).

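Below is my current approach to building the list of these objects. The list of Relation objects holds all lines in memory. The list of GroupedRelation objects should be the end result.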

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GroupedRelationBuilder {

    private List<Relation> relations;
    private List<GroupedRelation> groupedRelations;
    private List<String> ids;
    private int groupId;

    public void build() {
        relations = new ArrayList<>();
        relations.add(new Relation(1, 5));
        relations.add(new Relation(4, 7));
        relations.add(new Relation(8, 7));
        relations.add(new Relation(7, 4));
        relations.add(new Relation(9, 5));
        relations.add(new Relation(2, 3));

        // sort
        relations.sort(Comparator.comparing(Relation::getSource).thenComparing(Relation::getTarget));

        // build the groupedRelations
        groupId = 0;
        groupedRelations = new ArrayList<>();
        for (int i = 0; relations.size() > 0;) {
            ids = new ArrayList<>();
            int compareSource = relations.get(i).getSource();
            int compareTarget = relations.get(i).getTarget();
            ids.add(Integer.toString(compareSource));
            ids.add(Integer.toString(compareTarget));               
            relations.remove(i);
            for (int j = 0; j < relations.size(); j++) {
                int source = relations.get(j).getSource();
                int target = relations.get(j).getTarget();
                if ((source == compareSource || source == compareTarget) && !ids.contains(Integer.toString(target))) {
                    ids.add(Integer.toString(target));                      
                    relations.remove(j);
                    continue;
                }
                if ((target == compareSource || target == compareTarget) && !ids.contains(Integer.toString(source))) {
                    ids.add(Integer.toString(source));                      
                    relations.remove(j);
                    continue;
                }
            }
            if (relations.size() > 0) {
                groupedRelations.add(new GroupedRelation(groupId++, ids));
            }
        }
    }

    class GroupedRelation {
        private int groupId;
        private List<String> relatedIds;

        public GroupedRelation(int groupId, List<String> relations) {
            this.groupId = groupId;
            this.relatedIds = relations;
        }

        public int getGroupId() {
            return groupId;
        }

        public List<String> getRelatedIds() {
            return relatedIds;
        }
    }

    class Relation {
        private int source;
        private int target;

        public Relation(int source, int target) {
            this.source = source;
            this.target = target;
        }

        public int getSource() {
            return source;
        }

        public void setSource(int source) {
            this.source = source;
        }

        public int getTarget() {
            return target;
        }

        public void setTarget(int target) {
            this.target = target;
        }
    }
}

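When I run this small sample program, creating 1,000 GroupedRelation objects takes 15 seconds. Creating 1 million GroupedRelation objects takes 250 minutes.

I am looking for help optimizing my code, which does produce the result I want but simply takes far too long.

Is it possible to optimize the iteration so that the result stays the same but the time needed to get it drops significantly? If so, how would you do it?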

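【Comments】:

You may want to look at the disjoint-set / union-find / merge-find type of data structure and algorithm; see Wikipedia. An implementation with path compression has (almost) linear complexity.
I would do this in a single pass in O(n) time, building a tree of the ids to collect.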
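
The disjoint-set idea mentioned in the comments can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the class name UnionFindSketch and its methods are hypothetical, the pairs are passed in as an int array instead of being read from the file, and only path compression is applied (adding union by size or rank would be needed for the near-linear bound).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a disjoint-set (union-find) grouping; all names are illustrative.
class UnionFindSketch {

    // parent.get(x) is the parent of x; a root is its own parent.
    private final Map<Integer, Integer> parent = new LinkedHashMap<>();

    // Returns the root of the set containing x, registering x on first sight.
    int find(int x) {
        parent.putIfAbsent(x, x);
        int root = x;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every node on the walked path directly at the root.
        while (parent.get(x) != root) {
            int next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    // Merges the sets containing a and b (no-op if they are already the same set).
    void union(int a, int b) {
        int rootA = find(a);
        int rootB = find(b);
        if (rootA != rootB) {
            parent.put(rootB, rootA);
        }
    }

    // Collects the members of each set; the index in the returned list can serve as groupId.
    List<List<Integer>> groups() {
        Map<Integer, List<Integer>> byRoot = new LinkedHashMap<>();
        for (Integer id : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(id), r -> new ArrayList<>()).add(id);
        }
        return new ArrayList<>(byRoot.values());
    }

    // Convenience wrapper: each pairs[i] is {sourceId, targetId}.
    static List<List<Integer>> group(int[][] pairs) {
        UnionFindSketch uf = new UnionFindSketch();
        for (int[] pair : pairs) {
            uf.union(pair[0], pair[1]);
        }
        return uf.groups();
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 5}, {2, 3}, {4, 7}, {7, 4}, {8, 7}, {9, 5}};
        List<List<Integer>> result = group(pairs);
        for (int groupId = 0; groupId < result.size(); groupId++) {
            System.out.println(groupId + ", " + result.get(groupId));
        }
    }
}
```

Each pair is processed once, and no list is ever scanned for membership, which is where the original nested loop spends its time.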

Tags: java performance list loops

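【Solution 1】:

The current implementation is slow because of the ids.contains step. ArrayList.contains has O(n) time complexity: to check whether the list contains an element, it examines the elements one by one, scanning the whole list in the worst case.

If you change the type of ids from List<String> to Set<String> and use a HashSet<String> instance, performance can improve considerably. HashSet.contains has an expected time complexity of O(1), which is significantly faster than scanning a list.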
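
As a small illustration of that change, the sketch below collects ids into a set instead of a list. The class and method names are made up for this example, and LinkedHashSet is my own choice (the answer names HashSet) so that the ids keep their insertion order; both offer expected O(1) membership checks.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collecting unique ids with a hash-based set instead of a list.
class ContainsSketch {

    // Returns the distinct values as strings, in order of first appearance.
    static List<String> collectUnique(int[] values) {
        Set<String> ids = new LinkedHashSet<>();
        for (int value : values) {
            // Set.add performs the membership check itself and returns false for
            // duplicates, so the separate !ids.contains(...) test is no longer needed.
            ids.add(Integer.toString(value));
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        System.out.println(collectUnique(new int[] {1, 5, 9, 5, 1}));
    }
}
```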

【Comments】:

【Solution 2】:

I would try to do this in a single pass over the source, as far as possible.

import java.io.*;
import java.util.*;

/**
 * Created by peter on 10/07/16.
 */
public class GroupedRelationBuilder {

    public static List<List<Integer>> load(File file) throws IOException {
        Map<Integer, Group> idToGroupMap = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            br.readLine();
            for (String line; (line = br.readLine()) != null; ) {
                String[] parts = line.split(";");
                Integer source = Integer.parseInt(parts[0]);
                Integer target = Integer.parseInt(parts[1]);
                Group grp0 = idToGroupMap.get(source);
                Group grp1 = idToGroupMap.get(target);
                if (grp0 == null) {
                    if (grp1 == null) {
                        Group grp = new Group();
                        List<Integer> list = grp.ids;
                        list.add(source);
                        list.add(target);
                        idToGroupMap.put(source, grp);
                        idToGroupMap.put(target, grp);
                    } else {
                        grp1.ids.add(source);
                        idToGroupMap.put(source, grp1);
                    }
                } else if (grp1 == null) {
                    grp0.ids.add(target);
                    idToGroupMap.put(target, grp0);
                } else {
                    grp0.ids.addAll(grp1.ids);
                    grp1.ids = grp0.ids;
                }
            }
        }
        Set<List<Integer>> idsSet = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Group group : idToGroupMap.values()) {
            idsSet.add(group.ids);
        }
        return new ArrayList<>(idsSet);
    }

    static class Group {
        List<Integer> ids = new ArrayList<>();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("deleteme", "txt");
        Set<String> pairs = new HashSet<>();
        try (PrintWriter pw = new PrintWriter(file)) {
            pw.println("source;target");
            Random rand = new Random();
            int count = 1000000;
            while (pairs.size() < count) {
                int a = rand.nextInt(count);
                int b = rand.nextInt(count);
                if (a < b) {
                    int t = a;
                    a = b;
                    b = t;
                }
                pairs.add(a + ";" + b);
            }
            for (String pair : pairs) {
                pw.println(pair);
            }
        }
        System.out.println("Processing");
        long start = System.currentTimeMillis();
        List<List<Integer>> results = GroupedRelationBuilder.load(file);
        System.out.println(results.size() + " took " + (System.currentTimeMillis() - start) / 1e3 + " sec");
    }
}

For one million pairs this prints:

Processing
105612 took 12.719 sec

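【Comments】:

Thanks for the detailed code and answer. It is indeed very fast, and I think building what I want directly from the source is the better approach. I did notice a few peculiarities, though. When I feed in the example from my question, idsSet contains the following: [[1, 5, 9], [4, 7, 4, 7, 8], [2, 3]]. Unfortunately there are still some duplicates. Last but not least, I need an incrementing identifier for each group, for example: groupId 0 -> [1,5,9]. Perhaps you could shed some light on what I found after implementing your solution?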
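
One possible way to address both findings, offered here as a sketch and not as the answerer's own fix: the duplicates in [4, 7, 4, 7, 8] appear when a pair such as 7;4 arrives after both ids already share a group, so the merge branch appends the list to itself. The variation below skips that case, remaps every id of the smaller group when two distinct groups meet, and uses the position in the result list as the incrementing groupId. The class name MergeGroupsSketch and the int array input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variation on the one-pass map approach; all names are illustrative.
class MergeGroupsSketch {

    // Each pairs[i] is {sourceId, targetId}; the index of a list in the result is its groupId.
    static List<List<Integer>> group(int[][] pairs) {
        Map<Integer, List<Integer>> idToGroup = new HashMap<>();
        List<List<Integer>> created = new ArrayList<>(); // groups in creation order
        for (int[] pair : pairs) {
            int source = pair[0];
            int target = pair[1];
            List<Integer> g0 = idToGroup.get(source);
            List<Integer> g1 = idToGroup.get(target);
            if (g0 == null && g1 == null) {
                List<Integer> g = new ArrayList<>();
                g.add(source);
                if (target != source) {
                    g.add(target);
                }
                idToGroup.put(source, g);
                idToGroup.put(target, g);
                created.add(g);
            } else if (g0 == null) {
                g1.add(source);
                idToGroup.put(source, g1);
            } else if (g1 == null) {
                g0.add(target);
                idToGroup.put(target, g0);
            } else if (g0 != g1) {
                // Two distinct groups meet: move the smaller one into the larger one
                // and remap its ids so no id keeps pointing at a stale list.
                List<Integer> big = g0.size() >= g1.size() ? g0 : g1;
                List<Integer> small = (big == g0) ? g1 : g0;
                for (Integer id : small) {
                    idToGroup.put(id, big);
                }
                big.addAll(small);
                small.clear();
            }
            // If g0 == g1 both ids are already grouped together: nothing to do.
        }
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> g : created) {
            if (!g.isEmpty()) {
                result.add(g); // emptied lists were merged into another group
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 5}, {2, 3}, {4, 7}, {7, 4}, {8, 7}, {9, 5}};
        List<List<Integer>> result = group(pairs);
        for (int groupId = 0; groupId < result.size(); groupId++) {
            System.out.println("groupId " + groupId + " -> " + result.get(groupId));
        }
    }
}
```

Always moving the smaller group into the larger one keeps the total remapping work modest, since an id can only change groups when its group at least doubles in size.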

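【Solution 3】:

Your implementation is slow because of the use of Integer.toString(). Converting the type means object creation and memory allocation, and it currently happens 4-5 times inside the inner loop.

Changing that took it from 126 ms to 35 ms: almost 4 times faster!

A few other things I noticed:

The first for loop can be changed to while (!relations.isEmpty())
The second loop can be done with an iterator: for (Iterator<Relation> iterator = relations.iterator(); iterator.hasNext();). As it stands, whenever you remove an item you skip the next one.
Put the declaration of ids inside the loop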
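
The skipped-item problem from the second point can be seen in a small sketch. The class and method names are hypothetical, and plain integers stand in for Relation objects to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch contrasting index-based removal with Iterator.remove().
class IteratorRemoveSketch {

    // Mirrors the question's inner loop: remove(j) shifts later elements left,
    // then j++ steps over the element that just moved into position j.
    static List<Integer> removeByIndex(List<Integer> input, int unwanted) {
        List<Integer> list = new ArrayList<>(input);
        for (int j = 0; j < list.size(); j++) {
            if (list.get(j) == unwanted) {
                list.remove(j);
            }
        }
        return list;
    }

    // Iterator.remove() deletes the element last returned by next() without skipping any.
    static List<Integer> removeWithIterator(List<Integer> input, int unwanted) {
        List<Integer> list = new ArrayList<>(input);
        for (Iterator<Integer> iterator = list.iterator(); iterator.hasNext();) {
            if (iterator.next() == unwanted) {
                iterator.remove();
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(7, 7, 3, 7);
        System.out.println(removeByIndex(input, 7));      // one 7 survives
        System.out.println(removeWithIterator(input, 7)); // all 7s removed
    }
}
```

Note that each removal from an ArrayList still shifts the remaining elements, so the iterator fixes correctness here and not the overall cost of the nested loops.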

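【Comments】:

Thanks for these improvements. I have implemented most of them and they do make it faster. The use of Integer.toString() stems from the fact that the relation file is unloaded from a database, and the primary keys are not always integers. Later in the application I will use the groupedRelation list to check whether one record is related to another. The primary key of such a record may be a string, hence the toString.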