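【Question title】: Is there any "threshold" justifying multithreaded computation?
【Posted】: 2012-05-28 14:13:31
【Question】:

So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers: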

public static void main(String[] args) {
    int mostLen = 0;
    int mostInt = 0;
    long currTime = System.currentTimeMillis();
    for(int j=2; j<=1000000; j++) {
        long i = j;
        int len = 0;
        while((i=next(i)) != 1) {
            len++;
        }
        if(len > mostLen) {
            mostLen = len;
            mostInt = j;
        }
    }
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}


static long next(long i) {
    if(i%2==0) {
        return i/2;
    } else {
        return i*3+1;
    }
}

My mistake was to try to introduce multithreading:

void doSearch() throws ExecutionException, InterruptedException {
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    long currTime = System.currentTimeMillis();
    List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
    for (int j = 2; j <= 1000000; j++) {
        MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
        worker.setBean(new ValueBean(j, 0));
        Future<ValueBean> f = executor.submit(worker);
        list.add(f);
    }
    System.out.println(System.currentTimeMillis() - currTime);

    int mostLen = 0;
    int mostInt = 0;
    for (Future<ValueBean> f : list) {
        final int len = f.get().getLen();
        if (len > mostLen) {
            mostLen = len;
            mostInt = f.get().getNum();
        }
    }
    executor.shutdown();
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

public class MyCallable<T> implements Callable<ValueBean> {
    public ValueBean bean;

    public void setBean(ValueBean bean) {
        this.bean = bean;
    }

    public ValueBean call() throws Exception {
        long i = bean.getNum();
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return new ValueBean(bean.getNum(), len);
    }
}

public class ValueBean {
    int num;
    int len;

    public ValueBean(int num, int len) {
        this.num = num;
        this.len = len;
    }

    public int getNum() {
        return num;
    }

    public int getLen() {
        return len;
    }
}

long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}

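Unfortunately, the multithreaded version ran 5 times slower than the single-threaded one on 4 processors (cores).

Then I tried a somewhat cruder approach: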

static int mostLen = 0;
static int mostInt = 0;

synchronized static void updateIfMore(int len, int intgr) {
    if (len > mostLen) {
        mostLen = len;
        mostInt = intgr;
    }
}

public static void main(String[] args) throws InterruptedException {
    long currTime = System.currentTimeMillis();
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    for (int i = 2; i <= 1000000; i++) {
        final int j = i;
        executor.execute(new Runnable() {
            public void run() {
                long l = j;
                int len = 0;
                while ((l = next(l)) != 1) {
                    len++;
                }
                updateIfMore(len, j);
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(30, TimeUnit.SECONDS);
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}


static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}

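It ran much faster, but was still slower than the single-threaded approach.

I hope this is not because I screwed up the way I am doing multithreading, but rather because this particular calculation/algorithm is not a good fit for parallel computation. If I make the calculation more processor-intensive by replacing the method next with: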

long next(long i) {
    Random r = new Random();
    for(int j=0; j<10; j++) {
        r.nextLong();
    }
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}

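then both multithreaded versions run more than twice as fast as the single-threaded version on a 4-core machine.

So clearly there must be some threshold that you can use to determine whether it is worth introducing multithreading, and my question is:

What is the basic rule that would help decide whether a given calculation is intensive enough to be optimized by running it in parallel (without spending the effort to actually implement it)?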

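【Comments】:

This is only tangential to the question, but the algorithm in question is related to the Collatz conjecture. It is better known in geek circles thanks to this and this.
I highly recommend Brian Goetz's book Java Concurrency in Practice.

Tags: java multithreading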

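【Answer 1】:

"Will the performance gain be greater than the cost of context switching and thread creation?"

That cost depends heavily on the OS, the language, and the hardware; this question has some discussion about the cost in Java, along with some numbers and some pointers on how to calculate the cost.

For CPU-intensive work you also want one thread per CPU, or fewer. Thanks to David Harkness for the pointer to a thread on how to work out that number.

【Comments】:

+1 for one thread per CPU for CPU-heavy tasks, although you generally want one per CPU for the work plus one (the main thread) for coordination.
Also, see this answer for how to find the number of available CPU cores and other useful bits.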

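【Answer 2】:

The key to implementing multithreading efficiently is to make sure the cost is not too high. There are no fixed rules, because the costs depend heavily on your hardware.

Starting and stopping threads is expensive. Of course, you already use the executor service, which reduces these costs considerably because it uses a pool of worker threads to execute your Runnables. However, each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one does will improve performance, but you still want enough runnables for the executor service to distribute them efficiently over the worker threads.

You have chosen to create one runnable for each starting value, so you end up creating 1000000 runnables. You would probably get much better results by letting each Runnable process a batch of, say, 1000 starting values. That means you only need 1000 runnables, which greatly reduces the overhead.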
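
The batching idea could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name BatchedCollatz, the helper searchRange, and the BATCH_SIZE constant are hypothetical names, and the batch size of 1000 is simply the figure suggested above, to be tuned by measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedCollatz {
    static final int LIMIT = 1_000_000;
    static final int BATCH_SIZE = 1_000; // assumption: the value suggested in the answer

    static long next(long i) {
        return (i % 2 == 0) ? i / 2 : i * 3 + 1;
    }

    // Same length definition as the loop in the question.
    static int chainLength(int start) {
        long i = start;
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return len;
    }

    // Scans from..to (inclusive) sequentially and returns {bestStart, bestLen}.
    static int[] searchRange(int from, int to) {
        int bestStart = from;
        int bestLen = -1;
        for (int j = from; j <= to; j++) {
            int len = chainLength(j);
            if (len > bestLen) {
                bestLen = len;
                bestStart = j;
            }
        }
        return new int[] {bestStart, bestLen};
    }

    // Submits one task per batch instead of one task per starting value.
    static int[] searchParallel(int limit, int batchSize)
            throws InterruptedException, ExecutionException {
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (int from = 2; from <= limit; from += batchSize) {
                final int lo = from;
                final int hi = Math.min(limit, from + batchSize - 1);
                Callable<int[]> task = () -> searchRange(lo, hi);
                futures.add(executor.submit(task));
            }
            // Combine the per-batch results on the calling thread, in batch order.
            int[] best = {0, -1};
            for (Future<int[]> f : futures) {
                int[] r = f.get();
                if (r[1] > best[1]) {
                    best = r;
                }
            }
            return best;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        int[] best = searchParallel(LIMIT, BATCH_SIZE);
        System.out.println(System.currentTimeMillis() - start);
        System.out.println("Most len is " + best[1] + " for " + best[0]);
    }
}
```

Each task keeps its own local maximum and the results are merged only once per batch, so this sketch also avoids the synchronized update on every starting value.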

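【Comments】:

+1 for using batches, since 1,000,000 tasks carry high overhead for too little payback (batching reduces the "lost productivity" caused by threads having nothing to do).

【Answer 3】:

Estimate the amount of work a thread can do without interacting with other threads (directly or through shared data). If that piece of work can be completed in 1 microsecond or less, the overhead is too high and multithreading is of no use. If it takes 1 millisecond or more, multithreading should work well. If it falls in between, experimental testing is required.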
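
One rough way to apply this rule to the code in the question is to time the sequential loop and divide by the number of work units. The sketch below is an assumption-laden illustration: the class TaskGranularity and its methods are hypothetical names, the thresholds are the 1 microsecond and 1 millisecond figures from this answer, and a naive System.nanoTime() measurement like this is only indicative (a benchmarking harness would give more trustworthy numbers).

```java
public class TaskGranularity {
    static long next(long i) {
        return (i % 2 == 0) ? i / 2 : i * 3 + 1;
    }

    // One "work unit" as submitted per Runnable in the question.
    static int chainLength(int start) {
        long i = start;
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return len;
    }

    // Average wall-clock nanoseconds per work unit for starting values 2..limit.
    static double averageNanosPerTask(int limit) {
        long checksum = 0; // consumed below so the JIT cannot drop the loop
        long start = System.nanoTime();
        for (int j = 2; j <= limit; j++) {
            checksum += chainLength(j);
        }
        long elapsed = System.nanoTime() - start;
        if (checksum < 0) {
            throw new IllegalStateException("unexpected checksum");
        }
        return (double) elapsed / (limit - 1);
    }

    // Applies the thresholds stated in the answer: <= 1 us, >= 1 ms, or in between.
    static String verdict(double nanosPerTask) {
        if (nanosPerTask <= 1_000) {
            return "too fine-grained: batch the work or stay single-threaded";
        }
        if (nanosPerTask >= 1_000_000) {
            return "coarse enough: multithreading should work well";
        }
        return "in between: test experimentally";
    }

    public static void main(String[] args) {
        averageNanosPerTask(1_000_000); // warm-up pass so the JIT has compiled the loop
        double nanos = averageNanosPerTask(1_000_000);
        System.out.println("Average ns per task: " + nanos);
        System.out.println(verdict(nanos));
    }
}
```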

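【Comments】:

【Answer 4】:

I think there is another component that you are not considering. Parallelization works best when the units of work do not depend on each other. Running a calculation in parallel is suboptimal when later results depend on earlier results. The dependence can be strong, in the sense of "I need the first value to compute the second value". In that case the task is completely serial, and later values cannot be computed without waiting for the earlier computations. There can also be a weaker dependence, in the sense of "if I had the first value, I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.

This problem lends itself to optimization without multithreading, because some of the later values can be computed faster if you already have the earlier results in hand. Take j == 4 as an example. One pass through the inner loop produces i == 2, but you computed the result for j == 2 just two iterations earlier; if you saved that value of len, you can compute the new one as len(4) = 1 + len(2).

By using an array to store the previously computed values of len, plus a small tweak to the next method, you can complete the task more than 50 times faster.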
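
A minimal sketch of the array-cache idea, under stated assumptions: the class MemoizedCollatz and the method search are hypothetical names, and the sketch does not include the adjustment to next that the answer mentions. For simplicity it caches the number of steps needed to reach 1 (so steps(4) = 1 + steps(2)) and subtracts one at the end to match the len defined in the question.

```java
public class MemoizedCollatz {
    static long next(long i) {
        return (i % 2 == 0) ? i / 2 : i * 3 + 1;
    }

    // Returns {bestStart, bestLen} for starting values 2..limit,
    // where len follows the question's definition (steps to reach 1, minus one).
    static int[] search(int limit) {
        // steps[k] = number of applications of next() needed to reach 1 from k.
        // steps[1] stays 0; entries are filled in increasing order of k.
        int[] steps = new int[limit + 1];
        int bestStart = 0;
        int bestSteps = -1;
        for (int j = 2; j <= limit; j++) {
            long i = j;
            int count = 0;
            // Walk the sequence only until it drops below j:
            // every smaller value has already been cached.
            while (i >= j) {
                i = next(i);
                count++;
            }
            steps[j] = count + steps[(int) i];
            if (steps[j] > bestSteps) {
                bestSteps = steps[j];
                bestStart = j;
            }
        }
        return new int[] {bestStart, bestSteps - 1};
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        int[] best = search(1_000_000);
        System.out.println(System.currentTimeMillis() - start);
        System.out.println("Most len is " + best[1] + " for " + best[0]);
    }
}
```

The inner loop stops as soon as the sequence falls below the current starting value, which is exactly the "i < j means it is in the cache" property, so no bounds check against the array size is needed.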

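【Comments】:

Yep, that is 8 times faster than the multithreaded version with batches of 1000! I wonder whether I can multithread this one.
@OlegMikheev It is possible. I would look into ConcurrentHashMap so that the cache can be built without worrying about locking. That said, I think the array implementation is so fast because, as long as i < j, you know the value is in the cache, whereas a hash lookup may be much slower. You could also exploit other mathematical properties of the next function; for example, it is easy to prove that, for a limit n, the j with the maximum length must satisfy j > n / 2. That helps the multithreaded solution, but not the cache solution. Also, the simple array cache will not scale beyond roughly 42,000,000.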