Haskell 性能：组合与应用？答案

【问题标题】：Haskell performance: Composition vs Application?Haskell 性能：组合与应用？
【发布时间】：2017-11-06 08:24:39
【问题描述】：

我看到了一些关于函数组合和应用程序之间的异同以及实现它的各种方法的问题，但是我开始有点困惑的一件事（就我搜索而言还没有被问到）是关于性能差异。

当我学习 F# 时，我爱上了管道运算符 |>，它在 haskell 的反向应用程序 & 中是等效的。但在我看来，F# 变体无疑更漂亮（而且我不认为我是唯一的）。

现在，可以轻松地将管道操作符破解到 haskell 中：

(|>) x f = f x

它就像一个魅力！问题解决了！

管道之间的最大区别（在 F# 和我们的 haskell 技巧中）是它不组合函数，它基于函数应用程序。它接受左边的值并将其传递给右边的函数，而不是组合，它接受两个函数并返回另一个函数，然后可以将其用作任何常规函数。

至少对我来说，这使代码更漂亮，因为您只使用一个运算符来引导整个函数中的信息流，从参数到最终值，因为使用基本组合（或 >>>）您无法放置左侧的一个值，使其通过“链”。

但是从性能的角度来看，看看这些通用选项，结果应该是完全相同的：

f x = x |> func1 |> func2 |> someLambda |> someMap |> someFold |> show

f x = x & (func1 >>> func2 >>> someLambda >>> someMap >>> someFold >>> show)

f x = (func1 >>> func2 >>> someLambda >>> someMap >>> someFold >>> show) x

哪一个最快，基于重复应用的还是基于组合和单一应用的？

【问题讨论】：

你有没有试过做测试看看哪个更快？
@AJFarmar 我想过，但是有没有办法在 haskell 中测量执行时间？
如果没有办法在 haskell 中测量执行时间，您的问题“哪个更快”将无法回答。
在几乎所有情况下，执行时间都将完全相同，至少如果您使用优化进行编译，因为优化后的代码将完全相同。如果您不进行优化编译，我猜第一个可能会更快，但是如果您关心性能，为什么不进行优化编译呢？
考虑运行criterion 并编写一个小型基准测试。很可能，启用优化后，您不会发现任何差异。

标签： haskell f# functional-programming

【解决方案1】：

根本不应该有任何区别，只要 (|>) 和 (>>>) 内联。让我们编写一个使用四个不同函数的示例，两个是 F# 风格，两个是 Haskell 风格：

import Data.Char (isUpper)

{-# INLINE (|>) #-}
(|>) :: a -> (a -> b) -> b
(|>) x f = f x

{-# INLINE (>>>) #-}
(>>>) :: (a -> b) -> (b -> c) -> a -> c
(>>>) f g x = g (f x)

compositionF :: String -> String
compositionF = filter isUpper >>> length >>> show 

applicationF :: String -> String
applicationF x = x |> filter isUpper |> length |> show 

compositionH :: String -> String
compositionH = show . length . filter isUpper

applicationH :: String -> String
applicationH x = show $ length $ filter isUpper $ x

main :: IO ()
main = do
  getLine >>= putStrLn . compositionF  -- using the functions
  getLine >>= putStrLn . applicationF  -- to make sure that
  getLine >>= putStrLn . compositionH  -- we actually get the
  getLine >>= putStrLn . applicationH  -- corresponding GHC core

如果我们用-ddump-simpl -dsuppress-all -O0 编译我们的代码，我们会得到：

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 82, types: 104, coercions: 0}

-- RHS size: {terms: 9, types: 11, coercions: 0}
>>>_rqe
>>>_rqe =
  \ @ a_a1cE @ b_a1cF @ c_a1cG f_aqr g_aqs x_aqt ->
    g_aqs (f_aqr x_aqt)

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1gR
$trModule1_r1gR = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1h6
$trModule2_r1h6 = TrNameS "Main"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1gR $trModule2_r1h6

-- RHS size: {terms: 58, types: 73, coercions: 0}
main
main =
  >>
    $fMonadIO
    (>>=
       $fMonadIO
       getLine
       (. putStrLn
          (>>>_rqe
             (>>>_rqe (filter isUpper) (length $fFoldable[]))
             (show $fShowInt))))
    (>>
       $fMonadIO
       (>>=
          $fMonadIO
          getLine
          (. putStrLn
             (\ x_a10M ->
                show $fShowInt (length $fFoldable[] (filter isUpper x_a10M)))))
       (>>
          $fMonadIO
          (>>=
             $fMonadIO
             getLine
             (. putStrLn
                (. (show $fShowInt) (. (length $fFoldable[]) (filter isUpper)))))
          (>>=
             $fMonadIO
             getLine
             (. putStrLn
                (\ x_a10N ->
                   show $fShowInt (length $fFoldable[] (filter isUpper x_a10N)))))))

-- RHS size: {terms: 2, types: 1, coercions: 0}
main
main = runMainIO main

所以如果我们不启用优化，>>> 不会被内联。但是，如果我们启用优化，您将根本看不到 >>> 或 (.)。我们的函数略有不同，因为 (.) 在那个阶段没有内联，但这是意料之中的。

如果我们将{-# NOINLINE … #-} 添加到我们的函数并启用优化，我们会发现这四个函数根本没有区别：

$ ghc -ddump-simpl -dsuppress-all -O2 Example.hs
[1 of 1] Compiling Main             ( Example.hs, Example.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 261, types: 255, coercions: 29}

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2
$trModule2 = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1
$trModule1 = TrNameS "Main"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule2 $trModule1

Rec {
-- RHS size: {terms: 29, types: 20, coercions: 0}
$sgo_r574
$sgo_r574 =
  \ sc_s55y sc1_s55x ->
    case sc1_s55x of _ {
      [] -> I# sc_s55y;
      : y_a2j9 ys_a2ja ->
        case y_a2j9 of _ { C# c#_a2hF ->
        case {__pkg_ccall base-4.9.1.0 u_iswupper Int#
                                     -> State# RealWorld -> (# State# RealWorld, Int# #)}_a2hE
               (ord# c#_a2hF) realWorld#
        of _ { (# ds_a2hJ, ds1_a2hK #) ->
        case ds1_a2hK of _ {
          __DEFAULT -> $sgo_r574 (+# sc_s55y 1#) ys_a2ja;
          0# -> $sgo_r574 sc_s55y ys_a2ja
        }
        }
        }
    }
end Rec }

-- RHS size: {terms: 15, types: 14, coercions: 0}
applicationH
applicationH =
  \ x_a12X ->
    case $sgo_r574 0# x_a12X of _ { I# ww3_a2iO ->
    case $wshowSignedInt 0# ww3_a2iO []
    of _ { (# ww5_a2iS, ww6_a2iT #) ->
    : ww5_a2iS ww6_a2iT
    }
    }

Rec {
-- RHS size: {terms: 29, types: 20, coercions: 0}
$sgo1_r575
$sgo1_r575 =
  \ sc_s55r sc1_s55q ->
    case sc1_s55q of _ {
      [] -> I# sc_s55r;
      : y_a2j9 ys_a2ja ->
        case y_a2j9 of _ { C# c#_a2hF ->
        case {__pkg_ccall base-4.9.1.0 u_iswupper Int#
                                     -> State# RealWorld -> (# State# RealWorld, Int# #)}_a2hE
               (ord# c#_a2hF) realWorld#
        of _ { (# ds_a2hJ, ds1_a2hK #) ->
        case ds1_a2hK of _ {
          __DEFAULT -> $sgo1_r575 (+# sc_s55r 1#) ys_a2ja;
          0# -> $sgo1_r575 sc_s55r ys_a2ja
        }
        }
        }
    }
end Rec }

-- RHS size: {terms: 15, types: 15, coercions: 0}
compositionH
compositionH =
  \ x_a1jF ->
    case $sgo1_r575 0# x_a1jF of _ { I# ww3_a2iO ->
    case $wshowSignedInt 0# ww3_a2iO []
    of _ { (# ww5_a2iS, ww6_a2iT #) ->
    : ww5_a2iS ww6_a2iT
    }
    }

Rec {
-- RHS size: {terms: 29, types: 20, coercions: 0}
$sgo2_r576
$sgo2_r576 =
  \ sc_s55k sc1_s55j ->
    case sc1_s55j of _ {
      [] -> I# sc_s55k;
      : y_a2j9 ys_a2ja ->
        case y_a2j9 of _ { C# c#_a2hF ->
        case {__pkg_ccall base-4.9.1.0 u_iswupper Int#
                                     -> State# RealWorld -> (# State# RealWorld, Int# #)}_a2hE
               (ord# c#_a2hF) realWorld#
        of _ { (# ds_a2hJ, ds1_a2hK #) ->
        case ds1_a2hK of _ {
          __DEFAULT -> $sgo2_r576 (+# sc_s55k 1#) ys_a2ja;
          0# -> $sgo2_r576 sc_s55k ys_a2ja
        }
        }
        }
    }
end Rec }

-- RHS size: {terms: 15, types: 15, coercions: 0}
compositionF
compositionF =
  \ x_a1jF ->
    case $sgo2_r576 0# x_a1jF of _ { I# ww3_a2iO ->
    case $wshowSignedInt 0# ww3_a2iO []
    of _ { (# ww5_a2iS, ww6_a2iT #) ->
    : ww5_a2iS ww6_a2iT
    }
    }

Rec {
-- RHS size: {terms: 29, types: 20, coercions: 0}
$sgo3_r577
$sgo3_r577 =
  \ sc_s55d sc1_s55c ->
    case sc1_s55c of _ {
      [] -> I# sc_s55d;
      : y_a2j9 ys_a2ja ->
        case y_a2j9 of _ { C# c#_a2hF ->
        case {__pkg_ccall base-4.9.1.0 u_iswupper Int#
                                     -> State# RealWorld -> (# State# RealWorld, Int# #)}_a2hE
               (ord# c#_a2hF) realWorld#
        of _ { (# ds_a2hJ, ds1_a2hK #) ->
        case ds1_a2hK of _ {
          __DEFAULT -> $sgo3_r577 (+# sc_s55d 1#) ys_a2ja;
          0# -> $sgo3_r577 sc_s55d ys_a2ja
        }
        }
        }
    }
end Rec }

-- RHS size: {terms: 15, types: 14, coercions: 0}
applicationF
applicationF =
  \ x_a12W ->
    case $sgo3_r577 0# x_a12W of _ { I# ww3_a2iO ->
    case $wshowSignedInt 0# ww3_a2iO []
    of _ { (# ww5_a2iS, ww6_a2iT #) ->
    : ww5_a2iS ww6_a2iT
    }
    }
...

所有go 函数完全相同（无变量名），application* 与composition* 相同。所以继续在 Haskell 中创建自己的 F# 前奏，应该不会有任何性能问题。

【讨论】：

【解决方案2】：

我的回答是关于 F#。

在大多数情况下，F# 编译器能够将管道优化为相同的代码：

let f x = x |> (+) 1 |> (*) 2 |> (+) 2
let g x = x |> ((+) 1 >> (*) 2 >> (+) 2)

反编译f和g我们看到编译器达到了相同的结果：

public static int f(int x)
{
    return 2 + 2 * (1 + x);
}
public static int g(int x)
{
    return 2 + 2 * (1 + x);
}

但它似乎并不总是成立，正如我们在稍微更高级的管道中看到的那样：

let f x = x |>  Array.map add1 |> Array.map mul2 |> Array.map add2 |> Array.reduce (+)
let g x = x |> (Array.map add1 >> Array.map mul2 >> Array.map add2 >> Array.reduce (+))

反编译显示一些差异：

public static int f(int[] x)
{
  FSharpFunc<int, FSharpFunc<int, int>> arg_25_0 = new Program.f@9();
  if (x == null)
  {
    throw new ArgumentNullException("array");
  }
  int[] array = new int[x.Length];
  FSharpFunc<int, FSharpFunc<int, int>> fSharpFunc = arg_25_0;
  for (int i = 0; i < array.Length; i++)
  {
    array[i] = x[i] + 1;
  }
  FSharpFunc<int, FSharpFunc<int, int>> arg_6C_0 = fSharpFunc;
  int[] array2 = array;
  if (array2 == null)
  {
    throw new ArgumentNullException("array");
  }
  array = new int[array2.Length];
  fSharpFunc = arg_6C_0;
  for (int i = 0; i < array.Length; i++)
  {
    array[i] = array2[i] * 2;
  }
  FSharpFunc<int, FSharpFunc<int, int>> arg_B3_0 = fSharpFunc;
  int[] array3 = array;
  if (array3 != null)
  {
    array2 = new int[array3.Length];
    fSharpFunc = arg_B3_0;
    for (int i = 0; i < array2.Length; i++)
    {
      array2[i] = array3[i] + 2;
    }
    return ArrayModule.Reduce<int>(fSharpFunc, array2);
  }
  throw new ArgumentNullException("array");
}

public static int g(int[] x)
{
  FSharpFunc<int[], int[]> f = new Program.g@10-1();
  FSharpFunc<int[], int[]> fSharpFunc = new Program.g@10-3(f);
  FSharpFunc<int, FSharpFunc<int, int>> reduction = new Program.g@10-4();
  int[] array = fSharpFunc.Invoke(x);
  return ArrayModule.Reduce<int>(reduction, array);
}

对于f，F# 将管道内联，但最后的 reduce 除外。

对于g，管道被构造然后调用。这意味着g 可能比f 稍微慢一些，并且内存占用更多。

在这个特定的示例中，它可能并不重要，因为我们正在创建数组对象并对其进行迭代，但如果所组成的函数在 CPU 和内存方面非常便宜，那么建立和调用管道的成本可能会被证明是相关的。

如果关键性能对您很重要，我建议您使用一个好的反编译器工具来确保生成的代码不包含意外开销。否则你可能对这两种方法都很好。

【讨论】：

所有这些丑陋的null 检查都是来自内联的Array.map 函数，还是仅由管道生成？
If critical performance is important to you I recommend getting a good decompiler tool to make sure that the generated code doesn't contain unexpected overhead- 然后明确评论为什么代码是这样的，这样别人就不会过度重构它。
@leftaroundabout 它们似乎来自module Array 的内联函数，这些函数在许多函数中使用checkNonNull "array" array。更聪明的编译器可能会发现其中许多检查是不必要的并将它们删除。如果我们幸运的话，抖动会为我们消除它们。