为了跟进 Don 2010 年的回复,在 GHC 8.0.2 上我们可以做得更好。首先让我们试试他的版本。
module Main (main) where
import System.CPUTime.Rdtsc (rdtsc)
import Text.Printf (printf)
import qualified Data.Vector.Unboxed as U
data Pair = Pair {-# UNPACK #-}!Int {-# UNPACK #-}!Double
mean' :: U.Vector Double -> Double
mean' xs = s / fromIntegral n
where
Pair n s = U.foldl' k (Pair 0 0) xs
k (Pair n s) x = Pair (n+1) (s+x)
main :: IO ()
main = do
s <- rdtsc
let r = mean' (U.enumFromN 1 30000000)
e <- seq r rdtsc
print (e - s, r)
这给了我们
[nix-shell:/tmp]$ ghc -fforce-recomp -O2 MeanD.hs -o MeanD && ./MeanD +RTS -s
[1 of 1] Compiling Main ( MeanD.hs, MeanD.o )
Linking MeanD ...
(372877482,1.50000005e7)
240,104,176 bytes allocated in the heap
6,832 bytes copied during GC
44,384 bytes maximum residency (1 sample(s))
25,248 bytes maximum slop
230 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.006s 0.006s 0.0062s 0.0062s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.087s ( 0.087s elapsed)
GC time 0.006s ( 0.006s elapsed)
EXIT time 0.006s ( 0.006s elapsed)
Total time 0.100s ( 0.099s elapsed)
%GC time 6.2% (6.2% elapsed)
Alloc rate 2,761,447,559 bytes per MUT second
Productivity 93.8% of total user, 93.8% of total elapsed
但是代码很简单:理想情况下应该不需要向量:优化代码应该可以通过内联列表生成来实现。幸运的是,GHC 可以为我们做到这一点[0]。
module Main (main) where
import System.CPUTime.Rdtsc (rdtsc)
import Text.Printf (printf)
import Data.List (foldl')
data Pair = Pair {-# UNPACK #-}!Int {-# UNPACK #-}!Double
mean' :: [Double] -> Double
mean' xs = v / fromIntegral l
where
Pair l v = foldl' f (Pair 0 0) xs
f (Pair l' v') x = Pair (l' + 1) (v' + x)
main :: IO ()
main = do
s <- rdtsc
let r = mean' $ fromIntegral <$> [1 :: Int .. 30000000]
-- This is slow!
-- r = mean' [1 .. 30000000]
e <- seq r rdtsc
print (e - s, r)
这给了我们:
[nix-shell:/tmp]$ ghc -fforce-recomp -O2 MeanD.hs -o MeanD && ./MeanD +RTS -s
[1 of 1] Compiling Main ( MeanD.hs, MeanD.o )
Linking MeanD ...
(128434754,1.50000005e7)
104,064 bytes allocated in the heap
3,480 bytes copied during GC
44,384 bytes maximum residency (1 sample(s))
17,056 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.032s ( 0.032s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.033s ( 0.032s elapsed)
%GC time 0.1% (0.1% elapsed)
Alloc rate 3,244,739 bytes per MUT second
Productivity 99.8% of total user, 99.8% of total elapsed
[0]:注意我必须如何映射fromIntegral:没有这个,GHC 无法消除[Double],并且解决方案要慢得多。这有点可悲:我不明白为什么 GHC 无法内联/决定它不需要没有这个。如果你确实有真正的分数集合,那么这个技巧对你不起作用,向量可能仍然是必要的。