优化 Haskell 代码答案

【问题标题】：Optimizing Haskell code优化 Haskell 代码
【发布时间】：2011-02-24 07:13:27
【问题描述】：

我正在尝试学习 Haskell，在 reddit 上一篇关于 Markov 文本链的文章之后，我决定先在 Python 中实现 Markov 文本生成，现在在 Haskell 中实现。但是我注意到我的 python 实现比 Haskell 版本快得多，甚至 Haskell 也被编译为本机代码。我想知道我应该怎么做才能使 Haskell 代码运行得更快，现在我相信它会因为使用 Data.Map 而不是 hashmap 而慢得多，但我不确定

我将发布 Python 代码和 Haskell。使用相同的数据，Python 大约需要 3 秒，而 Haskell 则接近 16 秒。

毫无疑问，我会接受任何建设性的批评:)。

import random
import re
import cPickle
class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] is not ".":
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])

--

module Markov
( train
, fox
) where

import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) = 
     let l = train (y:z:xs)
     in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l

main = do
  contents <- B.readFile "76.txt"
  print $ train $ B.words contents

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

【问题讨论】：

有意思，也在寻找答案。 16 秒与 3 秒的差别真的很大。
顺便说一句，Python 代码的缩进似乎被破坏了......
我不认为你的 Haskell 代码能完成你想要的。如果您检查输出，您将看到M.Map String Int 映射中没有大于 2 的值。你是说n + o 还是o + 1 而不是n + 1？
@Travis 你是绝对正确的，但它应该在编辑版本中修复
您在“in M.insertWith'”开头的行中使用seq 是可疑的。您正在构建一个大型表达式并对其进行评估，然后丢弃结果并返回其他内容。你的意思是切换参数，即 l seq M.insertWith ...

标签： performance optimization haskell

【解决方案1】：

a) 你是如何编译它的？ (ghc -O2 ?)

b) 哪个版本的 GHC？

c) Data.Map 非常高效，但您可能会被欺骗进行延迟更新——使用 insertWith' ，而不是 insertWithKey。

d) 不要将字节串转换为字符串。将它们保存为字节串，并将它们存储在 Map 中

【讨论】：

版本为6.12.1。在您的帮助下，我能够从运行时中挤出 1 秒，但距离 python 版本还很远。

【解决方案2】：

Data.Map 是在假设 Ord 类比较花费恒定时间的情况下设计的。对于字符串键，情况可能并非如此——当字符串相等时，情况绝不会如此。你可能会也可能不会遇到这个问题，这取决于你的语料库有多大以及有多少单词有共同的前缀。

我很想尝试一种旨在使用序列键进行操作的数据结构，例如 Don Stewart 建议的 bytestring-trie 包。

【讨论】：

@don：感谢您的更新。我相信你至少知道 hackage 的 60% 的内容 :-)

【解决方案3】：

我尽量避免做任何花哨或微妙的事情。这些只是进行分组的两种方法；第一个强调模式匹配，第二个不强调。

import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
    where go (x:y:[]) m = m
          go (x:y:z:xs) m = let addWord Nothing   = Just $ M.singleton z 1
                                addWord (Just m') = Just $ M.alter inc z m'
                                inc Nothing    = Just 1
                                inc (Just cnt) = Just $ cnt + 1
                            in go (y:z:xs) $ M.alter addWord (x,y) m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.alter (addWord z) (x,y) m
          addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
          inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"

我认为它们都比原始版本更快，但我承认我只在我找到的第一个合理的语料库上尝试了它们。

编辑根据 Travis Brown 非常有效的观点，

train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
          inc k _ = M.insertWith (+) k 1

【讨论】：

就风格而言，我认为在这里使用比alter 更具体的东西会更好。我们知道在这种情况下我们永远不需要删除，并且必须像这样添加 Just 会损害可读性。
抱歉回复晚了。您能否解释一下为什么这是一个更快的解决方案？基本上两者都做同样的事情，除了拉链和放下。

【解决方案4】：

这是一个基于foldl' 的版本，它的速度似乎是您的train 的两倍：

train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
  where
    f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)

我在 Project Gutenberg Huckleberry Finn（我假设是您的 76.txt）上进行了尝试，它产生的输出与您的函数相同。我的时序比较很不科学，但这种方法可能值得一看。

【讨论】：

【解决方案5】：

1) 我不清楚你的代码。 a）您定义了“狐狸”但不使用它。您的意思是让我们尝试帮助您使用“狐狸”而不是阅读文件吗？ b）您将其声明为“模块马尔科夫”，然后在模块中有一个“主”。 c) 不需要 System.Random。如果您在发布之前稍微清理一下代码，它确实可以帮助我们帮助您。

2) 使用 ByteStrings 和 Don 说的一些严格的操作。

3) 使用 -O2 编译并使用 -fforce-recomp 确保您确实重新编译了代码。

4) 试试这个轻微的转换，它的工作速度非常快（0.005 秒）。显然，输入非常小，因此您需要提供文件或自己测试。

{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where

import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train xs = go xs M.empty
  where
  go :: [B.ByteString] -> Database -> Database
  go (x:y:[]) !m = m
  go (x:y:z:xs) !m =
     let m' =  M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m
     in go (y:z:xs) m'

main = print $ train $ B.words fox

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

【讨论】：

嗯，是的，我是一个初学者，就像标签上写的：P。我没有意识到将模块命名为 Main 以外的其他名称的后果。狐狸被我用来测试算法。检查小输入比检查整本书的输入更容易

【解决方案6】：

正如 Don 建议的那样，考虑使用更严格的函数版本：insertWithKey'（和 M.insertWith'，因为无论如何您第二次都忽略了关键参数）。

看起来您的代码可能会构建很多 thunk，直到它到达您的 [String] 末尾。

查看：http://book.realworldhaskell.org/read/profiling-and-optimization.html

...尤其是尝试绘制堆图形（大约在本章的一半）。有兴趣看看你的想法。

【讨论】：

我做了 Don Stewart 建议的更改。以前代码占用 41-44 兆字节的内存，现在只占用 29 兆字节。内存图形显示 TSO 占用了大部分内存，然后是 GHC.types，然后是代码中使用的其他数据类型。内存在所有部分快速增加一秒钟。在那一秒钟之后，TSO 和 GHC.types 不断增加，所有其他的都开始缓慢下降。（如果我没看错图）