从“Real World Haskell”实现 splitWith 的正确方法是什么？答案

【问题标题】：What is correct way to implement splitWith from "Real World Haskell?"从“Real World Haskell”实现 splitWith 的正确方法是什么？
【发布时间】：2013-10-23 13:30:48
【问题描述】：

我一直在通过真实世界的 haskell 工作并尝试进行练习。我设法从第 4.5 章练习 2 中实现了 splitWith 的工作版本。我觉得这不是一个非常haskell 的做事方式。必须使用累加器实现新功能似乎非常迂回。有没有更惯用的方法来做到这一点，比如折叠？我查看了 foldl 的文档，但对如何操作感到摸不着头脑。

splitWith :: (a -> Bool) -> [a] -> [[a]]
splitWith _ [] = []
splitWith f a  = splitWithAcc f a []
  where 
    splitWithAcc :: (a -> Bool) -> [a] -> [[a]] -> [[a]]
    splitWithAcc f xs acc
      | null xs     = acc
      | f $ head xs = splitWithAcc f (dropWhile f xs) (acc ++ [takeWhile f xs])
      | otherwise   = splitWithAcc f (tail xs) acc

澄清

这是练习的正文：

编写一个函数 splitWith ，它的作用类似于单词，但接受一个谓词和一个任意类型的列表，然后在谓词返回 False 的每个元素上拆分其输入列表：

【问题讨论】：

期望的输出是什么？当我以splitWith (== ' ') "This is a test" 运行此代码时，我得到[" "," "," "]，但如果我将它从== 更改为/=，我得到["This","is","a","test"]。您是要在条件为假的地方拆分，还是在条件为真的地方拆分？
我添加了练习的文字。正如你所说，这似乎有点不寻常。当谓词为假时，它想分裂。
我已更新我的答案以反映这一变化。

标签： haskell

【解决方案1】：

递归是你的朋友，但我会做一些不同的事情。首先，当我分裂时，我会让我的条件为真，而不是让它为假。其次，我将使用Data.List 中的一个方便的函数，称为break

> :t break
break :: (a -> Bool) -> [a] -> ([a], [a])
> break (== ' ') "This is a test"
("This", " is a test")

我会使用它来定义我的函数

splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' cond [] = []
splitWith' cond xs = first : splitWith' cond (safeTail rest)
    where
        (first, rest) = break cond xs
        -- Need this function to handle an empty list
        safeTail [] = []
        safeTail (_:ys) = ys

或者，如果你想写得尽可能混乱

splitWith'' :: (a -> Bool) -> [a] -> [[a]]
splitWith'' _ [] = []
splitWith'' cond xs = uncurry (:) $ fmap (splitWith'' cond . safeTail) $ break cond xs
    where
        safeTail [] = []
        safeTail (_:ys) = ys

这是有效的，因为fmap over 2-tuples 将函数应用于元组的第二个元素。然后它 uncurries : 并将其应用于第一个和其余部分。

更新

如果你希望它在谓词为 false 时拆分，你可以使用span 而不是break，或者直接定义为

splitWithWeird cond xs = splitWith' (not . cond) xs

虽然第二个显然会产生稍小的开销（除非编译器可以优化它）

更新 2

如果您想处理重复的字符，有一个简单、快速的解决方法（如果它适合您的需要）：

> filter (not . null) $ splitWithWeird (/= ' ') "This  is   a    test"
["This","is","a","test"]

有了如此简单的修复，我们可能会想将其内置到算法本身中：

splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond [] = []
splitWithWeird cond xs = filter (not . null) $ first : splitWithWeird cond (safeTail rest)
    where
        (first, rest) = span cond xs
        safeTail [] = []
        safeTail (_:ys) = ys

但这不是一个好主意。由于这是一个递归函数，因此您将在每个级别添加对 filter (not . null) 的调用，因此在函数中的每个拆分位置。所有这些都必须在返回之前扫描整个列表，因此必须执行额外的检查。最好将它定义为一个单独的函数，这样filter (not . null) 只会被调用一次：

splitWithWeird' :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird' cond xs = filter (not . null) $ splitWithWeird cond xs

或者，如果您希望将其内置到算法中：

splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond xs = filter (not . null) $ splitWithHelper cond xs
    where
        safeTail [] = []
        safeTail (_:ys) = ys
        splitWithHelper cond [] = []
        splitWithHelper cond xs =
            let (first, rest) = span cond xs
            in first : splitWithHelper cond (safeTail rest)

这实际上只是在内部做与定义两个函数相同的事情。请注意，我必须在这里使用附加的let ... in ... 语句（我不喜欢嵌套wheres），因为(first, rest) = span cond xs 属于splitWithHelper，而不是splitWithWeird。如果将其留在 where 子句中，算法将不起作用。

更新 3

不想在这里只留下一个不理想的解决方案，我已经继续编写了一个算法，用于在子序列上而不是在条件或元素上进行拆分。它确实使用了Control.Arrow 中的first 函数，但只是为了使代码更加紧凑。

import Control.Arrow (first)

isPrefixOf :: Eq a => [a] -> [a] -> Bool
isPrefixOf [] _ = True
isPrefixOf _ [] = False
isPrefixOf (x:xs) (y:ys) = x == y && isPrefixOf xs ys

splitSubseq :: Eq a => [a] -> [a] -> [[a]]
splitSubseq sub [] = []
splitSubseq sub xs = initial : splitSubseq sub rest
    where
        lsub = length sub
        splitter [] = ([], [])
        splitter yss@(y:ys)
            | isPrefixOf sub yss = ([], drop lsub yss)
            | otherwise = first (y :) $ splitter ys
        (initial, rest) = splitter xs

我并不是说这是一个有效的解决方案，但它应该很容易遵循。首先，我定义了一个函数isPrefixOf，如果第二个列表以第一个列表开头，则返回 True。

我想保持相同的递归模式（first : recursive rest），所以我写了splitter来代替span或break，这就是isPrefixOf的用武之地。如果子序列是列表的前缀，则返回([], restAfterSubsequence)，否则存储列表的第一个字符，然后从下一个元素开始重复此操作。我在这里使用first 只是为了递归简洁地编写这个函数。它只是将(y :) 应用于splitter 返回值的第一个元素。从splitter 返回的元组的第二个元素只是输入的其余部分尚未被使用。

如果您有兴趣，这里是该算法的性能统计数据（使用--make -O2，i5 quad 编译）：

main = print $ sum $ take (10 ^ 7) $ map length $ splitSubseq " " $ cycle "Testing "

70000000
   6,840,052,808 bytes allocated in the heap
       2,032,868 bytes copied during GC
          42,900 bytes maximum residency (2 sample(s))
          22,636 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     13114 colls,     0 par    0.06s    0.07s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0004s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.68s  (  3.74s elapsed)
  GC      time    0.06s  (  0.07s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.74s  (  3.81s elapsed)

然后去嵌入求和和长度：

main = print $ sum $ take (10 ^ 7) $ map length $ repeat "Testing"

70000000
     240,052,572 bytes allocated in the heap
          12,812 bytes copied during GC
          42,900 bytes maximum residency (2 sample(s))
          22,636 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       458 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.09s  (  0.09s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.11s  (  0.09s elapsed)

所以我们可以看到，这仅花费了大约 0.1 秒的时间，留给我们大约 3.64 秒的时间让该算法拆分由 "Testing " 重复 1000 万次组成的字符串，所有这些都使用了少量内存。唯一的缺点是，当使用-threaded 编译并运行更多内核时，该算法实际上会显着减慢。

【讨论】：

如果单词之间有两个或多个空格怎么办？
那么你需要重新编写算法。这仅检查要拆分的单个字符。匹配子串有点复杂。但是，当前算法的行为是将多余的字符替换为空字符串，因此您只需在拆分后添加 filter (not . null) 即可。不要将其放入算法中，原因我将在我的答案的新更新中解释。
filter (not.null) 实际上是有道理的。
@MichaelChav 请记住，这个答案现在已经快 2 年了，因此较新版本的 GHC 可能具有不同的性能特征。我不确定为什么它在线程化时会变慢，这可能是在使用或不使用 -threaded 选项进行优化时启用了不同的优化。

【解决方案2】：

想象一下foldr 从右边构建它的结果：

splitWith f xs = case foldr g [[]] xs of {([]:r)-> r; r->r}
  where
    g x r@ ~(s:t) | f x = (x:s):t     -- keep `x` if `f x`
                  | null s = r        -- current word already empty
                  | otherwise = []:r  -- split

惰性模式允许无限列表作为输入。测试：

Prelude> splitWith (/= ' ') "  This is a    test  "
["This","is","a","test"]
Prelude> splitWith (/= ' ') ""
[]
Prelude> take 8 $ splitWith (/= ' ') (cycle "12   12 ")
["12","12","12","12","12","12","12","12"]

【讨论】：

【解决方案3】：

这是我在做练习时想出的。我所知道的所有 Haskell 都来自这本书，所以我的解决方案不应该包含到目前为止书中没有提到的任何结构：

splitWith pred (x:xs)
    | pred x    = let (first, rest) = span pred (x:xs)
                  in  first : (splitWith pred rest)
    | otherwise = splitWith pred xs
splitWith pred []    = []

【讨论】：

【解决方案4】：

splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' p xs = foldr with [[]] xs
  where
    with a acc@(as:rest) | p a       = (a:as):rest
                         | otherwise = []:acc

【讨论】：

splitWith' (/= ' ') "This is a test "（即'test'两边各有两个空格）==> ["This","is","a","","test","",""]; splitWith' (/= ' ') "" ==> [""]。应该调整逻辑，分别得到["This","is","a","test"] 和[]。
@WillNess 我删除了它仅仅是因为它是一个重复的问题。我现在取消删除了。另外我知道这个答案不能处理极端情况，但我认为它为 OP 提供了足够的提示来修改它以使其工作。赚你的午餐不是比只吃一顿更好吗？

【解决方案5】：

import Data.List (groupBy)
splitWith :: (a -> Bool) -> [a] -> [[a]]
splitWith p = filter (all p) . groupBy ((==) `on` p)

实际上any可以用来代替all，因为它更便宜并且groupBy保证[a]中p持有的元素是聚集在一起的（所以如果any 成立，那么all 也成立）；总而言之，p . head 也可以用来代替all p。

【讨论】：