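You can traverse the collection with foldLeft, using a tuple of a Map and a String as the accumulator. The Map holds a count per word and the String remembers the previous word, so a count only grows while a word repeats consecutively. A collect then keeps the words whose count is greater than 1, as shown below: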
def foo(in: Iterator[String]): Iterator[String] =
  // Accumulator: (count per word, previous word)
  in.foldLeft((Map.empty[String, Int], "")){ case ((m, prev), word) =>
    // Extend the count only when the word repeats the previous one; otherwise restart at 1
    val count = if (word == prev) m.getOrElse(word, 0) + 1 else 1
    (m + (word -> count), word)
  }._1.
    collect{ case (word, count) if count > 1 => word }.
    iterator
foo(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd")).toList
// res1: List[String] = List("aaa", "cc")
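One thing to be aware of with foo: the Map is keyed by the word alone, so a later isolated occurrence of a word resets its count to 1 and an earlier repeat is lost (for example, "aaa", "aaa", "bb", "aaa" yields nothing). If only the set of consecutively repeated words is needed, a minimal sketch that avoids this is to accumulate a Set of repeated words instead of counts. The name repeatedWords is hypothetical, and the sketch assumes Scala 2.11 or later for Option.contains:

```scala
// Hypothetical variant of foo: collects every word that appears twice in a row at least once.
def repeatedWords(in: Iterator[String]): Set[String] =
  in.foldLeft((Set.empty[String], Option.empty[String])) {
    case ((dups, prev), word) =>
      // prev.contains(word) is true only when the previous word equals the current one
      val dups1 = if (prev.contains(word)) dups + word else dups
      (dups1, Some(word))
  }._1

val demoWords = repeatedWords(Iterator("aaa", "aaa", "bb", "aaa"))
// Set("aaa")
```

Because the accumulator never removes a word once it is added, a later single occurrence cannot undo an earlier repeat. The trade-off is that the result is a Set, so the order of first appearance is not preserved.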
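To also capture the repeat count and the index at which each run starts, zip the collection with its index and apply the same conditional-count strategy: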
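def bar(in: Iterator[String]): Map[(String, Int), Int] =
  // Accumulator: (count per (word, run start index), previous word, previous run start index)
  in.zipWithIndex.foldLeft((Map.empty[(String, Int), Int], "", 0)){
    case ((m, pWord, pIdx), (word, idx)) =>
      // A repeated word keeps the start index of its run; a new word starts a run at idx
      val idx1 = if (word == pWord) idx min pIdx else idx
      val count = if (word == pWord) m.getOrElse((word, idx1), 0) + 1 else 1
      (m + ((word, idx1) -> count), word, idx1)
  }._1.
    filter{ case ((_, _), count) => count > 1 }  // keep only runs longer than 1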
bar(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "cc", "cc", "cc"))
// res2: Map[(String, Int), Int] = Map(("cc", 7) -> 3, ("cc", 3) -> 2, ("aaa", 0) -> 2)
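The same idea can also be read as run-length encoding, which may make the role of the start index easier to see. The sketch below is an alternative formulation rather than the method above: the names runs and repeatedRuns are hypothetical, and it returns a List in input order instead of a Map.

```scala
// Hypothetical helper: collapse consecutive equal words into (word, startIndex, length) runs.
def runs(in: Iterator[String]): List[(String, Int, Int)] =
  in.zipWithIndex.foldLeft(List.empty[(String, Int, Int)]) {
    // Same word as the most recent run: extend that run by one
    case ((w, start, n) :: rest, (word, _)) if w == word => (w, start, n + 1) :: rest
    // Otherwise start a new run at the current index
    case (acc, (word, idx)) => (word, idx, 1) :: acc
  }.reverse

// Keep only the runs longer than 1
def repeatedRuns(in: Iterator[String]): List[(String, Int, Int)] =
  runs(in).filter { case (_, _, n) => n > 1 }

val demoRuns = repeatedRuns(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "cc", "cc", "cc"))
// List(("aaa", 0, 2), ("cc", 3, 2), ("cc", 7, 3))
```

Note that this version keeps every run, including runs of length 1, until the final filter, so it trades memory for readability.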
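Update:
Given the revised requirement to minimize memory usage, one approach is to keep the Map as small as possible by removing entries with a count of 1 on the fly during the foldLeft traversal. If few words are repeated, those entries would otherwise make up most of the Map. The method baz below is a modified version of bar: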
def baz(in: Iterator[String]): Map[(String, Int), Int] =
  // The trailing "" is a sentinel that flushes the last real word.
  // Accumulator: (map, previous entry ((word, run start), count), previous run start index)
  (in ++ Iterator("")).zipWithIndex.
    foldLeft((Map.empty[(String, Int), Int], (("", 0), 0), 0)){
      case ((m, pElem, pIdx), (word, idx)) =>
        val sameWord = word == pElem._1._1
        val idx1 = if (sameWord) idx min pIdx else idx
        val count = if (sameWord) m.getOrElse((word, idx1), 0) + 1 else 1
        val elem = ((word, idx1), count)
        val newMap = m + ((word, idx1) -> count)
        // When a run ends with a count of 1, drop its entry so only repeated runs remain
        if (!sameWord && pElem._2 == 1)
          (newMap - pElem._1, elem, idx1)
        else
          (newMap, elem, idx1)
    }._1.
    filter{ case ((word, _), _) => word != "" }  // remove the sentinel entry
baz(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "cc", "cc", "cc"))
// res3: Map[(String, Int), Int] = Map(("aaa", 0) -> 2, ("cc", 3) -> 2, ("cc", 7) -> 3)
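Note that the dummy empty string appended to the input ensures that the last word is also handled correctly: an entry is only pruned when the next word arrives, so without the sentinel a final word that occurs once would stay in the Map. Because the final filter drops every entry keyed by "", genuine empty strings in the input are not reported either.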
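If the result does not have to be a Map, another way to keep memory low is to emit each repeated run lazily as soon as it ends. That holds only the current run in memory and needs no sentinel. The following is a sketch of that alternative, not a modification of baz; the name lazyRepeats is hypothetical, and it relies on the standard BufferedIterator.head to peek at the next word:

```scala
// Hypothetical lazy variant: yields ((word, runStartIndex), runLength) for runs longer than 1.
def lazyRepeats(in: Iterator[String]): Iterator[((String, Int), Int)] = {
  val it = in.zipWithIndex.buffered
  new Iterator[((String, Int), Int)] {
    // The next repeated run to emit, if one has already been found
    private var pending: Option[((String, Int), Int)] = None

    // Consume whole runs until a repeated one is found or the input ends
    private def advance(): Unit =
      while (pending.isEmpty && it.hasNext) {
        val (word, start) = it.next()
        var n = 1
        while (it.hasNext && it.head._1 == word) { it.next(); n += 1 }
        if (n > 1) pending = Some(((word, start), n))
      }

    def hasNext: Boolean = { advance(); pending.nonEmpty }

    def next(): ((String, Int), Int) = {
      advance()
      val run = pending.getOrElse(throw new NoSuchElementException("no more repeated runs"))
      pending = None
      run
    }
  }
}

val demoLazy = lazyRepeats(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "cc", "cc", "cc")).toList
// List((("aaa", 0), 2), (("cc", 3), 2), (("cc", 7), 3))
```

Calling toMap on the result gives the same shape as baz, at the cost of materializing the repeated runs.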