Replace bigrams based on their frequency in Scala and Spark: answers

【Question title】: Replace bigrams based on their frequency in Scala and Spark
【Posted】: 2015-06-24 18:38:03
【Question】:

I want to replace every bigram whose frequency count is greater than a threshold with the pattern word1.concat("-").concat(word2). This is what I have tried:

import org.apache.spark.{SparkConf, SparkContext}

object replace {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("replace")

    val sc = new SparkContext(conf)
    val rdd = sc.textFile("data/ddd.txt")

    val threshold = 2

    val searchBigram=rdd.map {
      _.split('.').map { substrings =>
        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').

          // Remove non-alphanumeric characters and convert to lowercase
          map {
          _.replaceAll( """\W""", "").toLowerCase()
        }.
          sliding(2)

      }.flatMap {
        identity
      }
        .map {
        _.mkString(" ")
      }
        .groupBy {
        identity
      }
        .mapValues {
        _.size
      }
    }.flatMap {
      identity
    }.reduceByKey(_ + _).collect
      .sortBy(-_._2)
      .takeWhile(_._2 >= threshold)
      .map(x=>x._1.split(' '))
      .map(x=>(x(0), x(1))).toVector


    val sample1 = sc.textFile("data/ddd.txt")
    val sample2 = sample1.map(s=> s.split(" ") // split on space
      .sliding(2)                       // take continuous pairs
      .map{ case Array(a, b) => (a, b) }
      .map(elem => if (searchBigram.contains(elem)) (elem._1.concat("-").concat(elem._2)," ") else elem)
      .map{case (e1,e2) => e1}.mkString(" "))
    sample2.foreach(println)
  }
}

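However, when I run it on a file containing many documents, this code drops the last word of every document and throws some errors.

Suppose my input file contains these documents: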

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring two thousand issue moody audio mortgage backed.

omg left gotta wrap review order asap . understand issue moody hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

buffered lightning two thousand volts cables burned revivification place .

cables volts cables finally able hear auditory issue moody gem long rumored music .

My desired output is:

surprise heard thump opened door small-man clasping package wrapped.

upgrading system found review spring two-thousand issue-moody audio mortgage backed.

omg left gotta wrap review order asap . understand issue-moody hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long small-man .

buffered lightning two-thousand volts-cables burned revivification place .

cables volts-cables finally able hear auditory issue-moody gem long rumored music .

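Can anyone help me?

【Comments】:

"throws some errors when I run it on a file containing many documents". What errors?
scala.MatchError: [Ljava.lang.String;@6803a136 (of class [Ljava.lang.String;) at replace$$anonfun$8$$anonfun$apply$7.apply(replace.scala:74) at replace$$anonfun$8$$anonfun$apply$7.apply(replace.scala:74)
Which line of your code is that? Also, the code seems to (sort of) work for me, in that the top bigrams are replaced. But because of your algorithm, the second word of the pair is still present in the next entry of the sliding(2) pairs, so "volts cables" becomes "volts-cables cables" in the output. So your replacement approach needs to change.
But it shows some errors for me and drops the last word. Can you help me?
I think it is time for you to write and debug this yourself. Your algorithm does not work: when the source contains "a b c" and (a, b) is a bigram you want to replace, you consider a, b (and replace it), then b, c (and do not replace it), so you get "a-b b c". You cannot use sliding(2).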
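
The point made in the last comment can be reproduced without Spark. The sketch below is not from the thread: `slidingReplace` is a reduced stand-in for the question's per-line logic, and `mergeBigrams` is a hypothetical alternative that walks the tokens left to right and consumes both words whenever a frequent bigram matches. Both names, and the sample bigram set, are assumptions made for illustration.

```scala
import scala.annotation.tailrec

object BigramMerge {
  // Hypothetical helper: greedy left-to-right merge. When the current and next
  // token form a frequent bigram, emit "a-b" and skip both tokens.
  def mergeBigrams(tokens: List[String], frequent: Set[(String, String)]): List[String] = {
    @tailrec
    def loop(rest: List[String], acc: List[String]): List[String] = rest match {
      case a :: b :: tail if frequent((a, b)) => loop(tail, s"$a-$b" :: acc)
      case a :: tail                          => loop(tail, a :: acc)
      case Nil                                => acc.reverse
    }
    loop(tokens, Nil)
  }

  // Reduced version of the question's logic: every sliding(2) window contributes
  // only its first element. `collect` is used here so that one-token windows are
  // skipped; the question uses `map`, which throws scala.MatchError on them.
  def slidingReplace(tokens: List[String], frequent: Set[(String, String)]): List[String] =
    tokens.sliding(2).collect {
      case List(a, b) => if (frequent((a, b))) s"$a-$b" else a
    }.toList

  def main(args: Array[String]): Unit = {
    val frequent = Set(("volts", "cables"), ("two", "thousand"))
    val tokens = "lightning two thousand volts cables burned".split(" ").toList

    println(slidingReplace(tokens, frequent).mkString(" "))
    // prints: lightning two-thousand thousand volts-cables cables

    println(mergeBigrams(tokens, frequent).mkString(" "))
    // prints: lightning two-thousand volts-cables burned
  }
}
```

The first line of output shows both symptoms from the question: the second word of each replaced pair is repeated, and the final word is lost because it never appears as the first element of a window. A line with a single token (for example a blank line, which splits to one empty string) produces a one-element window, which is the likely source of the MatchError. Note that the greedy walk merges only the first of two overlapping frequent bigrams.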

Tags: scala text apache-spark

【Solution 1】:

Spoon-feeding:

case class Bigram(first: String, second: String) {
  // String.replace matches literally, so tokens such as "." are not treated as regex metacharacters
  def mkReplacement(s: String) = s.replace(first + " " + second, first + "-" + second)
}

 val data = List(
"surprise heard thump opened door small seedy man clasping package wrapped",
"upgrading system found review spring two thousand issue moody audio mortgage backed",
"omg left gotta wrap review order asap",
"understand issue moody hand delivered dali lama",
"speak hands wear earplugs lives . listen maintain link long",
"buffered lightning two thousand volts cables burned revivification place",
"cables volts cables finally able hear auditory issue moody gem long rumored music")

def stringToBigrams(s: String) = {
    val words = s.split(" ")
    if (words.size >= 2) {
      words.sliding(2).map(a => Bigram(a(0), a(1)))
    } else
      Iterator[Bigram]()
  }

val bigrams = data.flatMap { stringToBigrams }
//use reduceByKey rather than groupBy for Spark
val bigramCounts = bigrams.groupBy(identity).mapValues(_.size)

val threshold = 2
val topBigrams = bigramCounts.collect{case (b, c) if c >= threshold => b}

val replaced = data.map(r => 
      topBigrams.foldLeft(r)((r, b) => b.mkReplacement(r)))

replaced.foreach(println)
//> surprise heard thump opened door small seedy man clasping package wrapped
//| upgrading system found review spring two-thousand issue-moody audio mortgage backed
//| omg left gotta wrap review order asap
//| understand issue-moody hand delivered dali lama
//| speak hands wear earplugs lives . listen maintain link long
//| buffered lightning two-thousand volts-cables burned revivification place
//| cables volts-cables finally able hear auditory issue-moody gem long rumored music
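
The comment in the code above suggests reduceByKey for Spark. A minimal sketch of that adaptation follows, assuming Spark is on the classpath and that the list of frequent bigrams is small enough to collect on the driver and broadcast. The names `BigramReplaceSpark`, `bigramsOf` and `replaceFrequent` are hypothetical and do not come from the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BigramReplaceSpark {
  // Adjacent word pairs of one line; windows with fewer than two tokens are skipped.
  def bigramsOf(line: String): Iterator[(String, String)] =
    line.split(" ").sliding(2).collect { case Array(a, b) => (a, b) }

  def replaceFrequent(lines: RDD[String], threshold: Int): RDD[String] = {
    // Count bigrams across all lines and keep those at or above the threshold.
    val frequent: Array[(String, String)] = lines
      .flatMap(bigramsOf)
      .map(b => (b, 1))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= threshold }
      .keys
      .collect()

    // Assumption: the frequent set fits in memory, so it is shipped once per executor.
    val shared = lines.sparkContext.broadcast(frequent)

    // Apply every frequent bigram to each line as a literal string replacement.
    lines.map { line =>
      shared.value.foldLeft(line) { case (s, (a, b)) => s.replace(s"$a $b", s"$a-$b") }
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("replace"))
    try {
      val lines = sc.parallelize(Seq(
        "buffered lightning two thousand volts cables burned revivification place",
        "cables volts cables finally able hear auditory issue moody gem long rumored music"))
      replaceFrequent(lines, threshold = 2).collect().foreach(println)
    } finally sc.stop()
  }
}
```

As in the answer, replacement works on whole lines rather than on sliding pairs, so no word is duplicated or dropped.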

【Comments】:

【Solution 2】:

from operator import add

import LocalSparkContext  # the answerer's own helper module; its source is not shown in the answer


def getNgrams(sentence):
    # Return every pair of adjacent words in the sentence
    out = []
    sen = sentence.split(" ")
    for k in range(len(sen)-1):
        out.append((sen[k],sen[k+1]))
    return out

if __name__ == '__main__':

    lsc = LocalSparkContext.LocalSparkContext("Recommendation","spark://BigData:7077")
    sc = lsc.getBaseContext()
    try:
        ssc = lsc.getSQLContext()
        inFile = "bigramstxt.txt"
        sen = sc.textFile(inFile,1)
        v = 1
        brv = sc.broadcast(v)
        # Count each bigram and keep those seen more than v times
        wordgroups = sen.flatMap(getNgrams).map(lambda t: (t,1)).reduceByKey(add).filter(lambda t: t[1]>brv.value)
        bigrams = wordgroups.collect()
    finally:
        # Stop the context even if the job fails
        sc.stop()

    inp = open(inFile,'r').read()
    print(inp)
    for b in bigrams:
        print(b)
        inp = inp.replace(" ".join(b[0]),"-".join(b[0]))

    print(inp)

【Comments】:

Would be a +1, but the -1 for using Python rather than the OP's language cancels it out.