[Posted]: 2015-06-24 18:38:03
[Question]:
I want to replace every bigram whose frequency count exceeds a threshold with this pattern (word1.concat("-").concat(word2)). This is what I have tried:
import org.apache.spark.{SparkConf, SparkContext}

object replace {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("replace")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("data/ddd.txt")
    val threshold = 2
    val searchBigram = rdd.map {
      _.split('.').map { substrings =>
        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').
          // Remove non-alphanumeric characters and convert to lowercase
          map {
            _.replaceAll("""\W""", "").toLowerCase()
          }.
          sliding(2)
      }.flatMap {
        identity
      }
        .map {
          _.mkString(" ")
        }
        .groupBy {
          identity
        }
        .mapValues {
          _.size
        }
    }.flatMap {
      identity
    }.reduceByKey(_ + _).collect
      .sortBy(-_._2)
      .takeWhile(_._2 >= threshold)
      .map(x => x._1.split(' '))
      .map(x => (x(0), x(1))).toVector
    val sample1 = sc.textFile("data/ddd.txt")
    val sample2 = sample1.map(s => s.split(" ") // split on space
      .sliding(2) // take continuous pairs
      .map { case Array(a, b) => (a, b) }
      .map(elem => if (searchBigram.contains(elem)) (elem._1.concat("-").concat(elem._2), " ") else elem)
      .map { case (e1, e2) => e1 }.mkString(" "))
    sample2.foreach(println)
  }
}
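But when I run it on a file containing a large number of documents, this code drops the last word of each document and shows some errors.
Suppose my input file contains these documents: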
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring two thousand issue moody audio mortgage backed.
omg left gotta wrap review order asap . understand issue moody hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
buffered lightning two thousand volts cables burned revivification place .
cables volts cables finally able hear auditory issue moody gem long rumored music .
My desired output is:
surprise heard thump opened door small-man clasping package wrapped.
upgrading system found review spring two-thousand issue-moody audio mortgage backed.
omg left gotta wrap review order asap . understand issue-moody hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long small-man .
buffered lightning two-thousand volts-cables burned revivification place .
cables volts-cables finally able hear auditory issue-moody gem long rumored music .
Can anyone help me?
[Comments]:
-
"Shows some errors when I run it on a file containing a large number of documents." Which errors?
-
scala.MatchError: [Ljava.lang.String;@6803a136 (of class [Ljava.lang.String;) at replace$$anonfun$8$$anonfun$apply$7.apply(replace.scala:74) at replace$$anonfun$8$$anonfun$apply$7.apply(replace.scala:74)
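A MatchError on a String array is consistent with the `case Array(a, b)` pattern in the question receiving a window that does not have exactly two elements. `sliding(2)` yields a single window of length 1 when a line splits into only one token, which includes blank lines, since `"".split(" ")` returns an array holding one empty string. A minimal sketch in plain Scala (no Spark); the sample strings and the helper name `safePairs` are made up for illustration:

```scala
object SlidingMatchErrorDemo {
  // Hypothetical helper: pair adjacent tokens and skip any window that is
  // not exactly two elements long, instead of failing on it.
  def safePairs(line: String): List[(String, String)] =
    line.split(" ").sliding(2).collect { case Array(a, b) => (a, b) }.toList

  def main(args: Array[String]): Unit = {
    // A one-token line produces one window, and that window has length 1.
    val windows = "lonely".split(" ").sliding(2).toList
    println(windows.map(_.length)) // List(1)

    // `map { case Array(a, b) => ... }` would throw scala.MatchError on that
    // window; `collect` simply drops it.
    println(safePairs("lonely"))
    println(safePairs("two thousand volts"))
  }
}
```

Using `collect` only avoids the exception. It does not address the replacement logic itself.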
-
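Which line of your code? Also, the code seems to (sort of) work for me, in that the top bigrams are replaced. But because of your algorithm, the second word of the pair is still present in the next entry of the sliding(2) pairs, so "volts cables" becomes "volts-cables cables" in the output. So your replacement approach needs to change.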
-
But it shows me some errors and drops the last words. Can you help me?
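The dropped last word follows from the final step of the question's code: each adjacent pair is reduced to its first element, so the second element of the last pair is never emitted. A small sketch in plain Scala that mirrors those steps; the function name `firstOfEachPair` is hypothetical:

```scala
object DroppedLastWordDemo {
  // Mirrors the tail of the question's pipeline: build adjacent pairs,
  // then keep only the first element of every pair.
  def firstOfEachPair(line: String): String =
    line.split(" ").sliding(2)
      .collect { case Array(a, b) => (a, b) }
      .map { case (e1, _) => e1 }
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    // Pairs are (a,b), (b,c), (c,d); taking the firsts gives a, b, c.
    println(firstOfEachPair("a b c d")) // a b c
  }
}
```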
-
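I think it is time to write and debug this yourself. Your algorithm does not work: when the source contains "a b c" and (a, b) is a bigram you want to replace, you consider a, b (and replace it), then b, c (and do not replace it), so you get "a-b b c". You cannot use sliding(2).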
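One way to avoid the overlapping windows described here is to scan the tokens from left to right and consume both words whenever a frequent bigram matches. This is a sketch of that idea and not the thread's accepted solution. The object name, the function `replaceBigrams`, and the hard-coded `frequent` set are assumptions for illustration:

```scala
object GreedyBigramReplace {
  // Walk the tokens left to right. When the current and next token form a
  // frequent bigram, emit "a-b" and skip both; otherwise emit the token as is.
  def replaceBigrams(tokens: List[String], frequent: Set[(String, String)]): List[String] = {
    @annotation.tailrec
    def loop(rest: List[String], acc: List[String]): List[String] = rest match {
      case a :: b :: tail if frequent.contains((a, b)) => loop(tail, s"$a-$b" :: acc)
      case a :: tail                                   => loop(tail, a :: acc)
      case Nil                                         => acc.reverse
    }
    loop(tokens, Nil)
  }

  def main(args: Array[String]): Unit = {
    // Assumed set of frequent bigrams, standing in for the computed one.
    val frequent = Set(("two", "thousand"), ("volts", "cables"))
    val line = "buffered lightning two thousand volts cables burned revivification place ."
    println(replaceBigrams(line.split(" ").toList, frequent).mkString(" "))
    // buffered lightning two-thousand volts-cables burned revivification place .
  }
}
```

In the question's Spark job this could presumably be applied per line, for example `sample1.map(s => replaceBigrams(s.split(" ").toList, searchBigram.toSet).mkString(" "))`. The match is exact, so the tokens of each line would need the same normalization (lowercasing, stripping non-word characters) that was used when the bigrams were counted.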
Tags: scala text apache-spark