Scala MapReduce Framework giving Type Mismatch

【Question title】: Scala MapReduce Framework giving Type Mismatch
【Posted】: 2014-11-05 21:00:08
【Question】:

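I have a MapReduce framework in Scala that is built on several org.apache.hadoop libraries. It works for a simple word count program. However, I want to apply it to something useful and have hit a roadblock. I want to take a CSV file (or a file with any other delimiter), pass whatever is in the first column as the key, and then count the occurrences of each key.

The mapper code looks like this: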

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
  line.split(",", -1)(0) foreach (context.write(_,1))  //Splits data
  }
}

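The problem is in the "line.split" code. When I try to compile it, I get this error:

found: Char, required: org.apache.hadoop.io.Text

line.split... should return a String, which is passed to the _ in write(_,1), but for some reason the compiler thinks it is a Char. I even added .toString to make it a String explicitly, but that did not work either.

Any ideas are appreciated. Let me know what other details I can provide.

Update:

Here is the list of imports: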

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Reducer, Job, Mapper}
import org.apache.hadoop.conf.{Configured}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConversions._
import org.apache.hadoop.util.{ToolRunner, Tool}

Here is the build.sbt:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

organization := "scala"

name := "WordCount"

version := "1.0"

scalaVersion:= "2.11.2"

scalacOptions ++= Seq("-no-specialization", "-deprecation")

libraryDependencies ++= Seq("org.apache.hadoop" % "hadoop-client" % "1.2.1",
                        "org.apache.hadoop" % "hadoop-core" % "latest.integration" exclude ("hadoop-core", "org/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol.class") ,
                        "org.apache.hadoop" % "hadoop-common" % "2.5.1",
                        "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.1",
                        "commons-configuration" % "commons-configuration" % "1.9",
                        "org.apache.hadoop" % "hadoop-hdfs" % "latest.integration")


 jarName in assembly := "WordCount.jar"

 mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {case s if s.endsWith(".class") => MergeStrategy.last
case s if s.endsWith(".xsd") => MergeStrategy.last
case s if s.endsWith(".dtd") => MergeStrategy.last
case s if s.endsWith(".xml") => MergeStrategy.last
case s if s.endsWith(".properties") => MergeStrategy.last
case x => old(x)
  }
}

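【Comments】:

Could you provide your imports and build.sbt or dependency list so that I can try to compile it?
line is a Hadoop Writable Text; you need to call toString on it to get a Java String, which supports split. You should tell us what error you get when making this call.
@ThomasJungblut, do you mean using "line.split(",",-1)(0).toString"? That produces the same error as above.
@EricZoerner I added the information you asked for. Let me know if you need anything else.
@JCalbreath No, I added an answer. line needs to be a String.

Tags: java scala hadoop mapreduce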

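【Solution 1】:

My guess is that line is being implicitly converted to a String here (thanks to HImplicits?). Then we have

line.split(",", -1)(0) foreach somethingOrOther

split the string into several strings - .split(...)
take the zeroth of those strings - (0)
then iterate somethingOrOther over the characters of that string - foreach

And that is how you end up with your Char.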
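The three steps in this answer can be reproduced in plain Scala without Hadoop. The sketch below is only an illustration: the sample row and the names SplitDemo, firstColumn, and charsVisited are made up for this example, and the input is assumed to already be a String.

```scala
import scala.collection.mutable.ListBuffer

object SplitDemo {
  // Steps 1 and 2: split the row and take the zeroth field.
  def firstColumn(line: String): String = line.split(",", -1)(0)

  // Step 3: foreach on a String visits its characters one by one,
  // so the function it is given receives a Char each time.
  def charsVisited(s: String): List[Char] = {
    val seen = ListBuffer.empty[Char]
    s.foreach((c: Char) => seen += c)
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    val key = firstColumn("alice,30,NYC") // hypothetical CSV row
    println(key)
    println(charsVisited(key))
  }
}
```

Handing context.write(_, 1) to that foreach asks write to accept a Char as its key, which is consistent with the compiler message in the question.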

【Discussion】:

【Solution 2】:

I actually solved this by not using the _ notation and instead specifying the value directly in context.write. So instead of:

line.split(",", -1)(0) foreach (context.write(_,1))

I used:

context.write(line.split(",", -1)(0), 1)

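I found an article online saying that Scala sometimes gets the data types confused when _ is used, and it recommended defining the value explicitly where appropriate. I am not sure whether that is true, but it solved the problem in this case.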
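For readers who do not have the HImplicits trait (its contents are not shown in the question), the same fix can be written with explicit conversions. This is a sketch rather than the poster's actual code: it assumes the Hadoop org.apache.hadoop.mapreduce API is on the classpath, and the names FirstColumnMapper, outKey, and one are invented for illustration.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Hypothetical mapper: emits (first CSV column, 1) for every input line.
class FirstColumnMapper extends Mapper[LongWritable, Text, Text, LongWritable] {
  // Reused Writable instances, so no new objects are allocated per record.
  private val outKey = new Text()
  private val one = new LongWritable(1L)

  override protected def map(
      offset: LongWritable,
      line: Text,
      context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    // Text has no split method, so convert to a String first,
    // then take the zeroth field. No foreach is involved.
    outKey.set(line.toString.split(",", -1)(0))
    context.write(outKey, one)
  }
}
```

Spelling out toString and the Writable wrappers makes each type visible at the call site, at the cost of a few more lines than the implicit version.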

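【Discussion】:

It has nothing to do with _; it is because you are no longer calling foreach unnecessarily.
That makes sense. I assumed that because only a single string (item 0) was being passed, it would not matter and it would iterate just once over that one item. But I guess it was iterating over each character in that string.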
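The point made in these comments can be checked in plain Scala. In the sketch below, write is a hypothetical stand-in for context.write that accepts only a String key, and UnderscoreDemo, written, and emitKey are names invented for the example. The _ placeholder compiles fine when foreach runs over a collection of Strings; it only fails when foreach runs over the String itself.

```scala
import scala.collection.mutable.ListBuffer

object UnderscoreDemo {
  // Records every (key, count) pair passed to write.
  val written = ListBuffer.empty[(String, Long)]

  // Hypothetical stand-in for context.write: the key must be a String.
  def write(key: String, count: Long): Unit = {
    written += ((key, count))
  }

  def emitKey(line: String): Unit = {
    val key = line.split(",", -1)(0)

    // Placeholder syntax over a one-element Seq[String]: elements are Strings.
    Seq(key).foreach(write(_, 1L))

    // Direct call, as in Solution 2.
    write(key, 1L)

    // key.foreach(write(_, 1L))
    // The line above does not compile: the elements of a String are Chars,
    // so the placeholder would be a Char where a String is required.
  }
}
```

Both working calls record the same pair, so the direct call in Solution 2 fixed the error by removing foreach, not by removing _.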