How to read input from a file and convert the data lines of the file to List[Map[Int,String]] using Scala?

【Question title】: How to read input from a file and convert data lines of the file to List[Map[Int,String]] using scala?
【Posted】: 2015-01-12 16:39:31
【Question】:

I want to read input from a file and convert the data lines of the file to a List[Map[Int,String]] using Scala. Below, a hard-coded data set is used as input. My code is:

  def id3(attrs: Attributes,
          examples: List[Example],
          label: Symbol): Node = {
    level = level + 1

    // if all the examples have the same label, return a new node with that label
    if (examples.forall(x => x(label) == examples(0)(label))) {
      new Leaf(examples(0)(label))
    } else {
      for (a <- attrs.keySet - label) { // except label, take all attrs
        println("Information gain for %s is %f".format(a,
          informationGain(a, attrs, examples, label)))
      }

      // find the best splitting attribute - this is an argmax on a function over the list
      var bestAttr: Symbol = argmax(attrs.keySet - label, (x: Symbol) =>
        informationGain(x, attrs, examples, label))

      // now we produce a new branch, which splits on that node, and recurse down the nodes.
      var branch = new Branch(bestAttr)

      for (v <- attrs(bestAttr)) {
        val subset = examples.filter(x => x(bestAttr) == v)

        if (subset.size == 0) {
          // println(levstr+"Tiny subset!")
          // zero subset, we replace with a leaf labelled with the most common label in
          // the examples
          val m = examples.map(_(label))
          val mostCommonLabel = m.toSet.map((x: Symbol) => (x, m.count(_ == x))).maxBy(_._2)._1
          branch.add(v, new Leaf(mostCommonLabel))
        } else {
          // println(levstr+"Branch on %s=%s!".format(bestAttr,v))
          branch.add(v, id3(attrs, subset, label))
        }
      }
      level = level - 1
      branch
    }
  }
} // closes the enclosing object sample (its opening is not shown)
object samplet {
  def main(args: Array[String]): Unit = {
    var attrs: sample.Attributes = Map()
    attrs += ('0 -> Set('abc, 'nbv, 'zxc))
    attrs += ('1 -> Set('def, 'ftr, 'tyh))
    attrs += ('2 -> Set('ghi, 'azxc))
    attrs += ('3 -> Set('jkl, 'fds))
    attrs += ('4 -> Set('mno, 'nbh))

    val examples: List[sample.Example] = List(
      Map(
        '0 -> 'abc,
        '1 -> 'def,
        '2 -> 'ghi,
        '3 -> 'jkl,
        '4 -> 'mno
      ),
      ........................
    )

    // obviously we can't use the label as an attribute, that would be silly!
    val label = 'play

    println(sample.id3(attrs, examples, label).getStr(0))
  }
}

But how do I change this code so that it accepts its input from a .csv file?

【Comments】:

Tags: scala csv map reduce rdd

【Solution 1】:

I suggest using Java's io/nio standard library to read your CSV file. I see no relevant drawback in doing so.

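But the first question we need to answer is where in the code the file should be read. The parsed input apparently replaces the value of examples. That fact also tells us what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let's declare a new class: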

import java.nio.charset.Charset
import java.nio.file.Path

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}

Note that the Charset is only needed if we have to distinguish between CSV files with different encodings.

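OK, so how do we implement this method? It should do the following:

1. Create a suitable input reader
2. Read all lines
3. Split each line at the comma separator
4. Convert each substring into the symbol it represents
5. Build a map from the list of symbols, using the attributes as keys
6. Create and return the list of maps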

Or, expressed in code:

import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
import scala.collection.JavaConversions

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file and discard the first line; close the reader afterwards */
    try inputWithHeader(reader).tail
    finally reader.close()
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. There are many ways to do this and this is probably not the prettiest. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] = {
    (JavaConversions.asScalaIterator(reader.lines().iterator()) foldLeft List.empty[Map[Symbol, Symbol]]) {
      (accumulator, nextLine) =>
        parseLine(nextLine) :: accumulator
    }.reverse
  }

  /** Parse an entry. Does not verify the input: if there are fewer attributes than columns or vice versa, zip creates a list of the size of the shorter list */
  private def parseLine(line: String): Map[Symbol, Symbol] = (Attributes zip (line split Separator map parseSymbol)).toMap

  /** Create a symbol from a String... we could also check whether the string represents a valid symbol */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}

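Warning: this expects valid input only, in which we can be sure that no symbol representation contains the comma separator character. If that cannot be assumed, the code will fail to split some valid input strings correctly.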
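If values may contain commas, one common convention is to wrap such values in double quotes. The following is only a minimal sketch under that assumption: the helper name splitCsvLine is hypothetical and not part of the answer's code, and it assumes that a doubled quote inside a quoted field stands for a literal quote. For real-world files a dedicated CSV library is probably the safer choice.

```scala
object CsvLineSplitter {
  /** Splits one CSV line on the separator, treating separators inside
    * double quotes as data. Hypothetical helper for illustration only. */
  def splitCsvLine(line: String, separator: Char = ','): List[String] = {
    val fields = scala.collection.mutable.ListBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (c == '"') {
        if (inQuotes && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current.append('"') // doubled quote inside a quoted field
          i += 1
        } else {
          inQuotes = !inQuotes // opening or closing quote
        }
      } else if (c == separator && !inQuotes) {
        fields += current.toString // field ends at an unquoted separator
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString // last field has no trailing separator
    fields.toList
  }
}
```

With this helper, parseLine could call CsvLineSplitter.splitCsvLine(line) instead of line split Separator.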

To use the InputFromCsvLoader class, we can change the main method as follows:

import java.nio.file.{Path, Paths}

def main(args: Array[String]): Unit = {
  val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
  val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code
}

Here, examples takes the value exampleInput when no input argument is specified; exampleInput is the current hard-coded value of examples.

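Important: for convenience, all error handling has been omitted from the code. Errors can occur in most cases when reading from a file, and user input must be treated as potentially invalid, so sadly, error handling at the boundaries of a program is usually not optional.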
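As a rough illustration of what such boundary error handling might look like, the sketch below wraps the file access in scala.util.Try and converts the outcome into an Either carrying a readable message. The names SafeCsvReading, readLines and loadDataLines are hypothetical, and UTF-8 is an assumed encoding.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.util.{Failure, Success, Try}

object SafeCsvReading {
  /** Reads all lines of a file, capturing I/O failures in a Try instead of throwing. */
  def readLines(file: Path): Try[List[String]] = Try {
    val reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)
    // readLine() is a Java API that signals end of input with null
    try Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    finally reader.close()
  }

  /** Returns the data lines (header dropped) or a message describing what went wrong. */
  def loadDataLines(file: Path): Either[String, List[String]] =
    readLines(file) match {
      case Success(lines) if lines.isEmpty => Left(s"File $file is empty")
      case Success(lines)                  => Right(lines.drop(1))
      case Failure(e)                      => Left(s"Could not read $file: ${e.getMessage}")
    }
}
```

The caller can then decide whether to fall back to the hard-coded examples or to report the message to the user.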

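Side notes:

- Try not to use null in your code. Returning Option[T] is a better choice than returning null, because it makes "nullness" explicit and provides static safety through the type system.
- The return keyword is not needed in Scala, because the value of the last expression in a method is always returned. You can still use the keyword if you find the code more readable or if you want to break out in the middle of a method (which is usually a bad idea).
- Prefer val over var, because immutable values are easier to reason about than mutable ones.
- The code will fail with the CSV string you provided, because it contains the symbols TRUE and FALSE, which are not legal according to your program logic (they should be true and false).
- Put all the information into your error messages. Your error message only tells me that the value of attribute 'wind is bad, but it does not tell me what the actual value is.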
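Two of these notes, returning Option instead of null and putting the actual value into the error message, can be sketched together. Everything here is illustrative: AttributeValidation, lookup, validateWind and the allowed set are assumptions, and rows are modelled as Map[String, String] rather than the asker's Symbol-based maps to keep the sketch short.

```scala
object AttributeValidation {
  // Hypothetical set of legal values for the wind attribute
  val allowedWind: Set[String] = Set("true", "false")

  /** Returns the value for `key`, or None when the row lacks it (never null). */
  def lookup(row: Map[String, String], key: String): Option[String] = row.get(key)

  /** Validates the wind value; on failure the message names the offending value. */
  def validateWind(row: Map[String, String]): Either[String, String] =
    lookup(row, "wind") match {
      case None =>
        Left("Attribute 'wind' is missing from row " + row)
      case Some(v) if allowedWind.contains(v) =>
        Right(v)
      case Some(v) =>
        Left(s"Illegal value '$v' for attribute 'wind'; expected one of " + allowedWind.mkString(", "))
    }
}
```

A value such as TRUE then produces a message that shows exactly which input was rejected.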

【Comments】:

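I get errors: in val reader = Files.newBufferedReader(file, charset) => cannot resolve symbol newBufferedReader; in val csvInputFile: Option[Path] = args.headOption map (p => Paths get p) => cannot resolve symbol Paths; and in (JavaConversions.asScalaIterator(reader.lines().iterator()) => cannot resolve symbol lines().
@rosy Sorry, I did not see the notification for your comments. The code I wrote requires Java 8 (reader.lines()) and Java 7 (Files.newBufferedReader() and Paths.get(), as well as the type Path); the errors are most likely caused by your using an older version of Java. However, you can create a BufferedReader from a java.io.File to read the file. There are many ways to read a file, and the most common ones can certainly be found on StackOverflow. My point is simply that there is no drawback to using the I/O facilities provided by the Java standard library.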
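For readers stuck on an older Java version, one possible way to follow that suggestion is sketched below. It uses only java.io classes and avoids Files, Paths and reader.lines(). The names LegacyLineReader and readAllLines are hypothetical, and the default charset name is an assumption.

```scala
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}

object LegacyLineReader {
  /** Reads all lines of a file using only java.io classes. */
  def readAllLines(file: File, charsetName: String = "UTF-8"): List[String] = {
    val reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(file), charsetName))
    try {
      // readLine() returns null at end of input, so stop there
      Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    } finally reader.close()
  }
}
```

The resulting list of lines can be passed through the same per-line parsing as in the answer.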

【Solution 2】:

To read a csv file:

import scala.io.Source
val datalines = Source.fromFile(filepath).getLines()

So datalines is an iterator over all the lines of the csv file.

Next, convert each line into a Map[Int,String]:

val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}

Here we split each line on ",". Then we build a map whose keys are the column indices and whose values are the words obtained from the split.

Finally, if we want a List[Map[Int,String]]:

val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}.toList
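Putting the pieces together, a self-contained version might look like the sketch below. The names CsvToMaps, lineToMap and readCsv are hypothetical, and UTF-8 is an assumed encoding. Two details differ from the snippets above: the Source is closed after reading, and split is called with a limit of -1 so that trailing empty columns are kept rather than silently dropped.

```scala
import scala.io.Source

object CsvToMaps {
  /** Converts one line into a map from column index to cell text. */
  def lineToMap(line: String): Map[Int, String] =
    line.split(",", -1).zipWithIndex.map { case (word, idx) => idx -> word }.toMap

  /** Reads the whole file into a List[Map[Int, String]] and closes it afterwards. */
  def readCsv(filepath: String): List[Map[Int, String]] = {
    val source = Source.fromFile(filepath, "UTF-8")
    // toList forces all lines to be read before the source is closed
    try source.getLines().map(lineToMap).toList
    finally source.close()
  }
}
```

Calling CsvToMaps.readCsv(filepath) then yields one map per line, including the header line if the file has one.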

【Comments】: