【Question title】: Calculate the frequency of each word in the file
【Posted】: 2020-11-15 07:31:33
【Question description】:

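I am trying to build an algorithm in Scala that computes the frequency of each word. Using the function below, I created two maps from two different files. Now I want to divide the number of occurrences of each word in t1.txt by the number of occurrences of the same word in gen-voc.txt to get its frequency, so I need a suitable algorithm for that.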

import scala.io.Source
import scala.collection.mutable
import scala.collection.immutable.ListMap
object Project1 extends App {

  def buildRepresentation(content: String): mutable.Map[String, Int] = {
    val vector = mutable.Map.empty[String, Int]
    // use sequences of <space> , ! . to split the string
    val arrayOfWords = content.split("[ ,!.]+")
    for (rawWord <- arrayOfWords) {
      val word = rawWord.toLowerCase
      vector(word) = vector.getOrElse(word, 0) + 1
    }
    vector
  }

  // Import the t1.txt and gen-voc.txt data files:
  val data_t1 = "t1.txt"
  val data_voc = "gen-voc.txt"

  // Get all of the lines from each file as one String.
  // Lines are joined with a space so that words at line boundaries are not glued together.
  val t1 = Source.fromFile(data_t1).getLines().mkString(" ")
  val gen_voc = Source.fromFile(data_voc).getLines().mkString(" ")
}

【Question comments】:

Tags: java algorithm scala apache-spark


【Solution 1】:

You can use this implementation of buildRepresentation in Scala (written in a more functional style):

import scala.io.Source

object Project1 extends App {

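  // use sequences of <space> , ! . to split the string
  def buildRepresentation(content: String): Map[String, Int] =
    content
      .split("[ ,!.]+")
      .filter(_.nonEmpty) // drop the empty token produced when the text starts with a delimiter
      .map(_.toLowerCase)
      .groupBy(identity)
      // map instead of mapValues, so the result is a strict Map on Scala 2.13+ as well
      .map { case (word, occurrences) => word -> occurrences.length }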

  // Import the t1.txt and gen-voc.txt data files:
  val data_t1 = "t1.txt"
  val data_voc = "gen-voc.txt"

  // Get all of the lines from each file as one String.
  // Lines are joined with a space so that words at line boundaries are not glued together.
  val t1 = Source.fromFile(data_t1).getLines().mkString(" ")
  val gen_voc = Source.fromFile(data_voc).getLines().mkString(" ")

  val t1repr: Map[String, Int] = buildRepresentation(t1)
  val genVocRepr: Map[String, Int] = buildRepresentation(gen_voc)

  // frequency for keys present in both items and reference
  def frequency[A](items: Map[A, Int], reference: Map[A, Int]): Map[A, Double] =
    items.keySet.intersect(reference.keySet)
      .map(k => k -> items(k).toDouble / reference(k).toDouble)
      .toMap

  val frequencyOfT1RelativeToGenVoc: Map[String, Double] = frequency(t1repr, genVocRepr)
}
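
To see what the ratio looks like without needing the two files, here is a small sketch that runs the same idea on in-memory strings. The object name `FrequencyDemo`, the helper name `wordCounts`, and the sample sentences are made up for illustration; they are not part of the question's data. Words that do not appear in the reference text are simply left out of the result.

```scala
object FrequencyDemo {
  // Same tokenization rule as above: split on runs of <space> , ! .
  def wordCounts(content: String): Map[String, Int] =
    content
      .split("[ ,!.]+")
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }

  // Ratio of counts, kept only for words that occur in both maps.
  def frequency[A](items: Map[A, Int], reference: Map[A, Int]): Map[A, Double] =
    items.collect {
      case (k, n) if reference.contains(k) => k -> n.toDouble / reference(k)
    }

  // Hypothetical sample inputs standing in for t1.txt and gen-voc.txt.
  val sample: Map[String, Int] = wordCounts("The cat saw the dog.")
  val vocabulary: Map[String, Int] = wordCounts("the the the the cat cat dog bird")

  // "the" is 2 / 4, "cat" is 1 / 2, "dog" is 1 / 1; "saw" is absent from vocabulary.
  val ratios: Map[String, Double] = frequency(sample, vocabulary)

  def main(args: Array[String]): Unit =
    ratios.toSeq.sortBy(_._1).foreach { case (word, ratio) => println(s"$word -> $ratio") }
}
```

If you would rather keep words that are missing from the reference, you would have to decide what their ratio should be (for example `Double.NaN` or an `Option`); dropping them avoids a division by zero.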

【Comments】:
