How to assign keys to items in a column in Scala? - Answers

【Question title】: How to assign keys to items in a column in Scala?
【Posted】: 2019-03-11 20:56:33
【Question description】:

I have the following RDD:

Col1     Col2
"abc"    "123a"
"def"    "783b"
"abc"    "674b"
"xyz"    "123a"
"abc"    "783b"

I need the following output, where each item in each column is converted to a unique key. For example: abc -> 1, def -> 2, xyz -> 3.

Col1      Col2
1          1
2          2
1          3
3          1
1          2

Any help would be appreciated. Thanks!


Tags: scala apache-spark mapreduce rdd

【Solution 1】:

In this case you can rely on the string's hashCode. The hash code is always the same for the same input of the same data type. Try this:

scala> "abc".hashCode
res23: Int = 96354

scala> "xyz".hashCode
res24: Int = 119193

scala> val df = Seq(("abc","123a"),
     | ("def","783b"),
     | ("abc","674b"),
     | ("xyz","123a"),
     | ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> def hashc(x: String): Int = x.hashCode
hashc: (x: String)Int

scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))

scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
|    96354|  1509487|
|    99333|  1694000|
|    96354|  1663279|
|   119193|  1509487|
|    96354|  1694000|
+---------+---------+
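One caveat with this approach: hashCode is deterministic, but distinct strings can share a hash code, and the values are not the consecutive 1, 2, 3 keys asked for in the question. The following is a minimal sketch in plain Scala (no Spark) for checking a column's distinct values for collisions. `HashKeyCheck` and `collisions` are hypothetical names introduced here for illustration; they are not part of the answer above.

```scala
object HashKeyCheck {
  // Hypothetical helper: groups the distinct values by hashCode and keeps
  // only the groups in which more than one value shares the same hash.
  def collisions(values: Seq[String]): Map[Int, Seq[String]] =
    values.distinct
      .groupBy(_.hashCode)
      .filter { case (_, sameHash) => sameHash.size > 1 }

  def main(args: Array[String]): Unit = {
    // Col1 from the question: no two distinct values share a hash code.
    val col1 = Seq("abc", "def", "abc", "xyz", "abc")
    println(collisions(col1).isEmpty) // true

    // "Aa" and "BB" are a well-known colliding pair: both hash to 2112.
    println(collisions(Seq("Aa", "BB")).keySet) // Set(2112)
  }
}
```

If the check returns an empty map for every column, the hash codes can serve as unique keys for that data set; otherwise an index-based mapping is the safer choice.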


【Solution 2】:

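If you must map the columns to natural numbers starting from 1, one approach is to apply zipWithIndex to the distinct values of each column, add 1 to the index (since zipWithIndex always starts at 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames with the original data to obtain the index keys: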

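// Assumes a spark-shell session, where `sc` and `spark.implicits._`
// (needed for toDF and the $"..." syntax) are already in scope.
val rdd = sc.parallelize(Seq(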
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
))

val df1 = rdd.map(_._1).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col1", "c1key")

val df2 = rdd.map(_._2).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col2", "c2key")

val dfJoined = rdd.toDF("col1", "col2").
  join(df1, Seq("col1")).
  join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc|    2|    1|
// |783b| def|    3|    1|
// |123a| xyz|    1|    2|
// |123a| abc|    2|    2|
// |674b| abc|    2|    3|
// +----+----+-----+-----+

dfJoined.
  select($"c1key".as("col1"), $"c2key".as("col2")).
  show
// +----+----+
// |col1|col2|
// +----+----+
// |   2|   1|
// |   3|   1|
// |   1|   2|
// |   2|   2|
// |   2|   3|
// +----+----+

Note that if you are fine with keys starting from 0, you can skip the map(r => (r._1, r._2 + 1)) step when generating df1 and df2.
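For data small enough to fit in memory, the same distinct / zipWithIndex idea can be sketched on a plain Scala collection without Spark. This is only an illustration under that assumption; `ColumnKeys`, `keysFor` and `encode` are hypothetical names, not part of the answer. Unlike distinct on an RDD, which does not guarantee any ordering (hence abc -> 2 in the Spark output above), distinct on a Seq keeps values in order of first appearance, so the keys here follow the numbering used in the question.

```scala
object ColumnKeys {
  // Hypothetical helper: maps each distinct value to a 1-based key,
  // numbered in order of first appearance.
  def keysFor(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.map { case (v, i) => v -> (i + 1) }.toMap

  // Replaces every item in both columns with the key for its column.
  def encode(rows: Seq[(String, String)]): Seq[(Int, Int)] = {
    val col1Keys = keysFor(rows.map(_._1))
    val col2Keys = keysFor(rows.map(_._2))
    rows.map { case (a, b) => (col1Keys(a), col2Keys(b)) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("abc", "123a"),
      ("def", "783b"),
      ("abc", "674b"),
      ("xyz", "123a"),
      ("abc", "783b")
    )
    // Prints (1,1), (2,2), (1,3), (3,1), (1,2), one pair per line.
    encode(rows).foreach(println)
  }
}
```

Because the lookup is a plain Map rather than a join, the row order of the input is also preserved.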
