Scala: Map column integer values in a dataframe

【Question title】: Scala: Map column integer values in a dataframe
【Posted】: 2019-08-21 21:41:54
【Question】:

I have a mapping from city IDs to country IDs:

    cityId, countryId
    1, 1200
    2, 1200
    3, 1200
    4, 3000
    5, 3000
    6, 4000

My mapping function looks like this:

    val mapCountry = df.rdd.map(x => (x.getInt(0), 
    x.getInt(1))).collectAsMap()

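I have a dataframe with columns named cityId and countryId. In that dataframe, both cityId and countryId currently hold cityId values. I want to use the map to replace the values in the countryId column.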

    ft = mapGeography.foldLeft(ft)((acc, ca) => 
    acc.withColumnRenamed(ca._1, ca._2))

This gives me an error saying a string is expected, but I am passing an int. When I run it on columns with string values, it works.

Does anyone know how to adjust this so that it works with ints?

【Comments】:

Tags: scala dictionary

【Solution 1】:

If I understand your question correctly, the best approach is to join the two dataframes on cityId and select the desired countryId, as shown below.

    import org.apache.spark.sql.functions.col
    import spark.implicits._  // for toDF; already in scope in spark-shell

    // Lookup table: cityId -> countryId
    val dfCity = Seq(
      (1, 1200), (2, 1200), (3, 1200), (4, 3000), (5, 3000), (6, 4000)
    ).toDF("cityId", "countryId")

    // Data whose countryId column currently holds cityId values
    val dfGeography = Seq(
      (1, 1, 101),  (2, 2, 202), (4, 4, 404), (99, 99, 909)
    ).toDF("cityId", "countryId", "rank")

    // Every column other than the two id columns
    val nonIdCols = dfGeography.columns diff Array("cityId", "countryId")

    dfGeography.
      join(dfCity, Seq("cityId"), "left_outer").
      select(dfGeography("cityId") +: dfCity("countryId") +: nonIdCols.map(col): _*).
      show
    // +------+---------+----+
    // |cityId|countryId|rank|
    // +------+---------+----+
    // |     1|     1200| 101|
    // |     2|     1200| 202|
    // |     4|     3000| 404|
    // |    99|     null| 909|
    // +------+---------+----+
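
As an aside on the error in the question: `withColumnRenamed` takes two `String` column names and renames a column. It does not change the values, which is presumably why folding over an `Int` map fails. If the mapping is small enough to collect to the driver, a rough alternative to the join is to turn the collected map into a literal map column and look up each `cityId` in it. The sketch below assumes Spark 2.2+ (for `typedLit`) and a local SparkSession; the names `MapCountryLookup` and `remapCountry` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, typedLit}

object MapCountryLookup {
  // Hypothetical helper: overwrite countryId with the value that cityId maps to.
  // Ids missing from the map are expected to become null under Spark's default
  // (non-ANSI) settings; ANSI mode may behave differently in some versions.
  def remapCountry(ft: DataFrame, mapCountry: Map[Int, Int]): DataFrame = {
    val lookup = typedLit(mapCountry) // literal MapType column
    ft.withColumn("countryId", lookup(col("cityId")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-country-lookup")
      .getOrCreate()
    import spark.implicits._

    val dfCity = Seq(
      (1, 1200), (2, 1200), (3, 1200), (4, 3000), (5, 3000), (6, 4000)
    ).toDF("cityId", "countryId")

    // Same idea as the question's mapCountry, converted to an immutable Map.
    val mapCountry: Map[Int, Int] =
      dfCity.rdd.map(r => (r.getInt(0), r.getInt(1))).collectAsMap().toMap

    val ft = Seq(
      (1, 1, 101), (2, 2, 202), (4, 4, 404), (99, 99, 909)
    ).toDF("cityId", "countryId", "rank")

    remapCountry(ft, mapCountry).show()
    spark.stop()
  }
}
```

Because the literal map is embedded in the query plan, this is probably only reasonable for small mappings; for anything larger the join is likely the safer choice.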

Note that if dfCity is significantly smaller than dfGeography, you can consider providing a SQL query broadcast hint, simply by replacing dfCity with broadcast(dfCity) in the join() expression.
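
A minimal sketch of that broadcast variant, assuming the same `dfCity` and `dfGeography` shapes as in the answer and a local SparkSession. The wrapper names `BroadcastJoinSketch` and `withCountry` are hypothetical; `broadcast` comes from `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object BroadcastJoinSketch {
  // Left-outer join that takes countryId from the (small) city table,
  // hinting Spark to broadcast dfCity to the executors.
  def withCountry(dfGeography: DataFrame, dfCity: DataFrame): DataFrame = {
    val nonIdCols = dfGeography.columns.diff(Array("cityId", "countryId"))
    dfGeography
      .join(broadcast(dfCity), Seq("cityId"), "left_outer")
      .select(dfGeography("cityId") +: dfCity("countryId") +: nonIdCols.map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val dfCity = Seq(
      (1, 1200), (2, 1200), (3, 1200), (4, 3000), (5, 3000), (6, 4000)
    ).toDF("cityId", "countryId")

    val dfGeography = Seq(
      (1, 1, 101), (2, 2, 202), (4, 4, 404), (99, 99, 909)
    ).toDF("cityId", "countryId", "rank")

    val result = withCountry(dfGeography, dfCity)
    result.explain() // inspect the physical plan to see which join strategy was chosen
    result.show()
    spark.stop()
  }
}
```

The hint is only a suggestion to the planner, and Spark may already broadcast small tables on its own (see `spark.sql.autoBroadcastJoinThreshold`), so the explicit hint mostly matters when the size estimate for the smaller side is unreliable.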

【Comments】: