Convert a DataFrame to an RDD and Dynamically Split the RDD into the Same Number of Columns as the DataFrame: Answers

【Question title】: Convert a DataFrame to RDD and Split the RDD into the same number of Columns as DataFrame Dynamically
【Posted】: 2021-10-06 19:43:59
【Question description】:

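I am trying to convert a DataFrame to an RDD and split each row into columns dynamically and elegantly, so that the number of columns follows the number of columns in the DataFrame.

For example, here is sample data from the Hive employee table: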

Id  Name    Age State   City
123 Bob 34  Texas   Dallas
456 Stan    26  Florida Tampa

val temp_df = spark.sql("Select * from employee")
val temp2_rdd = temp_df.rdd.map(x => (x(0),x(1),x(2),x(3)))

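I would like temp2_rdd to be generated dynamically from the number of columns in the table, rather than hardcoded as I did above.

Since the maximum tuple size in Scala is 22, is there another collection that can hold the rows of the RDD efficiently?

Language: Spark Scala

Please advise.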

【Question discussion】:

Tags: scala dataframe apache-spark rdd

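【Solution 1】:

Instead of extracting and converting each element by index, you can use the toSeq method of the Row object. It returns all of the row's values as a sequence, so the result has as many elements as the DataFrame has columns and is not subject to the 22-element tuple limit.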

  val temp_df = spark.sql("Select * from employee")
  // RDD[List[Any]]
  val temp2_rdd = temp_df.rdd.map(_.toSeq.toList)
  // RDD[List[String]]; note that toString throws a NullPointerException on null column values
  val temp3_rdd = temp_df.rdd.map(_.toSeq.map(_.toString).toList)
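
The snippet above depends on an existing Hive table. As a minimal self-contained sketch of the same idea, the example below builds the sample data in memory instead. The object name RowToSeqSketch, its helper methods, and the local[1] master are assumptions made for illustration and are not part of the original answer. It also shows one possible way to keep the column names next to the values by using Row.getValuesMap, in case positional lists are not enough.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object RowToSeqSketch {

  // In-memory stand-in for the Hive employee table from the question.
  def sampleEmployees(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (123, "Bob", 34, "Texas", "Dallas"),
      (456, "Stan", 26, "Florida", "Tampa")
    ).toDF("Id", "Name", "Age", "State", "City")
  }

  // One List per row; its length always equals the number of DataFrame columns.
  def rowsAsLists(df: DataFrame): RDD[List[Any]] =
    df.rdd.map(_.toSeq.toList)

  // One Map per row, keyed by column name, so values can be looked up by name.
  def rowsAsMaps(df: DataFrame): RDD[Map[String, Any]] = {
    val columns = df.columns.toSeq // captured by the closure below
    df.rdd.map(row => row.getValuesMap[Any](columns))
  }

  def main(args: Array[String]): Unit = {
    // local[1] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("row-to-seq-sketch")
      .getOrCreate()

    val df = sampleEmployees(spark)
    rowsAsLists(df).collect().foreach(println)
    rowsAsMaps(df).collect().foreach(println)

    spark.stop()
  }
}
```

Because neither helper mentions a column index or a column count, the same code should work unchanged if columns are added to or removed from the table, including tables with more than 22 columns.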

【Discussion】: