【Question title】: How to split columns into two sets per type?
【Posted】: 2017-06-05 23:38:11
【Question】:

I have a CSV input file, which we read as follows:

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)

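This reads the data and infers the schema nicely.

The next step is to split the columns into String columns and Integer columns. How?

If the following is the schema of my dataset...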

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)

I would like to split it into two variables (StringCols, IntCols), where:

StringCols should have "First Name","Last Name","Dept"
IntCols should have "ID","Age","DailyRate","DistanceFromHome"

This is what I have tried:

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)

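Now, in types, I would like to loop to find all the StringType entries and look up the corresponding column names in names, and likewise for IntegerType.

【Question comments】:

Tags: scala apache-spark apache-spark-sql

【Solution 1】:

Here you go: you can filter the columns by type using the underlying schema and dataType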

import org.apache.spark.sql.types.{IntegerType, StringType}

// df is the DataFrame to split (rawdata in the question)
val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

// select(col: String, cols: String*) needs at least one column, so head fails on an empty list
val dfOfString = df.select(stringCols.head, stringCols.tail: _*)
val dfOfInt = df.select(intCols.head, intCols.tail: _*)
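
As a possible generalization (not part of the original answer), the column names can be grouped by data type in one pass, which avoids writing one filter per type. This is a minimal sketch, assuming a hand-built schema that mirrors the one printed in the question; the names schema and namesByType are illustrative only.

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

// Hypothetical schema mirroring the one printed in the question (no SparkSession needed).
val schema = StructType(Seq(
  StructField("ID", IntegerType),
  StructField("First Name", StringType),
  StructField("Last Name", StringType),
  StructField("Age", IntegerType),
  StructField("DailyRate", IntegerType),
  StructField("Dept", StringType),
  StructField("DistanceFromHome", IntegerType)
))

// Group column names by data type; within each group the schema order is kept.
val namesByType: Map[DataType, Seq[String]] =
  schema.fields.toSeq
    .groupBy(_.dataType)
    .map { case (dataType, fields) => dataType -> fields.map(_.name) }

// Fall back to an empty list when a type does not occur in the schema.
val stringCols = namesByType.getOrElse(StringType, Seq.empty)
val intCols = namesByType.getOrElse(IntegerType, Seq.empty)
```

With a real DataFrame, rawdata.schema would take the place of the hand-built schema. When selecting, df.select(stringCols.map(col(_)): _*) with org.apache.spark.sql.functions.col should also accept an empty list, unlike the head/tail form.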

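【Comments】:

Thanks, this is what I was looking for. The mistake I made was not importing sql.types.{IntegerType, StringType}. I was doing val tst = rawdata.schema.filter(c => c.dataType == "StringType") and getting an empty list, instead of val tst = rawdata.schema.filter(c => c.dataType == StringType). Thanks a lot. Regards, Bala

【Solution 2】:

Use the dtypes operator:

dtypes: Array[(String, String)] Returns all column names and their data types as an array.

That gives you a more idiomatic way of dealing with a dataset's schema.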

val rawdata = Seq(
  (1, "First Name", "Last Name", 43, 2000, "Dept", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")
scala> rawdata.dtypes.foreach(println)
(ID,IntegerType)
(First Name,StringType)
(Last Name,StringType)
(Age,IntegerType)
(DailyRate,IntegerType)
(Dept,StringType)
(DistanceFromHome,IntegerType)

I would like to split it into two variables (StringCols, IntCols)

(I'd rather stick with immutable values, if you don't mind)

val emptyPair = (Seq.empty[String], Seq.empty[String])
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) { case ((strings, ints), (name: String, typ)) =>
  typ match {
    case "StringType"  => (name +: strings, ints)
    case "IntegerType" => (strings, name +: ints)
    case _             => (strings, ints) // skip other types instead of throwing a MatchError
  }
}

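StringCols should have "First Name","Last Name","Dept" and IntCols should have "ID","Age","DailyRate","DistanceFromHome"

Because each name is prepended, both lists come out in reverse column order. You can reverse the collections, but I'd rather avoid doing so, since it costs performance and gives you nothing in return.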
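
If schema order matters, one alternative that avoids the reversal altogether is to collect over the (name, type) pairs. This is a minimal sketch, not the answer's own method: a hand-written dtypes array, copied from the output printed above, stands in for rawdata.dtypes so that it runs without a SparkSession.

```scala
// Stand-in for rawdata.dtypes, copied from the printed output above (an assumption for illustration).
val dtypes: Array[(String, String)] = Array(
  ("ID", "IntegerType"),
  ("First Name", "StringType"),
  ("Last Name", "StringType"),
  ("Age", "IntegerType"),
  ("DailyRate", "IntegerType"),
  ("Dept", "StringType"),
  ("DistanceFromHome", "IntegerType")
)

// collect keeps the schema order and simply skips columns of any other type.
val stringCols: Seq[String] = dtypes.collect { case (name, "StringType") => name }.toSeq
val intCols: Seq[String] = dtypes.collect { case (name, "IntegerType") => name }.toSeq
```

This walks the array twice instead of once, which is unlikely to matter for a schema-sized collection.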

【Comments】: