Spark Scala Dataframe describe non numeric columns: answers

【Question title】: Spark Scala Dataframe describe non numeric columns
【Posted】: 2017-05-27 19:50:47
【Question】:

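Is there a function similar to describe() for non-numeric columns?

I would like to collect statistics about the "data completeness" of my table, for example:

total number of records
total number of null values
total number of special values (e.g. 0, empty strings, etc.)
total number of distinct values
other things like these...

data.describe() produces interesting values (count, mean, stddev, min, max) only for numeric columns. Is there anything that works well for strings or other types?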

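【Comments】:

I have been looking for something similar in Python pandas. I have not found an explicit way; however, I noticed that when my data frame consists only of columns of type object (not int, float, etc.), df.describe() shows count, unique, top, freq instead of count, mean, std, etc.
So I suggest you try the following in whatever API you are using: data["categorical_col1", "categorical_col2"].describe()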

Tags: scala apache-spark spark-dataframe apache-spark-mllib data-analysis

【Solution 1】:

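There isn't. The problem is that basic statistics on numeric data are cheap. For categorical data, some of them may require multiple scans of the data and unbounded memory (linear in the number of records).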
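One common way around the memory cost of an exact distinct count is to accept an approximate one. The following is a minimal sketch, not part of the original answer, assuming Spark 2.1 or later (where `approx_count_distinct` is available), a DataFrame with at least one column, and column names without dots. The names `approxDistinctCounts` and `maxRelativeError` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Hypothetical helper: one approximate distinct count per column, computed in a single pass.
// `maxRelativeError` is an illustrative name; it is passed through as Spark's `rsd` argument.
def approxDistinctCounts(df: DataFrame, maxRelativeError: Double = 0.05): DataFrame = {
  val aggregates = df.columns.map { name =>
    approx_count_distinct(col(name), maxRelativeError).as(name)
  }
  // agg() takes a first expression plus varargs, so this assumes at least one column.
  df.agg(aggregates.head, aggregates.tail: _*)
}
```

The result has a single row with one approximate count per input column. A smaller `maxRelativeError` gives tighter estimates at the cost of more memory per column.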

Some are cheap, for example counting NULL or empty values: see "Count number of non-NaN entries in each column of Spark dataframe with Pyspark".
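The linked question is about PySpark; a rough Scala equivalent might look like the sketch below. It is an illustration rather than code from the answer. It counts nulls for every column and empty strings for string columns in a single aggregation, and it assumes column names without dots. The helper name `nullAndEmptyCounts` and the `_nulls` / `_empty` / `total_rows` output names are made up for this example.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, when}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: total rows, null count per column, and empty-string
// count per string column, all computed in one pass over the data.
def nullAndEmptyCounts(df: DataFrame): DataFrame = {
  val aggregates = df.schema.fields.flatMap { field =>
    val c = col(field.name)
    // when() without otherwise() yields null for non-matching rows,
    // and count() skips nulls, so this counts only the matching rows.
    val nulls = count(when(c.isNull, 1)).as(s"${field.name}_nulls")
    val empties =
      if (field.dataType == StringType) Seq(count(when(c === "", 1)).as(s"${field.name}_empty"))
      else Seq.empty
    nulls +: empties
  }
  df.agg(count(lit(1)).as("total_rows"), aggregates: _*)
}
```

Because every statistic is an ordinary aggregate expression, Spark can evaluate all of them in one scan instead of one job per column.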

【Discussion】:

【Solution 2】:

Here is an example that collects the relevant statistics for string columns:

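  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types.StringType

  // Profile one string column: longest value, distinct count, empty-string count, null count.
  def getStringColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      // The flag columns are true on a match and null otherwise;
      // count() skips nulls, so counting a flag column counts the matches.
      .withColumn("isEmpty", when(col(columnName) === "", true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .withColumn("fieldLen", length(col(columnName)))
      .agg(
        max(col("fieldLen")).as("max_length"),
        countDistinct(columnName).as("unique"),
        count("isEmpty").as("is_empty"),
        count("isNull").as("is_null")
      )
      .withColumn("col_name", lit(columnName))
  }

  // Profile every string column and stack the one-row results.
  // Note: reduce throws if the DataFrame has no string columns.
  def profileStringColumns(df: DataFrame): DataFrame = {
    df.columns.filter(df.schema(_).dataType == StringType)
      .map(getStringColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name", "unique", "is_empty", "is_null", "max_length")
  }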

The same goes for numeric columns:

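  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types.NumericType

  // Profile one numeric column: min, max, average, standard deviation, zero count, null count.
  def getNumericColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      .withColumn("isZero", when(col(columnName) === 0, true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .agg(
        max(col(columnName)).as("max"),
        count("isZero").as("is_zero"),
        count("isNull").as("is_null"),
        min(col(columnName)).as("min"),
        avg(col(columnName)).as("avg"),
        stddev(col(columnName)).as("std_dev")
      )
      .withColumn("col_name", lit(columnName))
      // col_type is selected in profileNumericColumns, so it has to be produced here.
      .withColumn("col_type", lit(df.schema(columnName).dataType.simpleString))
  }

  // Profile every numeric column and stack the one-row results.
  // Matching on NumericType (rather than comparing dataType.toString against fixed names)
  // also catches parameterized decimals such as DecimalType(10,2); it includes ByteType as well.
  // Note: reduce throws if the DataFrame has no numeric columns.
  def profileNumericColumns(df: DataFrame): DataFrame = {
    df.schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map(_.name)
      .map(getNumericColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name", "col_type", "is_null", "is_zero", "min", "max", "avg", "std_dev")
  }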

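【Discussion】:

【Solution 3】:

Here is some code that helps with the problem of profiling non-numeric data. Please see:
https://github.com/jasonsatran/spark-meta/

To improve performance, we can sample the data or select only the columns we explicitly want to profile.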
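Both ideas can be combined in a small pre-processing step before any profiling function runs. This is a sketch only and is not taken from the linked repository. The helper name `sampleForProfiling` and its default `fraction` and `seed` values are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: keep only the columns of interest, then take a random sample.
// `fraction` is the expected share of rows kept; the actual sample size varies slightly.
def sampleForProfiling(
    df: DataFrame,
    columns: Seq[String],
    fraction: Double = 0.1,
    seed: Long = 42L): DataFrame =
  df.select(columns.map(col): _*)
    .sample(withReplacement = false, fraction = fraction, seed = seed)
```

Statistics computed on the sample are estimates. Null or empty counts can be scaled up by roughly 1 / fraction, but distinct counts cannot be extrapolated that simply.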

【Discussion】: