【Question title】: Extracting number after specific string in Spark dataframe column - Scala
【Posted】: 2020-08-24 22:27:14
【Question】:

I have a DataFrame df in the following format:

 +----------------------+------------------+----------------------------------------------------------------------------------------------+
 |constraint            |constraint_status |constraint_msg                                                                                |
 +----------------------+------------------+----------------------------------------------------------------------------------------------+
 |CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
 |UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
 |PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
 |MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
 |HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
 +----------------------+------------------+----------------------------------------------------------------------------------------------+

I want to extract the numeric value that follows the string Value: and put it in a new column named Value.

Expected output:

 +----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
 |constraint            |constraint_status |constraint_msg                                                                                |Value          |
 +----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
 |CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
 |UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
 |PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |null           |
 |MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
 |HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|null           |
 +----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+

I tried the following code:

      df = df.withColumn("Value",split(df("constraint_msg"), "Value\\: (\\d+)").getItem(0))

But it fails with the error below. Any help is appreciated!

org.apache.spark.sql.AnalysisException: cannot resolve 'split(constraint_msg, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, '`constraint_msg`' is of array type.;;

【Comments】:

    Tags: regex scala apache-spark apache-spark-sql


    【Solution 1】:

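    when..otherwise lets you first filter out the records that do not contain Value:. Assuming constraint_msg always starts with Value: whenever it contains it, I split the message on spaces and take the second element as the desired value.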

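    // Assumes a spark-shell session, where sc and spark.implicits._ are already in scope.
    import org.apache.spark.sql.functions.{col, split, when}

    val df = sc.parallelize(Seq(
      ("CompletenessConstraint", "Success", "Value: 1.0 Notnull condition should be satisfied"),
      ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType")
    )).toDF("constraint", "constraint_status", "constraint_msg")
    
    // Rows containing "Value:" get the second space-separated token; all other rows get null.
    val df1 = df.withColumn("Value",
      when(col("constraint_msg").contains("Value:"), split(col("constraint_msg"), " ").getItem(1))
        .otherwise(null))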
    
    df1.show()
    +--------------------+-----------------+--------------------+-----+
    |          constraint|constraint_status|      constraint_msg|Value|
    +--------------------+-----------------+--------------------+-----+
    |CompletenessConst...|          Success|Value: 1.0 Notnul...|  1.0|
    |PatternMatchConst...|          Failure|Expected type of ...| null|
    +--------------------+-----------------+--------------------+-----+
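
The token-picking rule in this answer can be tried outside Spark. The sketch below is a plain-Scala stand-in for the when/split/getItem expression, not Spark code; the object name, the function name, and the third sample message are hypothetical and only there for illustration.

```scala
object SplitTokenSketch {
  // Mirrors the answer's logic: only messages containing "Value:" are split,
  // and the second space-separated token (index 1) is returned.
  def secondToken(msg: String): Option[String] =
    if (msg.contains("Value:")) msg.split(" ").toList.lift(1) else None

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "Value: 1.0 Notnull condition should be satisfied",
      "Expected type of column CHD_ACCOUNT_NUMBER to be StringType",
      "Check failed, Value: 3.5 is too low" // hypothetical message, not from the question
    )
    samples.foreach(m => println(s"${secondToken(m)} <- $m"))
  }
}
```

The third sample shows why the "starts with Value:" assumption matters: the message contains Value: but the second token is "failed,", so a message shaped like that would need a different extraction rule.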
    

    【Comments】:

      【Solution 2】:

      Check the code below.

      scala> df.show(false)
      +----------------------+------------------+----------------------------------------------------------------------------------------------+
      |constraint            |constraint_status |constraint_msg                                                                                |
      +----------------------+------------------+----------------------------------------------------------------------------------------------+
      |CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
      |UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
      |PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
      |MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
      |HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
      +----------------------+------------------+----------------------------------------------------------------------------------------------+
      
      
      scala> df
      .withColumn("Value",regexp_extract($"constraint_msg","Value: (\\d+\\.\\d+)",1))
      .show(false)
      +----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
      |constraint            |constraint_status |constraint_msg                                                                                |Value          |
      +----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
      |CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
      |UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
      |PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |               |
      |MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
      |HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|               |
      +----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
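
Since regexp_extract uses Java regular expressions, a pattern can be checked in plain Scala before running it on the DataFrame. The sketch below uses the pattern `Value: (\d+\.\d+)`, in which the dot is escaped and the integer part may have several digits; the object and function names are hypothetical, and the last sample in main is made up to exercise a multi-digit integer part.

```scala
object ValueRegexSketch {
  // Capture group 1 holds the decimal number that follows "Value: ".
  private val ValuePattern = "Value: (\\d+\\.\\d+)".r

  // Returns the captured number, or None when the message has no "Value: <number>" part.
  def extractValue(msg: String): Option[String] =
    ValuePattern.findFirstMatchIn(msg).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "Value: 5.1210650000005 Minimum value should be greater than 10.000000",
      "Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000",
      "Value: 12.5 some other check" // hypothetical message, not from the question
    )
    samples.foreach(m => println(s"${extractValue(m)} <- $m"))
  }
}
```

Note that this sketch returns None for a non-matching message, whereas regexp_extract returns an empty string, as the output above shows. If null is wanted as in the question's expected output, one option is to wrap the extracted column in when(...).otherwise(null), testing that it is not the empty string.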
      

      【Comments】:
