Extracting number after specific string in Spark dataframe column - Scala

【Question title】: Extracting number after specific string in Spark dataframe column - Scala
【Posted】: 2020-08-24 22:27:14
【Question】:

I have a dataframe df in the following format:

|constraint            |constraint_status |constraint_msg                                                                                |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|

I want to get the numeric value that follows the string Value: and put it in a new column named Value.

Expected output:

|constraint            |constraint_status |constraint_msg                                                                                |Value          |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |null           |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|null           |

I tried the code below:

      df = df.withColumn("Value",split(df("constraint_msg"), "Value\\: (\\d+)").getItem(0))

But I get the following error. Any help is appreciated!

org.apache.spark.sql.AnalysisException: cannot resolve 'split(constraint_msg, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, '`constraint_msg`' is of array type.;;
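
Apart from the type error, it may help to see why split is an awkward fit here. The sketch below is plain Scala with no Spark involved; it assumes only that Spark's split, like String.split, treats its pattern as a regex delimiter. The names msg, parts, valuePattern and value are illustrative and not from the original post.

```scala
// Plain Scala, no Spark needed: String.split is regex-based, as is Spark's split.
val msg = "Value: 1.0 Notnull condition should be satisfied"

// split treats the pattern as a delimiter and throws away whatever it matched,
// so the digits captured by (\d+) are discarded rather than returned.
val parts = msg.split("Value\\: (\\d+)")
// parts: Array("", ".0 Notnull condition should be satisfied")
// parts(0), the counterpart of getItem(0), is the empty text before the match.

// Reading a capture group from a match keeps the number instead.
val valuePattern = "Value: (\\d+\\.\\d+)".r
val value: Option[String] = valuePattern.findFirstMatchIn(msg).map(_.group(1))
// value: Some("1.0")
```

So even on a string column, getItem(0) would return the text before the match rather than the number, which is why an extraction function with a capture group is the more natural tool.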

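【Comments】:

Tags: regex scala apache-spark apache-spark-sql

【Solution 1】:

when..otherwise lets you first filter out the records that do not contain Value:. Assuming constraint_msg always starts with Value: whenever it contains it, I split the message on spaces and take the second element as the desired value.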

import org.apache.spark.sql.functions.{col, split, when}

val df = sc.parallelize(Seq(
  ("CompletenessConstraint", "Success", "Value: 1.0 Notnull condition should be satisfied"),
  ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType")
)).toDF("constraint", "constraint_status", "constraint_msg")

// Take the second space-separated token, but only where the message contains "Value:"
val df1 = df.withColumn("Value", when(col("constraint_msg").contains("Value:"), split(col("constraint_msg"), " ").getItem(1)).otherwise(null))

df1.show()
+--------------------+-----------------+--------------------+-----+
|          constraint|constraint_status|      constraint_msg|Value|
+--------------------+-----------------+--------------------+-----+
|CompletenessConst...|          Success|Value: 1.0 Notnul...|  1.0|
|PatternMatchConst...|          Failure|Expected type of ...| null|
+--------------------+-----------------+--------------------+-----+

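【Comments】:

【Solution 2】:

Check the code below. It uses regexp_extract with a capture group for the number that follows Value:.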

scala> df.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|constraint            |constraint_status |constraint_msg                                                                                |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+------------------+----------------------------------------------------------------------------------------------+


scala> df.withColumn("Value", regexp_extract($"constraint_msg", "Value: (\\d+\\.\\d+)", 1)).show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|constraint            |constraint_status |constraint_msg                                                                                |Value          |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |               |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|               |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
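
Note that in the output above the rows without a match show an empty Value, because regexp_extract returns an empty string when the pattern does not match, whereas the question asks for null. A minimal sketch of one way to get null and a numeric column follows. It assumes a local SparkSession and a decimal pattern of digits, a dot, and digits; the names pattern, extracted and result are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, when}

// Local session for the sketch only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("value-extract-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("CompletenessConstraint", "Success", "Value: 1.0 Notnull condition should be satisfied"),
  ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType"),
  ("MinimumConstraint", "Success", "Value: 5.1210650000005 Minimum value should be greater than 10.000000")
).toDF("constraint", "constraint_status", "constraint_msg")

// Assumed pattern: "Value: " followed by digits, a literal dot, and more digits.
val pattern = "Value: (\\d+\\.\\d+)"
val extracted = regexp_extract(col("constraint_msg"), pattern, 1)

// when(...) without otherwise(...) leaves non-matching rows as null;
// matching rows are cast from string to double.
val result = df.withColumn("Value", when(extracted =!= "", extracted.cast("double")))

result.show(false)
```

If the values should stay as strings, dropping the cast keeps the rest of the approach unchanged.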

【Comments】: