【Posted】: 2020-08-24 22:27:14
【Problem description】:
I have a DataFrame df in the following format:
+----------------------+-----------------+----------------------------------------------------------------------------------------------+
|constraint            |constraint_status|constraint_msg                                                                                |
+----------------------+-----------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success          |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success          |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure          |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success          |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure          |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+-----------------+----------------------------------------------------------------------------------------------+
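I want to extract the number that follows the string "Value: " and store it in a new column named Value. Rows whose message does not start with "Value: " should get null.
Expected output: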
+----------------------+-----------------+----------------------------------------------------------------------------------------------+---------------+
|constraint            |constraint_status|constraint_msg                                                                                |Value          |
+----------------------+-----------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success          |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success          |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure          |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |null           |
|MinimumConstraint     |Success          |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure          |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|null           |
+----------------------+-----------------+----------------------------------------------------------------------------------------------+---------------+
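To make the expected mapping concrete, here is a minimal sketch of the extraction rule in plain Scala, without Spark. The object name ValueExtraction and the helper extractValue are hypothetical, and the pattern is an assumption about what counts as a number here: digits with an optional decimal part, directly after "Value: ".

```scala
import scala.util.matching.Regex

object ValueExtraction {
  // "Value: " followed by an integer or decimal number; the number is capture group 1.
  val ValuePattern: Regex = """Value: (\d+(?:\.\d+)?)""".r

  // Hypothetical helper: returns the number after "Value: " as a string,
  // or None when the message contains no such prefix.
  def extractValue(msg: String): Option[String] =
    ValuePattern.findFirstMatchIn(msg).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val messages = Seq(
      "Value: 1.0 Notnull condition should be satisfied",
      "Expected type of column CHD_ACCOUNT_NUMBER to be StringType",
      "Value: 5.1210650000005 Minimum value should be greater than 10.000000"
    )
    // Print the extracted value (or "null") next to each message.
    messages.foreach(m => println(s"${extractValue(m).getOrElse("null")} <- $m"))
  }
}
```

Note that the HistogramConstraint message contains the number 1242.0 but not the prefix "Value: ", so under this rule it maps to None, which matches the null in the expected output.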
I tried the following code:
df = df.withColumn("Value",split(df("constraint_msg"), "Value\\: (\\d+)").getItem(0))
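But it fails with the error below. Any help would be appreciated!
org.apache.spark.sql.AnalysisException: cannot resolve 'split(`constraint_msg`, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, '`constraint_msg`' is of array type.;;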
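The error message says that constraint_msg is an array column, while split only accepts a string. Independently of that, split divides a string around the pattern, so getItem(0) would return the text before the match rather than the number itself. The following is a rough sketch of one possible approach, not a confirmed solution: it assumes constraint_msg is of type array<string>, joins its elements with concat_ws, and pulls the number out with regexp_extract. The object AddValueColumn, the helper withValue, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, regexp_extract, when}

object AddValueColumn {
  // Hypothetical helper; assumes constraint_msg is array<string>.
  def withValue(df: DataFrame): DataFrame = {
    // Join the array elements into a single string so string functions apply.
    val msg = concat_ws(" ", col("constraint_msg"))
    // regexp_extract returns an empty string when the pattern does not match.
    val extracted = regexp_extract(msg, """Value: (\d+(?:\.\d+)?)""", 1)
    // when(...) without otherwise(...) leaves non-matching rows as null.
    df.withColumn("Value", when(extracted =!= "", extracted.cast("double")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-extraction")
      .getOrCreate()
    import spark.implicits._

    // Two sample rows modeled on the table in the question.
    val df = Seq(
      ("CompletenessConstraint", "Success",
        Seq("Value: 1.0 Notnull condition should be satisfied")),
      ("PatternMatchConstraint", "Failure",
        Seq("Expected type of column CHD_ACCOUNT_NUMBER to be StringType"))
    ).toDF("constraint", "constraint_status", "constraint_msg")

    withValue(df).show(false)
    spark.stop()
  }
}
```

If constraint_msg turns out to be a plain string column after all, the concat_ws step can be dropped and col("constraint_msg") passed to regexp_extract directly. The cast to double is a design choice for a numeric column; omit it to keep the value as a string.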
【Discussion】:
Tags: regex scala apache-spark apache-spark-sql