【Question title】: Get WrappedArray row value and convert it into a string in Scala
【Posted】: 2019-01-14 04:45:12
【Question】:

I have a DataFrame that looks like this:

+---------------------------------------------------------------------+
|value                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+

From the two rows above, I want to create strings in this format:

"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"

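This needs to be dynamic: if a third value is present in the first row, the resulting string should contain one more delimited value.

How can I do this in Scala?

This is what I did to create the DataFrame: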

    import org.apache.spark.sql.functions.explode
    import sqlContext.implicits._

    val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
    val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"

    // Read the descriptor XML, one row per <FlatFileDescriptor> element
    val dfDiscriptor = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "FlatFileDescriptor").load(discriptorFileLOcation)
    dfDiscriptor.printSchema()
    val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
    val FirstColumnOfHeaderFile = firstColumn.select(explode($"FFField")).as("ColumnsDetails").select(explode($"col")).first.get(0).toString().split(",")(5)
    println(FirstColumnOfHeaderFile)
    // Primary-key columns: this is the DataFrame shown at the top of the question
    val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
    primaryKeyColumnsFinancialLineItem.show(false)

Adding the full schema:

   root
 |-- FFColumnDelimiter: string (nullable = true)
 |-- FFContentItem: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _ffMajVers: long (nullable = true)
 |    |-- _ffMinVers: double (nullable = true)
 |-- FFFileEncoding: string (nullable = true)
 |-- FFFileType: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FFPhysicalFile: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- FFFileName: string (nullable = true)
 |    |    |    |    |-- FFRowCount: long (nullable = true)
 |    |    |-- FFRecord: struct (nullable = true)
 |    |    |    |-- FFField: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- FFColumnNumber: long (nullable = true)
 |    |    |    |    |    |-- FFDataType: string (nullable = true)
 |    |    |    |    |    |-- FFFacets: struct (nullable = true)
 |    |    |    |    |    |    |-- FFMaxLength: long (nullable = true)
 |    |    |    |    |    |    |-- FFTotalDigits: long (nullable = true)
 |    |    |    |    |    |-- FFFieldIsOptional: boolean (nullable = true)
 |    |    |    |    |    |-- FFFieldName: string (nullable = true)
 |    |    |    |    |    |-- FFForKey: struct (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyCol: string (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyRecord: string (nullable = true)
 |    |    |    |-- FFPrimKey: struct (nullable = true)
 |    |    |    |    |-- FFPrimKeyCol: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- FFRecordType: string (nullable = true)
 |-- FFHeaderRow: boolean (nullable = true)
 |-- FFId: string (nullable = true)
 |-- FFRowDelimiter: string (nullable = true)
 |-- FFTimeStamp: string (nullable = true)
 |-- _env: string (nullable = true)
 |-- _ffMajVers: long (nullable = true)
 |-- _ffMinVers: double (nullable = true)
 |-- _ffPubstyle: string (nullable = true)
 |-- _schemaLocation: string (nullable = true)
 |-- _sr: string (nullable = true)
 |-- _xmlns: string (nullable = true)
 |-- _xsi: string (nullable = true)

【Comments】:

    Tags: scala apache-spark apache-spark-sql


    【Solution 1】:

    Looking at the dataframe you posted:

    +---------------------------------------------------------------------+
    |value                                                                |
    +---------------------------------------------------------------------+
    |[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
    |[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
    +---------------------------------------------------------------------+
    

    it must have the following schema:

     |-- value: array (nullable = true)
     |    |-- element: array (containsNull = true)
     |    |    |-- element: string (containsNull = true)
    

    If that assumption holds, you can write a udf function such as:

    import org.apache.spark.sql.functions._
    // Flatten the nested array and join all of its elements with ", "
    val arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
    

    and apply it to the dataframe like this:

    df.withColumn("value", arrayToString($"value"))
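
On Spark 2.4 or later, the same transformation can probably be written without a UDF by combining the built-in `flatten` and `concat_ws` functions. The following is a minimal sketch under that version assumption; the object name `FlattenWithBuiltins`, the helper `joinNested`, and the local `SparkSession` are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, flatten}

object FlattenWithBuiltins {
  // Hypothetical helper: assumes a column "value" of type array<array<string>>.
  // flatten (Spark 2.4+) merges the inner arrays; concat_ws joins the elements.
  def joinNested(df: DataFrame): DataFrame =
    df.withColumn("value", concat_ws(", ", flatten(col("value"))))

  def main(args: Array[String]): Unit = {
    // Local session used only for this sketch
    val spark = SparkSession.builder().master("local[1]").appName("flatten-sketch").getOrCreate()
    import spark.implicits._

    // Sample data mirroring the two rows in the question
    val df = Seq(
      Seq(Seq("LineItem_organizationId", "LineItem_lineItemId")),
      Seq(Seq("OrganizationId", "LineItemId", "SegmentSequence_segmentId"))
    ).toDF("value")

    joinNested(df).show(false)
    spark.stop()
  }
}
```

Built-in functions stay visible to the query optimizer, which is one reason to prefer them over a UDF when the Spark version allows it.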
    

    You should get:

    +-----------------------------------------------------+
    |value                                                |
    +-----------------------------------------------------+
    |LineItem_organizationId, LineItem_lineItemId         |
    |OrganizationId, LineItemId, SegmentSequence_segmentId|
    +-----------------------------------------------------+
    
     |-- value: string (nullable = true)
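
The output above is unquoted, whereas the question asks for each value wrapped in double quotes. A possible variant of the joining logic is sketched below in plain Scala, so it can be checked without a Spark session. The object `QuoteJoin` and the function `quoteAndJoin` are hypothetical names, and the sketch assumes neither the outer nor the inner arrays are null.

```scala
object QuoteJoin {
  // Hypothetical helper: flattens the nested sequences, wraps every value
  // in double quotes, and joins the quoted values with ", ".
  // Assumes non-null input; add null handling if the column is nullable.
  def quoteAndJoin(arr: Seq[Seq[String]]): String =
    arr.flatten.map(s => "\"" + s + "\"").mkString(", ")

  def main(args: Array[String]): Unit = {
    println(quoteAndJoin(Seq(Seq("LineItem_organizationId", "LineItem_lineItemId"))))
    println(quoteAndJoin(Seq(Seq("OrganizationId", "LineItemId", "SegmentSequence_segmentId"))))
  }
}
```

Because the function works on however many elements the arrays contain, a third or fourth value simply yields one more quoted entry. To use it on the dataframe, it could be wrapped the same way as in the answer, for example `udf(QuoteJoin.quoteAndJoin _)`.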
    

    【Comments】:
