Azure 数据流：数组到列答案

【问题标题】：Azure Data Flow: Array to columnsAzure 数据流：数组到列
【发布时间】：2023-02-03 11:52:47
【问题描述】：

在我的数据流中，我有一个带有数组的列，我需要将它映射到列。以下是数据示例：

["title:mr","name:jon","surname:smith"]
[surname:jane"]
["title:mrs","surname:peters"]
["title:mr"]

这是所需结果的示例：

实现这一目标的最佳方法是什么？

【问题讨论】：

标签： azure-data-factory azure-data-factory-2 azure-data-factory-pipeline azure-data-flow

【解决方案1】：

您可以使用派生列、排名和数据透视转换的组合来执行此操作。

假设我将给定的示例数据（字符串数组）作为列mycol。

现在，我已经使用了rank转换。我为排名列提供了列名id，并使用mycol列作为排序条件（升序）。结果如下所示：

现在我已经使用派生列创建了一个动态表达式为unfold(mycol) 的new 列。

由于某种原因，这个新列的类型没有正确呈现。因此，我使用 cast 使其成为具有复杂类型定义的复杂类型 string[]。
我创建了 2 个新列 key 和 value。动态内容如下：

key: split(new[1],':')[1]
value: split(new[1],':')[2]

现在我已经使用了pivot转换。在这里，我在 id 上使用分组依据，选择数据透视列作为 key，选择数据透视列作为 max(value)（因为必须使用聚合）。

获得所需的结果。以下是整个数据流 JSON（实际转换从排名开始，因为您已经有了数组列。）

{
    "name": "dataflow1",
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "sources": [
                {
                    "dataset": {
                        "referenceName": "csv1",
                        "type": "DatasetReference"
                    },
                    "name": "source1"
                }
            ],
            "sinks": [
                {
                    "dataset": {
                        "referenceName": "dest",
                        "type": "DatasetReference"
                    },
                    "name": "sink1"
                }
            ],
            "transformations": [
                {
                    "name": "derivedColumn1"
                },
                {
                    "name": "rank1"
                },
                {
                    "name": "derivedColumn2"
                },
                {
                    "name": "cast1"
                },
                {
                    "name": "derivedColumn3"
                },
                {
                    "name": "pivot1"
                }
            ],
            "scriptLines": [
                "source(output(",
                "          mycol as string",
                "     ),",
                "     allowSchemaDrift: true,",
                "     validateSchema: false,",
                "     ignoreNoFilesFound: false) ~> source1",
                "source1 derive(mycol = split(replace(replace(replace(mycol,'[',''),']',''),'"',''),',')) ~> derivedColumn1",
                "derivedColumn1 rank(asc(mycol, true),",
                "     output(id as long)) ~> rank1",
                "rank1 derive(new = unfold(mycol)) ~> derivedColumn2",
                "derivedColumn2 cast(output(",
                "          new as string[]",
                "     ),",
                "     errors: true) ~> cast1",
                "cast1 derive(key = split(new[1],':')[1],",
                "          value = split(new[1],':')[2]) ~> derivedColumn3",
                "derivedColumn3 pivot(groupBy(id),",
                "     pivotBy(key),",
                "     {} = max(value),",
                "     columnNaming: '$N$V',",
                "     lateral: true) ~> pivot1",
                "pivot1 sink(allowSchemaDrift: true,",
                "     validateSchema: false,",
                "     partitionFileNames:['op.csv'],",
                "     umask: 0022,",
                "     preCommands: [],",
                "     postCommands: [],",
                "     skipDuplicateMapInputs: true,",
                "     skipDuplicateMapOutputs: true,",
                "     saveOrder: 1,",
                "     partitionBy('hash', 1)) ~> sink1"
            ]
        }
    }
}

【讨论】：