【Posted】: 2019-01-14 14:30:08
【Problem description】:
In my Scala program, I have a DataFrame with the following schema:
root
|-- FIRST_NAME: string (nullable = true)
|-- LAST_NAME: string (nullable = true)
|-- SEGMENT_EMAIL: array (nullable = true)
| |-- element: string (containsNull = true)
|-- SEGMENT_ADDRESS_STATE: array (nullable = true)
| |-- element: string (containsNull = true)
|-- SEGMENT_ADDRESS_POSTAL_CODE: array (nullable = true)
| |-- element: string (containsNull = true)
Some sample values:
|FIRST_NAME |LAST_NAME |CONFIRMATION_NUMBER| SEGMENT_EMAIL|SEGMENT_ADDRESS_STATE|SEGMENT_ADDRESS_POSTAL_CODE|
+----------------+---------------+-------------------+--------------------+---------------------+---------------------------+
| Stine| Rocha| [48978451]|[Xavier.Vich@gmail..| [MA]| [01545-1300]|
| Aurora| Markusson| [26341542]| []| [AR]| [72716]|
| Stine| Rocha| [29828771]|[Xavier.Vich@gmail..| [OH]| [45101-9613]|
| Aubrey| Fagerland| [24572991]|[Aubrey.Fagerland...| []| []|
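How can I group similar records by first name + last name + email when the column values are arrays, merging the remaining array columns of each group?
I would like output like this: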
|FIRST_NAME |LAST_NAME |CONFIRMATION_NUMBER | SEGMENT_EMAIL|SEGMENT_ADDRESS_STATE|SEGMENT_ADDRESS_POSTAL_CODE|
+----------------+---------------+---------------------+--------------------+---------------------+---------------------------+
| Stine| Rocha| [48978451, 29828771]|[Xavier.Vich@gmail..| [MA, OH]| [01545-1300, 45101-9613]|
| Aurora| Markusson| [26341542]| []| [AR]| [72716]|
| Aubrey| Fagerland| [24572991]|[Aubrey.Fagerland...| []| []|
Thanks!
【Discussion】:
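One possible approach, as a rough sketch rather than a definitive answer: group by FIRST_NAME, LAST_NAME and SEGMENT_EMAIL (Spark can group on array columns), collect the remaining array columns with `collect_list`, and merge the resulting array of arrays with `flatten`. This assumes Spark 2.4 or later, where `flatten` is available. The object name `GroupSimilarRecords` and the helper `mergeByNameAndEmail` are hypothetical, the sample rows use placeholder e-mail addresses because the real ones are truncated in the question, and CONFIRMATION_NUMBER is assumed to be an array of strings since the printed schema does not list it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, flatten}

object GroupSimilarRecords {

  // Hypothetical helper: groups on name + e-mail and concatenates every
  // other (array-typed) column within each group.
  // Assumes there is at least one non-key column and that all non-key
  // columns are arrays.
  def mergeByNameAndEmail(df: DataFrame): DataFrame = {
    val keys = Seq("FIRST_NAME", "LAST_NAME", "SEGMENT_EMAIL")

    // collect_list gives array<array<T>>; flatten turns it into array<T>.
    val merged = df.columns.filterNot(keys.contains).map { c =>
      flatten(collect_list(col(c))).as(c)
    }

    df.groupBy(keys.map(col): _*)
      .agg(merged.head, merged.tail: _*)
      .select(df.columns.map(col): _*) // restore the original column order
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("group-similar-records")
      .getOrCreate()
    import spark.implicits._

    // Placeholder data modelled on the question; e-mail values are made up.
    val df = Seq(
      ("Stine", "Rocha", Seq("48978451"), Seq("stine@example.com"), Seq("MA"), Seq("01545-1300")),
      ("Aurora", "Markusson", Seq("26341542"), Seq.empty[String], Seq("AR"), Seq("72716")),
      ("Stine", "Rocha", Seq("29828771"), Seq("stine@example.com"), Seq("OH"), Seq("45101-9613")),
      ("Aubrey", "Fagerland", Seq("24572991"), Seq("aubrey@example.com"), Seq.empty[String], Seq.empty[String])
    ).toDF(
      "FIRST_NAME", "LAST_NAME", "CONFIRMATION_NUMBER",
      "SEGMENT_EMAIL", "SEGMENT_ADDRESS_STATE", "SEGMENT_ADDRESS_POSTAL_CODE"
    )

    mergeByNameAndEmail(df).show(false)
    spark.stop()
  }
}
```

Two caveats: `collect_list` does not guarantee the order in which rows of a group are collected, so the merged arrays may come out as [OH, MA] rather than [MA, OH]; and duplicates are kept, so if they are unwanted the `flatten(...)` expression could be wrapped in `array_distinct` (also Spark 2.4+).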
Tags: scala apache-spark apache-spark-sql