【发布时间】:2021-07-03 20:45:30
【问题描述】:
我有两个大型 spark 数据框,用户详细信息和用户关系。 两个数据框都有超过 20M 的记录。
数据框joinJoin操作非常非常慢。请帮助我提高加入性能。
DF 详细信息
user_df
- 一个用户应该有多个地址。
- 一个用户+地址应该有多个网络
架构
root
|-- USERIDENTIFIER: string (nullable = true)
|-- ADDRESSIDENTIFIER: string (nullable = true)
|-- LATITUDE: decimal(15,12) (nullable = true)
|-- LONGITUDE: decimal(15,12) (nullable = true)
|-- USERNAME: string (nullable = true)
|-- NETWORKIDENTIFIER: string (nullable = true)
样本数据
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|USERIDENTIFIER |ADDRESSIDENTIFIER |LATITUDE |LONGITUDE |USERNAME |NETWORKIDENTIFIER |
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|C9BBB242202692B589DC5E6AD1040229|0B60DA9CB69084711BC119CB7DB5A120|33.779730000000|-117.867278000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0EBFEB7F15B503D7F34BA4650E561D4B|33.804552000000|-118.067973000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0DC22C71A345C6750158E88D98D6671D|33.701665000000|-117.956545000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|086E9420C60A7D037FB127727967337B|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0E30A48D4829E093E60C7026351DFA04|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|33.610987000000|-117.712710000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|33.569658000000|-117.726847000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|33.439137000000|-117.621570000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson |06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
user_relationship_df
- 一个用户地址可以与另一个用户有关系(所有地址)
架构
root
|-- PARENTUSERIDENTIFIER: string (nullable = true)
|-- PARENTADDRESSIDENTIFIER: string (nullable = true)
|-- CHILDUSERIDENTIFIER: string (nullable = true)
|-- NETWORKIDENTIFIER: string (nullable = true)
样本数据
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|PARENTUSERIDENTIFIER |PARENTADDRESSIDENTIFIER |CHILDUSERIDENTIFIER |NETWORKIDENTIFIER |
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
我必须执行波纹管连接才能获取所有父子关系详细信息。
user_relationship_df = user_df.alias('U1').join(
user_relationship_df.alias('R'),
[
f_col('U1.UserIdentifier') == f_col('R.parentUserIdentifier'),
f_col('U1.addressIdentifier') == f_col('R.parentAddressIdentifier'),
f_col('U1.networkIdentifier') == f_col('R.networkIdentifier'),
],
'inner'
).join(
user_df.alias('U2'),
[
f_col('U2.UserIdentifier') == f_col('R.childUserIdentifier'),
f_col('U2.networkIdentifier') == f_col('R.networkIdentifier')
]
).select(
f_col('U1.UserIdentifier').alias('parentUserIdentifier'),
f_col('U1.AddressIdentifier').alias('parentAddressIdentifier'),
f_col('U2.UserName').alias('parentUserName'),
f_col('U1.latitude').alias('parentlatitude'),
f_col('U1.longitude').alias('parentlongitude'),
f_col('U2.UserIdentifier').alias('childUserIdentifier'),
f_col('U2.AddressIdentifier').alias('childadressIdentifier'),
f_col('U2.latitude').alias('childlatitude'),
f_col('U2.longitude').alias('childlongitude'),
f_col('U2.UserName').alias('childUserName'),
f_col('R.networkIdentifier').alias('networkIdentifier')
)
上面的join操作很慢。如何提高性能?
下面是join后的写操作。
加入后的预期输出
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|parentUserIdentifier|parentAddressIdentifier| parentUserName| parentlatitude| parentlongitude| childUserIdentifier|childadressIdentifier| childlatitude| childlongitude| childUserName| networkIdentifier|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
user_relationship_details_df.coalesce(1000).write.option(
'maxRecordsPerFile', 100000).mode('overwrite').format('json').save('hdfs_path')
用来执行的pyspark命令。
spark-submit script.py
【问题讨论】:
-
合并 1 ....
-
Coalesce 导致性能过低。请尝试根据数据量或大小增加。
-
@Srinivas 我已经将 coalesce 增加到 1000。但是没有太大的改进
-
@mck 你能帮帮我吗?
-
你能发布一些示例数据吗?两张桌子
标签: apache-spark pyspark apache-spark-sql