Spark DataFrame left outer join takes a lot of time: answers

【Question title】: Spark DataFrame left outer join is taking a lot of time
【Posted】: 2018-08-19 16:17:28
【Question description】:

I have two DataFrames, ipwithCounryName (12 MB) and ipLogs (1 GB). I want to join them on the common column ipRange. I have broadcast the ipwithCounryName DataFrame. My code is below.

    val ipwithCounryName_df = Init.iptoCountryBC.value
    ipwithCounryName_df.createOrReplaceTempView("inputTable")
    ipLogs.createOrReplaceTempView("ipTable")
    // Note: the query reads from "ipasLong", a view this snippet does not register.
    val joined_table = Init.getSparkSession.sql(
      """SELECT hostname, date, path, status, content_size, inputTable.countryName
        |FROM ipasLong LEFT JOIN inputTable
        |ON ipasLongValue >= StartingRange AND ipasLongValue <= Endingrange""".stripMargin)

== Physical Plan ==

*Project [hostname#34, date#98, path#36, status#37, content_size#105L, 
 countryName#5]
+- BroadcastNestedLoopJoin BuildRight, Inner, ((ipasLongValue#354L >= 
StartingRange#2L) && (ipasLongValue#354L <= Endingrange#3L))
:- *Project [UDF:IpToInt(hostname#34) AS IpasLongValue#354L, hostname#34, 
date#98, path#36, status#37, content_size#105L]
:  +- *Filter ((isnotnull(isIp#112) && isIp#112) && 
isnotnull(UDF:IpToInt(hostname#34)))
:     +- InMemoryTableScan [path#36, content_size#105L, isIp#112, 
hostname#34, date#98, status#37], [isnotnull(isIp#112), isIp#112, 
isnotnull(UDF:IpToInt(hostname#34))]
:           +- InMemoryRelation [hostname#34, date#98, path#36, status#37, 
content_size#105L, isIp#112], true, 10000, StorageLevel(disk, memory, 
deserialized, 1 replicas)
:                 +- *Project [hostname#34, cast(unix_timestamp(date#35, 
dd/MMM/yyyy:HH:mm:ss ZZZZ, Some(Asia/Calcutta)) as timestamp) AS date#98, 
path#36, status#37, CASE WHEN isnull(content_size#38L) THEN 0 ELSE 
content_size#38L END AS content_size#105L, UDF(hostname#34) AS isIp#112]
:                    +- *Filter (isnotnull(isBadData#45) && NOT isBadData#45)
:                       +- InMemoryTableScan [isBadData#45, hostname#34, 
status#37, path#36, date#35, content_size#38L], [isnotnull(isBadData#45), NOT 
isBadData#45]
:                             +- InMemoryRelation [hostname#34, date#35, 
path#36, status#37, content_size#38L, isBadData#45], true, 10000, 
StorageLevel(disk, memory, deserialized, 1 replicas)
:                                   +- *Project [regexp_extract(val#26, 
^([^\s]+\s), 1) AS hostname#34, regexp_extract(val#26, ^.* 
(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1) AS date#35, 
regexp_extract(val#26, ^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1) AS path#36, 
cast(regexp_extract(val#26, ^.*"\s+([^\s]+), 1) as int) AS status#37, 
cast(regexp_extract(val#26, ^.*\s+(\d+)$, 1) as bigint) AS content_size#38L, 
UDF(named_struct(hostname, regexp_extract(val#26, ^([^\s]+\s), 1), date, 
regexp_extract(val#26, ^.*(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1), 
path, regexp_extract(val#26, ^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1), status, 
cast(regexp_extract(val#26, ^.*"\s+([^\s]+), 1) as int), content_size, 
cast(regexp_extract(val#26, ^.*\s+(\d+)$, 1) as bigint))) AS isBadData#45]
:                                      +- *FileScan csv [val#26] Batched: 
false, Format: CSV, Location: 
InMemoryFileIndex[file:/C:/Users/M1047320/Desktop/access_log_Jul95], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<val:string>
+- BroadcastExchange IdentityBroadcastMode
+- *Project [StartingRange#2L, Endingrange#3L, CountryName#5]
     +- *Filter (isnotnull(StartingRange#2L) && isnotnull(Endingrange#3L))
        +- *FileScan csv [StartingRange#2L,Endingrange#3L,CountryName#5] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/M1047320/Documents/Spark-301/Documents/GeoIPCountryWhois.csv], PartitionFilters: [], PushedFilters: [IsNotNull(StartingRange), IsNotNull(Endingrange)], ReadSchema: struct<StartingRange:bigint,Endingrange:bigint,CountryName:string>

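The join takes a long time (> 30 minutes). I have another inner join on two different DataFrames of the same sizes, where the join condition is "=", and it takes only 5 minutes. How should I improve my code? Please advise.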

【Question discussion】:

This is a known issue in Spark: issues.apache.org/jira/browse/SPARK-8682
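
The plan in the question shows a BroadcastNestedLoopJoin, which checks the range condition for every pair of log row and range row. One workaround that is often used for range conditions like this is to keep the small range table sorted and look up each IP with a binary search instead of joining. The following is only a minimal sketch of that idea, not something proposed in the original thread. The names `IpRange`, `RangeLookup`, `prepare`, `lookup`, and `collectedRanges` are hypothetical, and the sketch assumes the ranges do not overlap.

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for one row of the range table; the fields mirror the
// StartingRange / Endingrange / CountryName columns from the question.
final case class IpRange(start: Long, end: Long, country: String)

object RangeLookup {
  // Sort the ranges by their start so they can be binary-searched.
  // Assumption: ranges do not overlap.
  def prepare(ranges: Seq[IpRange]): Array[IpRange] =
    ranges.sortBy(_.start).toArray

  // Return the country whose range contains `ip`, or None when no range
  // matches (None plays the role of the NULL a left outer join would yield).
  def lookup(sorted: Array[IpRange], ip: Long): Option[String] = {
    @tailrec
    def go(lo: Int, hi: Int): Option[String] =
      if (lo > hi) None
      else {
        val mid = lo + (hi - lo) / 2
        val r = sorted(mid)
        if (ip < r.start) go(lo, mid - 1)
        else if (ip > r.end) go(mid + 1, hi)
        else Some(r.country)
      }
    go(0, sorted.length - 1)
  }
}

// Possible Spark wiring (assumption, not run here; `collectedRanges` is a
// hypothetical Seq[IpRange] collected from the small DataFrame):
//   val rangesBC  = spark.sparkContext.broadcast(RangeLookup.prepare(collectedRanges))
//   val countryOf = udf((ip: Long) => RangeLookup.lookup(rangesBC.value, ip))
//   ipLogs.withColumn("countryName", countryOf($"ipasLongValue"))
```

With this approach each log row costs a logarithmic number of comparisons against the range table rather than one comparison per range row. Whether it actually beats the join depends on the data and the Spark version, so it is worth measuring.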

Tags: apache-spark apache-spark-sql outer-join

【Solution 1】:

Keep the filter condition in the WHERE clause and join the tables on a common column name. I am assuming countryName is common to both DataFrames.

val joined_table = Init.getSparkSession.sql(
  """SELECT hostname, date, path, status, content_size, inputTable.countryName
    |FROM ipasLong LEFT JOIN inputTable ON ipasLong.countryName = inputTable.countryName
    |WHERE ipasLongValue >= StartingRange AND ipasLongValue <= Endingrange""".stripMargin)

You can also join the DataFrames directly.

// "joincondition" and "select columns" are placeholders to be filled in.
val result = ipLogs.join(broadcast(ipwithCounryName), "joincondition", "left_outer")
  .where($"ipasLongValue" >= $"StartingRange" && $"ipasLongValue" <= $"Endingrange")
  .select("select columns")

Hope this helps.

【Discussion】:

Thanks for your help, but how does the above solution reduce the execution time? My main concern is reducing execution time.

【Solution 2】:

You can try raising the JVM memory and core settings toward your system's capacity to make full use of it, for example:

spark-submit --driver-memory 12G --conf spark.driver.maxResultSize=3g --executor-cores 6 --executor-memory 16G

【Discussion】: