【Question title】: Apache Spark: collect into an array intersections
【Posted】: 2021-11-11 18:22:46
【Question description】:

Suppose a DataFrame has two columns, C1 and C2:

+---+-----+
|C1 | C2  |
+---+-----+
|A  |  B  |
|C  |  D  |
|A  |  E  |
|E  |  F  |
+---+-----+

My goal is to collect the values that are linked to each other, directly or through other rows, into arrays:

+--------------+
| intersections|
+--------------+
|[A, B, E, F]  |
|[C, D]        |
+--------------+

How can this be done efficiently when the DataFrame has a very large number of rows (about 1 billion)?

【Comments】:

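  • This problem is best solved with a graph approach. Load the data into a graph in which the distinct values of the two columns are the nodes and the (C1, C2) pairs are the edges. First test whether the graph is connected, meaning every value can be reached from every other value; in that case all values form a single group and you do not have to continue. If the graph is not connected, compute the clusters (communities); the nodes in each cluster are your intersections.
  • Please take a look at this question. You can use a similar approach.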
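
The graph idea from the first comment can be illustrated on a single machine with a union-find structure. This is only a minimal sketch on in-memory data to show what "computing the clusters" means; it is not meant for roughly 1 billion rows, and the names `ComponentsSketch` and `components` are hypothetical, not part of Spark or any library.

```scala
import scala.collection.mutable

object ComponentsSketch {
  // Hypothetical helper: groups values into connected components with union-find.
  def components(edges: Seq[(String, String)]): List[List[String]] = {
    // parent(x) points towards the representative of the set containing x
    val parent = mutable.Map.empty[String, String]

    def find(x: String): String = {
      parent.getOrElseUpdate(x, x) // register unseen values as their own root
      var root = x
      while (parent(root) != root) root = parent(root)
      // Path compression: point every node on the walked path at the root
      var cur = x
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    def union(a: String, b: String): Unit = {
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(ra) = rb
    }

    // Each (C1, C2) pair is an edge that merges two sets
    edges.foreach { case (a, b) => union(a, b) }

    // Group all values by their representative, then sort for a stable result
    parent.keys.toList
      .groupBy(find)
      .values
      .map(_.sorted)
      .toList
      .sortBy(_.head)
  }
}
```

For the sample rows in the question, `ComponentsSketch.components(Seq(("A", "B"), ("C", "D"), ("A", "E"), ("E", "F")))` returns `List(List(A, B, E, F), List(C, D))`. A distributed connected-components implementation produces the same grouping without holding all values on one machine.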

Tags: scala apache-spark apache-spark-sql


【Solution 1】:

One solution is the GraphFrames library (https://graphframes.github.io/graphframes/docs/_site/index.html)

Disclaimer: tested with Spark 2.4.4 and GraphFrames 0.7.0

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.graphframes.GraphFrame

object SparkApp extends App {

  val appName = "appName"
  val master = "local[*]"

  val spark = SparkSession
    .builder
    .appName(appName)
    .master(master)
    .getOrCreate()

  import spark.implicits._

  val dataTest =
    Seq(
      ("A", "B"),
      ("C", "D"),
      ("A", "E"),
      ("E", "F")
    ).toDF("C1", "C2")

  // The default connectedComponents algorithm requires a checkpoint directory
  spark.sparkContext.setCheckpointDir("/some/path/hdfs/test_checkpoints")

  // Vertices: the distinct values of both columns; edges: the (C1, C2) pairs
  val graphTest: GraphFrame =
    GraphFrame(
      dataTest.select($"C1" as "id").union(dataTest.select($"C2" as "id")).distinct,
      dataTest.withColumnRenamed("C1", "src").withColumnRenamed("C2", "dst")
    )

  // Adds a "component" column with the connected-component id of each vertex
  val graphComponentsTest = graphTest.connectedComponents.run()

  // One row per component; sort_array makes the order inside each array deterministic
  val clustersResultTestDF =
    graphComponentsTest
      .groupBy("component")
      .agg(sort_array(collect_list("id")) as "intersections")
      .select("intersections")

  clustersResultTestDF.show
}

The output is

+--------------+
| intersections|
+--------------+
|[A, B, E, F]  |
|[C, D]        |
+--------------+

【Comments】:
