【问题标题】:How can I build graph (using Graphx) from Spark Dataframe?如何从 Spark Dataframe 构建图形(使用 Graphx)?
【发布时间】:2022-06-10 16:57:35
【问题描述】:

我已经创建了一个 Spark DataFrame,以便通过 Graphx 构建图形,Graphx 是 Spark 的 API 并接受 Spark Dataframe 格式。所以,现在我有了这样的数据,

+--------------------+----------------+------+
|           hotel_url|          author|rating|
+--------------------+----------------+------+
|Hotel_Review-g194...|    violettaf340|     5|
|Hotel_Review-g194...|       Lagaiuzza|     5|
|Hotel_Review-g194...|      ashleyn763|     5|
|Hotel_Review-g194...|     DavideMauro|     5|
|Hotel_Review-g194...|        Alemma11|     4|
|Hotel_Review-g194...|       ladispoli|     4|
|Hotel_Review-g303...|       LiliT0URS|     3|
|Hotel_Review-g303...|     Amandainldn|     4|
|Hotel_Review-g303...|TwoMonkeysTravel|     5|
|Hotel_Review-g303...|     BiancaB3358|     4|
|Hotel_Review-g303...|    Brett-Sweden|     4|
|Hotel_Review-g303...|      analuizade|     5|
|Hotel_Review-g303...|          heckfy|     5|
|Hotel_Review-g303...|  MatheusMedrado|     3|
|Hotel_Review-g303...|TwoMonkeysTravel|     5|
|Hotel_Review-g303...|          SaStar|     4|
|Hotel_Review-g303...|   chrisbG2838DY|     4|
|Hotel_Review-g303...|        virninha|     5|
|Hotel_Review-g303...|    AugustusC_13|     5|
|Hotel_Review-g303...|         AnnaMir|     5|
+--------------------+----------------+------+

我想问你,如何从 Spark Dataframe 创建一个具有 [ (Node: hotel_url) --- (weight: rating) --- (Node: author)] 这种类型关系的图表?

您也可以从给定的图中了解所需的关系。

【问题讨论】:

    标签: python apache-spark pyspark spark-graphx


    【解决方案1】:
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.graphx.Edge
    import org.apache.spark.sql.types._
    import org.apache.spark.graphx.Graph
    import org.apache.spark.sql.functions._
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD
    import spark.implicits._
    
    val data = List(
      ("Hotel_Review-g194...", "violettaf340", 5),
      ("Hotel_Review-g194...", "Lagaiuzza", 5),
      ("Hotel_Review-g194...", "ashleyn763", 5),
      ("Hotel_Review-g194...", "DavideMauro", 5),
      ("Hotel_Review-g194...", "Alemma11", 4),
      ("Hotel_Review-g194...", "ladispoli", 4),
      ("Hotel_Review-g303...", "LiliT0URS", 3),
      ("Hotel_Review-g303...", "Amandainldn", 4),
      ("Hotel_Review-g303...", "TwoMonkeysTravel", 5),
      ("Hotel_Review-g303...", "BiancaB3358", 4),
      ("Hotel_Review-g303...", "Brett-Sweden", 4),
      ("Hotel_Review-g303...", "analuizade", 5),
      ("Hotel_Review-g303...", "heckfy", 5),
      ("Hotel_Review-g303...", "MatheusMedrado", 3),
      ("Hotel_Review-g303...", "TwoMonkeysTravel", 5),
      ("Hotel_Review-g303...", "SaStar", 4),
      ("Hotel_Review-g303...", "chrisbG2838DY", 4),
      ("Hotel_Review-g303...", "virninha", 5),
      ("Hotel_Review-g303...", "AugustusC_13", 5),
      ("Hotel_Review-g303...", "AnnaMir", 5)
    ).toDF("hotel_url", "author", "rating")
    
    val vertices: RDD[(VertexId, String)] = data
      .select(explode(array(col("hotel_url"), col("author"))))
      .dropDuplicates()
      .rdd
      .map(_.getAs[String](0))
      .zipWithIndex
      .map(_.swap)
    
    val vertDF = vertices.toDF("id", "node")
    
    val edges = data
      .join(vertDF, data.col("hotel_url") === vertDF("node"))
      .select('author, 'rating.cast(StringType), 'id as 'idS)
      .join(vertDF, data("author") === vertDF("node"))
      .rdd
      .map(row =>
        Edge(
          row.getAs[Long]("idS"),
          row.getAs[Long]("id"),
          "rating: " + row.getAs[String]("rating")
        )
      )
    
    val graph = Graph(vertices, edges)
    
    graph.vertices.foreach(println _)
    //    (2,Amandainldn)
    //    (7,heckfy)
    //    (5,DavideMauro)
    //    (0,MatheusMedrado)
    //    (4,ashleyn763)
    //    (1,LiliT0URS)
    //    (3,chrisbG2838DY)
    //    (9,Brett-Sweden)
    //    (11,virninha)
    //    (12,BiancaB3358)
    //    (16,AnnaMir)
    //    (10,TwoMonkeysTravel)
    //    (6,SaStar)
    //    (17,AugustusC_13)
    //    (19,ladispoli)
    //    (20,Alemma11)
    //    (14,analuizade)
    //    (8,Lagaiuzza)
    //    (18,violettaf340)
    //    (15,Hotel_Review-g194...)
    //    (13,Hotel_Review-g303...)
    
    graph.edges.foreach(println(_))
    //    Edge(13,0,rating: 3)
    //    Edge(13,1,rating: 3)
    //    Edge(13,3,rating: 4)
    //    Edge(15,4,rating: 5)
    //    Edge(13,2,rating: 4)
    //    Edge(13,12,rating: 4)
    //    Edge(13,10,rating: 5)
    //    Edge(13,10,rating: 5)
    //    Edge(15,8,rating: 5)
    //    Edge(15,5,rating: 5)
    //    Edge(13,9,rating: 4)
    //    Edge(13,6,rating: 4)
    //    Edge(13,11,rating: 5)
    //    Edge(15,18,rating: 5)
    //    Edge(13,14,rating: 5)
    //    Edge(13,16,rating: 5)
    //    Edge(15,19,rating: 4)
    //    Edge(13,17,rating: 5)
    

    【讨论】:

      猜你喜欢
      • 2014-12-08
      • 1970-01-01
      • 1970-01-01
      • 2018-09-28
      • 2015-10-26
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多