如何在pyspark上使用嵌入式结构数组和数组连接两个配置单元表答案

【问题标题】：how to join two hive tables with embedded array of struct and array on pyspark如何在pyspark上使用嵌入式结构数组和数组连接两个配置单元表
【发布时间】：2021-01-20 19:16:55
【问题描述】：

我正在尝试在数据块上加入两个配置单元表。

tab1：

consumer_id (string) question_id                                  some_questions 
  "reghvsdvwe"       "rvsvbetvs-dvewdqwavd-363tr13r"              The contents are shown below

“some_questions”的架构

 array<struct<question_id:string, answers:array<struct<answer_id:string, date_made:timestamp, updated:timestamp>>>>

“some_questions”示例：

  0: 
   question_id: "rvsvbetvs-dvewdqwavd-363tr13r" 
   answers: 0: {"answer_id": "4363r23-46745y3-2er296", "date_made": "2006-11-02T00:00:00.000+0000", "Updated": "2006-12-01T00:00:00.000+0000"}

   1: 
     question_id: "rthdcva45-3t342r34y-vdvsdvds"
     answers: 0: {"answer_id": "eewgrg-2353t3-thetber", "date_made": "2006-05-12T00:00:00.000+0000", "Updated": "2006-05-12T00:00:00.000+0000"}

标签2：

   question_id (string)              answer_id(string)          question_contents            answer_contents
   "rvsvbetvs-dvewdqwavd-363tr13r"   "4363r23-46745y3-2er296"   "what do you like the food?"  "smell is good"

   "rthdcva45-3t342r34y-vdvsdvds"    "eewgrg-2353t3-thetber"    "how do you enjoy the travel ?"      "too much traffic in rush hour"

我需要通过 "question_id" 加入 tab1 和 tab2 以便我得到一个新表

 consumer_id    question_id                     question_content              answer_content
 "reghvsdvwe"   "rvsvbetvs-dvewdqwavd-363tr13r"  "what do you like the food?"   "smell is good"

 "reghvsdvwe"   "rthdcva45-3t342r34y-vdvsdvds"  "how do you enjoy the travel ?"   "too much traffic in rush hour"

我尝试通过 pyspark 加入他们。但是，我不确定如何使用嵌入式结构/数组分解数组。

谢谢

【问题讨论】：

你想加入两个标签的条件是什么？
通过“question_id”加入tab1和tab2，谢谢
嗨，@user3448011，您介意在我的回答中添加反馈吗？顺便提一句。您是希望通过some_question 列内的question_id 字段加入，还是作为独立列添加的新question_id 加入。使用后者与您想要的结果中显示的不匹配。
我需要通过“some_question”中的“question_id”加入他们，谢谢！

标签： python sql dataframe pyspark hive

【解决方案1】：

对于 SparkSQL，您可以使用 exists:

spark.sql("""
  SELECT t1.consumer_id, t2.answer_id, t2.question_contents, t2.answer_contents 
  FROM tab1 as t1
  JOIN tab2 as t2 ON exists(t1.some_questions, x -> x.question_id=t2.question_id)
""").show()
+-----------+--------------------+--------------------+--------------------+
|consumer_id|           answer_id|   question_contents|     answer_contents|
+-----------+--------------------+--------------------+--------------------+
| reghvsdvwe|4363r23-46745y3-2...|what do you like ...|       smell is good|
| reghvsdvwe|eewgrg-2353t3-the...|how do you enjoy ...|too much traffic ...|
+-----------+--------------------+--------------------+--------------------+

或array_contains:

spark.sql(""" 
  SELECT t1.consumer_id, t2.answer_id, t2.question_contents, t2.answer_contents 
  FROM tab1 as t1 
  JOIN tab2 as t2 ON array_contains(t1.some_questions.question_id, t2.question_id)
""").show()

使用 PySpark 语法：

from pyspark.sql.functions import expr
df_new = tab1.alias('t1').join(
  tab2.alias('t2'), 
  expr("array_contains(t1.some_questions.question_id, t2.question_id)")
).select('t1.consumer_id', 't2.question_id', 't2.question_contents', 't2.answer_contents')

【讨论】：