【发布时间】:2021-01-20 19:16:55
【问题描述】:
我正在尝试在数据块上加入两个配置单元表。
tab1:
consumer_id (string) question_id some_questions
"reghvsdvwe" "rvsvbetvs-dvewdqwavd-363tr13r" The contents are shown below
“some_questions”的架构
array<struct<question_id:string, answers:array<struct<answer_id:string, date_made:timestamp, updated:timestamp>>>>
“some_questions”示例:
0:
question_id: "rvsvbetvs-dvewdqwavd-363tr13r"
answers: 0: {"answer_id": "4363r23-46745y3-2er296", "date_made": "2006-11-02T00:00:00.000+0000", "Updated": "2006-12-01T00:00:00.000+0000"}
1:
question_id: "rthdcva45-3t342r34y-vdvsdvds"
answers: 0: {"answer_id": "eewgrg-2353t3-thetber", "date_made": "2006-05-12T00:00:00.000+0000", "Updated": "2006-05-12T00:00:00.000+0000"}
标签2:
question_id (string) answer_id(string) question_contents answer_contents
"rvsvbetvs-dvewdqwavd-363tr13r" "4363r23-46745y3-2er296" "what do you like the food?" "smell is good"
"rthdcva45-3t342r34y-vdvsdvds" "eewgrg-2353t3-thetber" "how do you enjoy the travel ?" "too much traffic in rush hour"
我需要通过 "question_id" 加入 tab1 和 tab2 以便我得到一个新表
consumer_id question_id question_content answer_content
"reghvsdvwe" "rvsvbetvs-dvewdqwavd-363tr13r" "what do you like the food?" "smell is good"
"reghvsdvwe" "rthdcva45-3t342r34y-vdvsdvds" "how do you enjoy the travel ?" "too much traffic in rush hour"
我尝试通过 pyspark 加入他们。但是,我不确定如何使用嵌入式结构/数组分解数组。
谢谢
【问题讨论】:
-
你想加入两个标签的条件是什么?
-
通过“question_id”加入tab1和tab2,谢谢
-
嗨,@user3448011,您介意在我的回答中添加反馈吗?顺便提一句。您是希望通过
some_question列内的question_id字段加入,还是作为独立列添加的新question_id加入。使用后者与您想要的结果中显示的不匹配。 -
我需要通过“some_question”中的“question_id”加入他们,谢谢!
标签: python sql dataframe pyspark hive