【Posted】: 2017-06-29 02:16:53
【Question】:
I have a table in Hive whose data comes from an SAP system. The table has the following columns and data:
+=======================================================================+
| document_number | year | cost_centre | vendor_account_number | amount |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        |                       | 123.5  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 25.96  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        |                       | 586    |
+-----------------------------------------------------------------------+
As shown above, the vendor_account_number column has a value in only 1 row, and I want to apply that value to all the remaining rows.
The expected output is:
+=======================================================================+
| document_number | year | cost_centre | vendor_account_number | amount |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 123.5  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 25.96  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 586    |
+-----------------------------------------------------------------------+
To achieve this, I wrote the following CTE in Hive:
with non_blank_account_no as(
select document_number, vendor_account_number
from my_table
where vendor_account_number != ''
)
I then do a self left outer join as follows:
select
a.document_number, a.year,
a.cost_centre, a.amount,
b.vendor_account_number
from my_table a
left outer join non_blank_account_no b on a.document_number = b.document_number
where a.document_number = ' '
But I get duplicated output, as shown below:
+=======================================================================+
| document_number | year | cost_centre | vendor_account_number | amount |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 123.5  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 25.96  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 586    |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 123.5  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 25.96  |
+-----------------------------------------------------------------------+
| 1               | 2016 | XZ10        | 1234567890            | 586    |
+-----------------------------------------------------------------------+
Can anyone help me understand what is wrong with my Hive query?
【Discussion】:
-
What does the CTE output? Is it only 1 row?
-
@Andrew, yes, the CTE should select only the 1 row where vendor_account_number is not blank.
-
I know it should be only 1 row; the question is how many rows it actually returns. It certainly looks like you are getting a cross join.
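
One way to check the point raised above is to count the non-blank rows per document directly. The sketch below reuses the table and column names from the question. The first statement lists every document_number for which the CTE's filter matches more than one row; each extra match repeats that document's rows in the left outer join. The second statement is a possible alternative to the self join, not the asker's method: it fills the value with a window MAX. Partitioning by year as well as document_number is an assumption (the question joins on document_number only), made in case the same document number recurs in another year. It also assumes vendor_account_number is a string column in which '' marks a blank, since MAX then prefers any non-empty value.

```sql
-- Diagnostic: which documents have more than one non-blank vendor row?
-- Any count above 1 multiplies that document's rows in the left outer join.
SELECT document_number,
       COUNT(*) AS non_blank_rows
FROM my_table
WHERE vendor_account_number != ''
GROUP BY document_number
HAVING COUNT(*) > 1;

-- Possible alternative without a join (assumption: one vendor per document and year).
-- '' sorts below any non-empty string, so MAX returns the non-blank value.
SELECT document_number,
       year,
       cost_centre,
       MAX(vendor_account_number)
         OVER (PARTITION BY document_number, year) AS vendor_account_number,
       amount
FROM my_table;
```

If a partition holds several different non-blank account numbers, MAX silently keeps the largest one, so it is worth running the diagnostic first.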