Here is a solution. It's a bit of a hack, but it works no matter how many keys you have.
udf0.py
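#!/usr/bin/python
import sys
from collections import Counter

# Hive's TRANSFORM sends each row on stdin with its columns separated by tabs.
for line in sys.stdin:
    words = line.strip().split('\t')
    c = Counter()

    for word in words:
        # Parse one column, e.g. 'a:1,b:2', into a dict of ints.
        d = {}
        s = word.split(',')
        for ss in s:
            k, v = ss.split(':')
            d[k] = int(v)

        # Counter.update adds to existing counts instead of replacing them.
        c.update(d)

    print(','.join([str(k) + ':' + str(v) for k, v in c.items()]))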
udf1.py
#!/usr/bin/python
import sys

for line in sys.stdin:
    # w0 is the merged dictionary string, w1 is the comma-separated list of all keys.
    w0, w1 = line.strip().split('\t')
    out = {}
    d = {}

    s0 = w0.strip().split(',')
    s1 = w1.strip().split(',')

    # Rebuild the dictionary produced by udf0.py.
    for ss in s0:
        k, v = ss.split(':')
        d[k] = int(v)

    # Keep the existing value for each known key, or use 0 if the key is missing.
    for key in s1:
        out[key] = d.get(key, 0)

    print(','.join([str(k) + ':' + str(v) for k, v in out.items()]))
Hive query:
add file /home/username/udf0.py;
add file /home/username/udf1.py;

SELECT TRANSFORM(dict, unique_keys)
       USING 'python udf1.py'
       AS (final_map STRING)
FROM (
  SELECT DISTINCT dict
    , CONCAT_WS(',', unique_keys) as unique_keys
  FROM (
    SELECT dict
      , COLLECT_SET(keys) OVER () AS unique_keys
    FROM (
      SELECT dict
        , keys
      FROM (
        SELECT dict
          , map_keys(str_to_map(dict)) AS key_arr
        FROM (
          SELECT TRANSFORM (col1, col2)
                 USING 'python udf0.py'
                 AS (dict STRING)
          FROM db.tbl ) x ) z
      LATERAL VIEW EXPLODE(key_arr) exptbl AS keys ) a ) b ) c
Output:
a:6,b:2,c:6,d:0
a:21,b:7,c:0,d:5
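Explanation:
The first UDF takes your strings, converts them to a Python dictionary, and updates the keys (i.e., adds together the values that have matching keys). Since you don't know the actual keys in advance, you need to extract the keys from each dictionary (map_keys() in the Hive query), explode the table, and then collect them back into a unique set. At that point you have every possible key from any dictionary. From there, the second UDF reads the dictionary created by the first UDF, checks whether each key is present, and if it isn't, sets its value to zero.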
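
To make the three steps easier to follow, here is a minimal sketch that runs the same logic in plain Python, outside Hive. The function names (parse, merge_row, zero_fill, fmt) are made up for this illustration, and the two input rows are hypothetical values picked so that the result matches the output shown above; they are not taken from the original table.

```python
from collections import Counter


def parse(s):
    """Turn a string such as 'a:1,b:2' into {'a': 1, 'b': 2}."""
    return {k: int(v) for k, v in (pair.split(':') for pair in s.split(','))}


def merge_row(columns):
    """Step 1 (what udf0.py does): sum the values of matching keys in one row."""
    total = Counter()
    for col in columns:
        total.update(parse(col))  # Counter.update adds counts, it does not overwrite
    return dict(total)


def zero_fill(merged, all_keys):
    """Step 3 (what udf1.py does): every known key gets a value, defaulting to 0."""
    return {k: merged.get(k, 0) for k in all_keys}


def fmt(d):
    """Serialize a dict back to the 'k:v,k:v' string format."""
    return ','.join('{}:{}'.format(k, v) for k, v in d.items())


# Hypothetical (col1, col2) rows, assumed for illustration only.
rows = [
    ('a:1,b:2,c:3', 'a:5,c:3'),
    ('a:10,b:7', 'a:11,d:5'),
]

merged = [merge_row(r) for r in rows]

# Step 2 (map_keys + EXPLODE + COLLECT_SET in the Hive query): union of all keys.
# Sorted here only to make the output order deterministic.
all_keys = sorted(set().union(*merged))

result = [fmt(zero_fill(m, all_keys)) for m in merged]
for line in result:
    print(line)
```

In Hive the key order comes from COLLECT_SET and is not guaranteed, so the sorting step is a choice made for this sketch rather than something the query does.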