【Posted】: 2019-05-05 20:44:36
【Question】:
The source table has the following format:
CREATE TABLE IF NOT EXISTS src_table (
str_1 String,
str_2 String,
metric_1 UInt64,
metric_2 UInt8
) ENGINE = Log
The column to denormalize is str_2, and the denormalized table is:
CREATE TABLE IF NOT EXISTS denorm_table (
dt Date,
str_1 String,
attr_1 UInt64,
attr_2 UInt64,
......
attr_1000 UInt64,
attr_1001 UInt8,
attr_1002 UInt8,
.....
attr_2000 UInt8
) ENGINE = MergeTree PARTITION BY (dt) ORDER BY (dt, str_1) SETTINGS index_granularity=8192
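Suppose column str_2 has 1000 distinct values (1 ... 1000), and
attr_1 is the value of column metric_1 where str_2 equals 1,
attr_2 is the value of column metric_1 where str_2 equals 2,
.....
attr_1001 is the value of column metric_2 where str_2 equals 1
...
The denormalizing query is: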
INSERT INTO denorm_table
(dt, user, attr_1, attr_2, ..., attr_1000, attr_1001, attr_2000)
SELECT
'2018-11-01' as dt,
str_1,
arrayElement( groupArray(metric_1), indexOf(groupArray(str_2), '1') ) as attr_1,
arrayElement( groupArray(metric_1), indexOf(groupArray(str_2), '2') ) as attr_2,
......
arrayElement( groupArray(metric_1), indexOf(groupArray(str_2), '1000') ) as attr_1000,
arrayElement( groupArray(metric_2), indexOf(groupArray(str_2), '1001') ) as attr_1001,
.....
arrayElement( groupArray(metric_1), indexOf(groupArray(str_2), '2000') ) as attr_2000
FROM src_table
WHERE str_2 in ('1', '2', .....)
GROUP BY str_1
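With a thousand values of str_2 the statement above is too long to write by hand, so it is presumably generated. The following is a minimal Python sketch of such a generator, not the asker's actual code. The function names attr_expressions and build_insert are hypothetical, the column layout (attr_k from metric_1, attr_(N + k) from metric_2, both for str_2 = k) is taken from the description above, and str_1 is used in the column list on the assumption that the user column is str_1.

```python
def attr_expressions(n_values):
    """Build the attr_* SELECT expressions for str_2 values '1'..'n_values'.

    Assumed layout: attr_k holds metric_1 for str_2 == k, and
    attr_(n_values + k) holds metric_2 for str_2 == k.
    """
    template = ("arrayElement( groupArray({metric}), "
                "indexOf(groupArray(str_2), '{value}') ) as attr_{idx}")
    exprs = []
    for offset, metric in enumerate(("metric_1", "metric_2")):
        for value in range(1, n_values + 1):
            exprs.append(template.format(
                metric=metric, value=value, idx=offset * n_values + value))
    return exprs


def build_insert(n_values, dt="2018-11-01"):
    """Assemble the whole INSERT ... SELECT statement as one string."""
    attrs = attr_expressions(n_values)
    names = ["attr_{}".format(i) for i in range(1, 2 * n_values + 1)]
    in_list = ", ".join("'{}'".format(v) for v in range(1, n_values + 1))
    return (
        "INSERT INTO denorm_table (dt, str_1, {cols})\n"
        "SELECT '{dt}' as dt, str_1,\n  {exprs}\n"
        "FROM src_table\nWHERE str_2 in ({in_list})\nGROUP BY str_1"
    ).format(cols=", ".join(names), dt=dt,
             exprs=",\n  ".join(attrs), in_list=in_list)


if __name__ == "__main__":
    # Small N so the generated statement is easy to inspect.
    print(build_insert(2))
```

Generating the statement also makes it easy to try intermediate sizes between the 750 values that work and the 1000 that time out.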
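With 750 values of column str_2 (1502 columns in the denormalized table) the query works fine.
But when the denormalized table has 2002 columns (that is, 1000 values of str_2), I get a socket.timeout: timed out error: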
File "/usr/lib/python2.7/site-packages/clickhouse_driver/client.py", line 119, in execute
columnar=columnar
File "/usr/lib/python2.7/site-packages/clickhouse_driver/client.py", line 192, in process_ordinary_query
columnar=columnar)
File "/usr/lib/python2.7/site-packages/clickhouse_driver/client.py", line 42, in receive_result
return result.get_result()
File "/usr/lib/python2.7/site-packages/clickhouse_driver/result.py", line 39, in get_result
for packet in self.packet_generator:
File "/usr/lib/python2.7/site-packages/clickhouse_driver/client.py", line 54, in packet_generator
packet = self.receive_packet()
File "/usr/lib/python2.7/site-packages/clickhouse_driver/client.py", line 68, in receive_packet
packet = self.connection.receive_packet()
File "/usr/lib/python2.7/site-packages/clickhouse_driver/connection.py", line 331, in receive_packet
packet.type = packet_type = read_varint(self.fin)
File "/usr/lib/python2.7/site-packages/clickhouse_driver/reader.py", line 38, in read_varint
i = _read_one(f)
File "/usr/lib/python2.7/site-packages/clickhouse_driver/reader.py", line 23, in _read_one
c = f.read(1)
File "/usr/lib64/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
socket.timeout: timed out
Is there a client or server setting that would eliminate the problem?
From the log:
2018.12.04 22:49:26.755926 [ 36 ] {} <Trace> SystemLog (system.query_thread_log): Flushing system log
2018.12.04 22:49:26.756233 [ 139 ] {821ce7ea-94b7-4675-96f5-feccb31b0ebe} <Error> executeQuery: Code: 32, e.displayText() = DB::Exception: Attempt to read after eof, e.what() = DB::Exception (from [::1]:52224) (in query:
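On the client side, the traceback ends in a socket read, so one thing that may be worth trying is a longer socket timeout in clickhouse_driver. The sketch below assumes the driver's connect_timeout and send_receive_timeout connection parameters (both in seconds). The helper names and the values 3600 and 10 are illustrative and do not come from the question, and whether a longer timeout alone is enough for the 2002-column query is not established here.

```python
def connection_kwargs(timeout_seconds=3600, connect_seconds=10):
    """Return keyword arguments for a clickhouse_driver Client.

    Both values are in seconds and are illustrative, not recommendations.
    """
    if timeout_seconds <= 0 or connect_seconds <= 0:
        raise ValueError("timeouts must be positive")
    return {
        "connect_timeout": connect_seconds,
        # Socket read/write timeout; the traceback above fails in a socket read.
        "send_receive_timeout": timeout_seconds,
    }


def make_client(host="localhost", **overrides):
    """Create a client with the longer timeouts (host is an assumption)."""
    # Imported lazily so the helper above works without the driver installed.
    from clickhouse_driver import Client
    kwargs = connection_kwargs()
    kwargs.update(overrides)
    return Client(host, **kwargs)
```

A call such as make_client().execute(query) would then wait up to the configured number of seconds for each packet before raising socket.timeout.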
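========= EDIT =========
I modified the query as follows (following @johey's suggestion), and the error has not occurred again.
Instead of running the query over all values of column str_1, I split the data into groups with
WHERE modulo(sipHash64(str_1), 20) = 0: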
INSERT INTO dst_table (....)
SELECT
arrayElement(metric_1_array, indexOf(str_2_array, '1') ) as attr_1,
arrayElement(metric_1_array, indexOf(str_2_array, '2') ) as attr_2,
......
arrayElement(metric_2_array, indexOf(str_2_array, '1') ) as attr_1001,
......
FROM (
SELECT
str_1,
groupArray(metric_1) metric_1_array,
groupArray(metric_2) metric_2_array,
groupArray(str_2) str_2_array
FROM src_table
WHERE modulo(sipHash64(str_1), 20) = 0
AND str_2 in ('1', '2', ......)
GROUP BY str_1
)
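The query above covers only the bucket whose hash remainder is 0. Presumably it is repeated for the remainders 1 through 19 so that every str_1 is processed exactly once. A minimal Python sketch of that loop follows. The names bucket_filter and bucketed_queries and the {bucket_filter} placeholder are hypothetical, and the template is shortened to the inner SELECT for readability.

```python
def bucket_filter(bucket, n_buckets=20):
    """WHERE fragment that selects one hash bucket of str_1."""
    if not 0 <= bucket < n_buckets:
        raise ValueError("bucket must be in [0, n_buckets)")
    return "modulo(sipHash64(str_1), {n}) = {b}".format(n=n_buckets, b=bucket)


def bucketed_queries(query_template, n_buckets=20):
    """Yield one query per bucket.

    query_template must contain the placeholder '{bucket_filter}' and no
    other curly braces, because it is filled in with str.format.
    """
    for bucket in range(n_buckets):
        yield query_template.format(
            bucket_filter=bucket_filter(bucket, n_buckets))


# Shortened template: only the inner SELECT of the modified query.
TEMPLATE = (
    "SELECT str_1, groupArray(metric_1) metric_1_array, "
    "groupArray(metric_2) metric_2_array, groupArray(str_2) str_2_array "
    "FROM src_table WHERE {bucket_filter} GROUP BY str_1"
)

if __name__ == "__main__":
    for query in bucketed_queries(TEMPLATE):
        print(query)
```

Each generated statement can be passed to the client's execute method in turn. Because each str_1 hashes to a single bucket, the 20 runs touch disjoint sets of rows, and each run aggregates roughly one twentieth of the str_1 values.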
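【Comments】:
- This probably won't solve the problem, but maybe you can shorten the query like this:
  INSERT INTO denorm_table (...) SELECT '2018-11-01' as dt, str_1, arrayElement(metric1, indexOf(str2, '1') ) as attr_1, arrayElement(metric1, indexOf(str2, '2') ) as attr_2, ... FROM ( SELECT user, str_1, groupArray(metric_1) metric1, groupArray(metric_2) metric2, groupArray(str_2) str2 FROM src_table WHERE str_2 in ('1', '2', .....) ) s GROUP BY user
- Might be a stupid comment (since I don't understand what you are trying to do), but wouldn't indexOf(groupArray(str_2), '1') also give a "positive" result when str_2 equals, e.g., 1001? Is that what you want?
- This example is not self-contained. What is the user column?
- user is the str_1 column.
- arrayElement( groupArray(metric_1), indexOf(groupArray(str_2), '1') ) ---> I want to retrieve the value of column metric_1 that corresponds to the value '1' of column str_2 (build two arrays from the two columns and use the position of the element in the second array to index the first). @johey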
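The lookup described in the last comment can be modelled in a few lines of Python. This is a simplified model of the two ClickHouse functions, written under these assumptions: indexOf returns the 1-based position of the first element equal to the value, or 0 if there is none, and arrayElement with a non-constant out-of-range index returns the type's default (0 for numbers). Negative indexes are not modelled, and the sample arrays are made up.

```python
def index_of(array, value):
    """1-based position of the first element equal to value, or 0 if absent."""
    for position, item in enumerate(array, start=1):
        if item == value:
            return position
    return 0


def array_element(array, index, default=0):
    """1-based lookup that returns a default for an out-of-range index.

    Simplified: negative indexes are treated as out of range here.
    """
    if 1 <= index <= len(array):
        return array[index - 1]
    return default


# Made-up rows for a single str_1 group; both arrays share the row order.
str_2_array = ["2", "1001", "1"]
metric_1_array = [20, 999, 10]

# Elements are compared for equality, so '1001' does not match '1'.
attr_1 = array_element(metric_1_array, index_of(str_2_array, "1"))
# A value that never occurs falls back to the default.
attr_7 = array_element(metric_1_array, index_of(str_2_array, "7"))

if __name__ == "__main__":
    print(attr_1, attr_7)  # 10 0
```

Under this model the '1001' concern raised in the comments does not arise, because the comparison is on whole elements rather than substrings. A missing value silently becomes 0, which cannot be told apart from a stored metric of 0.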
Tags: sql denormalization clickhouse