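Just adding an overview of how I solved this. It's a bit hacky, but it's the fastest approach I've found, and it scales well.
The input table looks like this: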
{
"ip": "130.211.149.140",
"ip_int": "2194904460",
"ip_part1": "130",
"ip_part2": "211",
"ip_part3": "149",
"ip_part4": "140",
"num_requests": "6811"
}
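For anyone wondering where ip_int and the ip_part columns come from, here is a minimal Python sketch of the conversion. The helper names (`ip_to_int`, `to_input_row`) are made up for illustration and are not part of my pipeline; the only thing taken from the tables is the column layout.

```python
def ip_to_int(ip: str) -> int:
    """Pack a dotted-quad IPv4 string into a single 32-bit integer."""
    parts = [int(p) for p in ip.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid IPv4 address: {ip!r}")
    a, b, c, d = parts
    # Each octet occupies one byte, most significant first.
    return (a << 24) | (b << 16) | (c << 8) | d


def to_input_row(ip: str, num_requests: int) -> dict:
    """Build a row shaped like the input table (all values as strings)."""
    p = ip.split(".")
    return {
        "ip": ip,
        "ip_int": str(ip_to_int(ip)),
        "ip_part1": p[0],
        "ip_part2": p[1],
        "ip_part3": p[2],
        "ip_part4": p[3],
        "num_requests": str(num_requests),
    }
```

Calling `ip_to_int("130.211.149.140")` gives 2194904460, which matches the ip_int in the sample row.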
The lookup table looks like this:
{
"de_ip_key": "DE18_92.66.156.93_92.66.156.112",
"ip_key": "92.66.156.93_92.66.156.112",
"ip_from_int": "1547869277",
"ip_to_int": "1547869296",
"ip_from": "92.66.156.93",
"ip_to": "92.66.156.112",
"naics_code": "518210",
"ip_from_part1": "92",
"ip_from_part2": "66",
"ip_from_part3": "156",
"ip_from_part4": "93",
"ip_to_part1": "92",
"ip_to_part2": "66",
"ip_to_part3": "156",
"ip_to_part4": "112"
}
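So I join on parts 1 and 2 of the IP address as a way to reduce the search space (the from/to ranges in my lookup table tend not to be wide enough to span across part 1 and part 2 values; if they were, this approach would fail).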
select
ip,
ip_int,
-- pick first info from de
first(ip_key) as ip_key,
first(de_ip_key) as de_ip_key,
first(naics_code) as naics_code
from
(
select
ip as ip,
ip_int as ip_int,
ip_key as ip_key,
de_ip_key as de_ip_key,
naics_code as naics_code
from
-- join based on part 1 and 2 of ip from range
(
select
input.ip as ip,
input.ip_int as ip_int,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.ip_key,null) as ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.de_ip_key,null) as de_ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.naics_code,null) as naics_code
from
[ip.lookup_input_tbl] input
left outer join each
[digital_element.data_naics_code] de
on
input.ip_part1=de.ip_from_part1
and
input.ip_part2=de.ip_from_part2
group by 1,2,3,4,5
),
-- join based on part 1 and 2 of ip to range
(
select
input.ip as ip,
input.ip_int as ip_int,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.ip_key,null) as ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.de_ip_key,null) as de_ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.naics_code,null) as naics_code
from
[ip.lookup_input_tbl] input
left outer join each
[digital_element.data_naics_code] de
on
input.ip_part1=de.ip_to_part1
and
input.ip_part2=de.ip_to_part2
group by 1,2,3,4,5
)
group by 1,2,3,4,5
-- order so null records from either join go to bottom and get left behind on the first group by
order by ip_int,ip_key desc
)
group by 1,2
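So it basically blows out the data (via an equality join on parts 1 and 2 of the IP address, against both the ip_from and ip_to addresses) and then reduces it back down in the group by using the if ... between expressions (doing it this way rather than in a where clause ensures you get a proper left outer join, so you can also see which records you processed but that have no info in the lookup table).
Defo not the prettiest, and there are probably one or two ways to optimize it, but it works for me for now and looks up 500K input IP addresses against a 16M-record lookup table in 10-20 seconds.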
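To make the blow-out-then-filter idea concrete, here is a rough Python sketch of the same logic on in-memory data. It is not the BigQuery query itself: the function names (`lookup`, `prefix`, `ip_to_int`) and the tuple layout for ranges are assumptions made up for illustration, and it simply keeps the first matching range rather than relying on ordering plus first().

```python
from collections import defaultdict


def ip_to_int(ip: str) -> int:
    """Pack a dotted-quad IPv4 string into a 32-bit integer."""
    a, b, c, d = (int(p) for p in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def prefix(ip: str) -> tuple:
    """Parts 1 and 2 of the address, used as the equality-join key."""
    return tuple(ip.split(".")[:2])


def lookup(input_ips, ranges):
    """ranges is a list of (ip_from, ip_to, naics_code) tuples (hypothetical layout)."""
    by_from = defaultdict(list)  # mirrors the join on ip_from_part1/ip_from_part2
    by_to = defaultdict(list)    # mirrors the join on ip_to_part1/ip_to_part2
    for r in ranges:
        by_from[prefix(r[0])].append(r)
        by_to[prefix(r[1])].append(r)

    result = {}
    for ip in input_ips:
        n = ip_to_int(ip)
        key = prefix(ip)
        # "Blow out": every range sharing parts 1 and 2 with this IP is a candidate.
        candidates = by_from.get(key, []) + by_to.get(key, [])
        match = None
        for ip_from, ip_to, code in candidates:
            # "Reduce": keep a candidate only if the IP is really inside the range.
            if ip_to_int(ip_from) <= n <= ip_to_int(ip_to):
                match = (f"{ip_from}_{ip_to}", code)
                break
        # Unmatched IPs stay in the output with None, like the left outer join.
        result[ip] = match
    return result
```

For example, `lookup(["92.66.156.100", "130.211.149.140"], [("92.66.156.93", "92.66.156.112", "518210")])` returns the range key and NAICS code for the first address and `None` for the second.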
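On the caveat about wide ranges: since the query joins on both the from and the to prefixes, a range is only at risk if some address inside it has parts 1 and 2 that differ from both ends. Here is a small sketch, under that assumption, for spotting such ranges in a lookup table before trusting the results. The function names are hypothetical; the row keys are the ip_from_int and ip_to_int columns shown above.

```python
def is_covered_by_prefix_join(ip_from_int: int, ip_to_int: int) -> bool:
    """True if every IP in the range shares parts 1 and 2 with ip_from or ip_to."""
    # Shifting right by 16 bits leaves parts 1 and 2 as one number.
    # A gap of 2 or more means there is at least one whole prefix in between
    # that neither equality join can reach.
    return (ip_to_int >> 16) - (ip_from_int >> 16) <= 1


def uncovered_ranges(rows):
    """Return the lookup rows whose range is too wide for the part 1/part 2 join."""
    return [
        row for row in rows
        if not is_covered_by_prefix_join(int(row["ip_from_int"]), int(row["ip_to_int"]))
    ]
```

The sample range 92.66.156.93 to 92.66.156.112 passes this check; a range such as 10.1.0.0 to 10.3.255.255 does not, because addresses under 10.2 would never be joined.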