How to fix the seed in Hive

【Question title】: How to fix seed in hive
【Posted】: 2018-09-17 10:59:30
【Question】:

I'm running an experiment and need to fix the audiences for the test and control groups. This is the query I'm using:

 select consumer_id,
  case when rand(5555)<0.5 then 'control'
       else 'experiment' 
   end as groups
from my_table

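If I create two tables with this same query and join them, they have the same split. But if I join the two subqueries within a single query, each one gets a different split.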

select a.groups,b.groups,count(*) from
 (select consumer_id,
  case when rand(5555)<0.5 then 'control'
       else 'experiment' 
   end as groups
from my_table) a
left join 
 (select consumer_id,
        case when rand(5555)<0.5 then 'control'
             else 'experiment' 
         end as groups
from my_table) b on a.consumer_id = b.consumer_id
group by a.groups,b.groups;

Any idea why this happens, and which function I can use to seed in Hive?

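【Comments】:

Because of the join, the "left" random() call may not be evaluated for some rows. If you first create a temporary table using the same random filter, that should work. That said, in most cases applying a hash function to customer_id and taking a mod is the way to go (for example, hash(customer_id) mod 100 = 1). Perhaps, if the customer IDs are already "random", you don't even need the hash function.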
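The two suggestions in this comment could be sketched roughly as follows. This is only an illustration, not the commenter's exact code: `consumer_groups` is a hypothetical table name, the column is written as `consumer_id` to match the question's query, and the 50/50 threshold is an assumption carried over from the question. `hash` and `pmod` are Hive built-in functions; `pmod` is used instead of a plain mod so that the bucket stays non-negative when `hash` returns a negative integer.

```sql
-- Option 1: materialize the assignment once, then reuse the stored result.
-- consumer_groups is a hypothetical name; later queries read the saved
-- groups instead of calling rand() again.
create temporary table consumer_groups as
select consumer_id,
       case when rand(5555) < 0.5 then 'control'
            else 'experiment'
       end as groups
from my_table;

-- Option 2: derive the group from the ID itself, with no random call.
-- The same consumer_id always lands in the same bucket (0..99),
-- so every subquery that repeats this expression gets the same split.
select consumer_id,
       case when pmod(hash(consumer_id), 100) < 50 then 'control'
            else 'experiment'
       end as groups
from my_table;
```

With Option 2 the split depends only on the ID, so it does not matter how many times the table is scanned or in what order rows arrive. The split is only approximately 50/50, depending on how evenly the hashed IDs spread across buckets.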

Tags: hive seed seeding

【Answer 1】:

I ran into the same bug when querying a table twice (it has nothing to do with the join itself). Try setting

hive.optimize.index.filter=false;

Let me know whether it works. If it doesn't, I think there is one more property involved in this bug with querying a table twice.
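One way to try this suggestion, as a sketch: issue the setting with Hive's `set` command in the same session, then rerun the join from the question. Whether the setting actually resolves the problem is not confirmed in this thread.

```sql
-- Apply the suggested property for the current session only.
set hive.optimize.index.filter=false;

-- Rerun the self-join from the question as a check.
select a.groups, b.groups, count(*)
from (select consumer_id,
             case when rand(5555) < 0.5 then 'control'
                  else 'experiment'
             end as groups
      from my_table) a
left join
     (select consumer_id,
             case when rand(5555) < 0.5 then 'control'
                  else 'experiment'
             end as groups
      from my_table) b
  on a.consumer_id = b.consumer_id
group by a.groups, b.groups;
```

If both subqueries now produce the same split, and assuming consumer_id is unique in my_table, only the control/control and experiment/experiment combinations should appear in the result.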

【Comments】: