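【Question title】:SQL statistical sampling
【Posted】:2012-10-08 00:33:29
【Question description】:

I am looking for some expert SQL help with a tricky statistics problem I have run into.

What I am trying to do is pull a statistically balanced sample from an unbalanced set of user profiles. Doing this for a single profile attribute at a time (gender, for example) would be fairly simple. Doing it across multiple dimensions at once takes more sophistication.

For the sake of argument, suppose I have this table.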

Profile.userID  
Profile.Gender  
Profile.Age  
Profile.Income

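Suppose I want to pull a pool of profiles from the mix so that the new sample of users roughly matches all of the following characteristics: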

50% male, 50% female
30% young, 40% middle age, 40% old
40% low income, 40% middle income, 20% high income

Does anyone have any ideas on how to achieve this?

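【Question discussion】:

  • What prevents you from drawing one random record at a time until the sample set meets your requirements?
  • How do I keep it from constantly going out of balance? Say I only need one more female record, but pulling that record throws off my age and income balance...?
  • 30% young, 40% middle age, 40% old != 100%. Is there an overlap between young and middle age in your ranges?
  • Sorry, that was just bad math in my example. It should be 30, 40, 30.

Tags: sql sql-server statistics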


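【Solution 1】:

What you have is a sampling problem. The key to solving it is to split the data into separate groups, one for each combination of the three variables. Then, for each group, calculate the product of the marginal probabilities (the percentages you listed are the marginal probabilities). Finally, normalize across all 18 groups.

For example, the Male-Young-Low group gets a value of 0.5*0.3*0.4 = 0.06. You repeat this for all 18 groups and then normalize to percentages (that is, divide each value by the sum of all the values, which is 1.1 here). The result is as follows: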

Gender  Age     Income  Product Normalized
Male    Young   Low     0.06    5.5%
Male    Young   Middle  0.06    5.5%
Male    Young   High    0.03    2.7%
Male    Middle  Low     0.08    7.3%
Male    Middle  Middle  0.08    7.3%
Male    Middle  High    0.04    3.6%
Male    Old     Low     0.08    7.3%
Male    Old     Middle  0.08    7.3%
Male    Old     High    0.04    3.6%
Female  Young   Low     0.06    5.5%
Female  Young   Middle  0.06    5.5%
Female  Young   High    0.03    2.7%
Female  Middle  Low     0.08    7.3%
Female  Middle  Middle  0.08    7.3%
Female  Middle  High    0.04    3.6%
Female  Old     Low     0.08    7.3%
Female  Old     Middle  0.08    7.3%
Female  Old     High    0.04    3.6%
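
As a cross-check, the two numeric columns above can be reproduced with a short script. This is a minimal sketch in Python, not part of the original answer; the dictionary and variable names are illustrative, and the marginals are the 30/40/40 age split as originally written in the question, which is why the products add up to 1.1 and need normalizing.

```python
from itertools import product

# Target marginals as written in the question (the age values sum to 1.1).
gender = {"Male": 0.5, "Female": 0.5}
age = {"Young": 0.3, "Middle": 0.4, "Old": 0.4}
income = {"Low": 0.4, "Middle": 0.4, "High": 0.2}

# Product of the three marginals for each of the 2 * 3 * 3 = 18 groups.
raw = {
    (g, a, i): gender[g] * age[a] * income[i]
    for g, a, i in product(gender, age, income)
}

# Normalize so that the 18 rates sum to 1.
total = sum(raw.values())
rates = {group: value / total for group, value in raw.items()}

for (g, a, i), rate in rates.items():
    print(f"{g:<7} {a:<7} {i:<7} {raw[(g, a, i)]:.2f}  {rate:.1%}")
```

With the corrected 30/40/30 split the products already sum to 1, so the normalization step leaves them unchanged.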

These normalized values become the sampling rate for each group. Here is pseudo-SQL that actually does the sampling:

-- @SampleSize (assumed name) is the desired total sample size;
-- each group contributes at most SamplingRate * @SampleSize rows.
declare @SampleSize int = 1000;

with SamplingRates as (
    select 'Male' as gender, 'Young' as age, 'Low' as income, 0.055 as SamplingRate
    union all . . .   -- one row per group, 18 in total
)
select t.*
from (select p.*,
             -- newid() gives a random order within each group (SQL Server)
             row_number() over (partition by gender, age, income order by newid()) as seqnum
      from Profile p
     ) t join
     SamplingRates sr
     on t.gender = sr.gender and t.age = sr.age and t.income = sr.income and
        t.seqnum <= sr.SamplingRate * @SampleSize
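
The same per-group idea can be tried outside the database. The following is a minimal Python sketch under stated assumptions, not the answer's own code: `stratified_sample`, `sample_size`, and the row layout (dicts with `gender`, `age`, `income` keys) are all hypothetical, and each group is capped at its rate times an assumed total sample size.

```python
import random
from collections import defaultdict

def stratified_sample(rows, rates, sample_size, seed=None):
    """Take up to rates[group] * sample_size random rows from each group."""
    rng = random.Random(seed)

    # Partition the rows by (gender, age, income), like PARTITION BY in SQL.
    by_group = defaultdict(list)
    for row in rows:
        by_group[(row["gender"], row["age"], row["income"])].append(row)

    sample = []
    for group, members in by_group.items():
        quota = int(rates.get(group, 0) * sample_size)  # rounds down
        rng.shuffle(members)             # random order within the group
        sample.extend(members[:quota])   # keep the first `quota` rows
    return sample

# Made-up, unbalanced data: 900 males and 100 females in a single age/income cell.
rows = [
    {"userID": n, "gender": "Male" if n % 10 else "Female", "age": "Young", "income": "Low"}
    for n in range(1000)
]
rates = {("Male", "Young", "Low"): 0.5, ("Female", "Young", "Low"): 0.5}
sample = stratified_sample(rows, rates, sample_size=100, seed=1)
```

If a group holds fewer records than its quota, it contributes only what it has, so the sample comes up short for that group.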

【Discussion】:

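    【Solution 2】:

    This is how I would do it, assuming: 30% young, 40% middle age, 30% old

    Taking the lowest common denominator, your pool size = 5x5x3x4x2x4 = 2400

    You have 18 queries that fill your pool INTO a TEMP TABLE. Repeat all 18 queries to get a larger pool. Below is an idea of how an ideal pool is distributed and what each query looks like. You can also introduce some randomness into each query. There was an earlier post about doing that.

    This may not be very elegant, but it should produce a balanced pool.

    Your first query would look something like this: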

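    INSERT INTO #Pool  -- #Pool: assumed temp table with the same columns as Profile
    SELECT TOP (72) * FROM Profile
    WHERE Gender = 'Male' AND Age = 'Young' AND Income = 'High'  -- assumes category labels
      AND userID NOT IN (SELECT userID FROM #Pool)
    ORDER BY NEWID()   -- random pick within the group (SQL Server)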
    

    And so on. Hope that helps. Good question.
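
To see the fill-and-exclude pattern run end to end, here is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration under assumptions, not the answer's code: the `Pool` table name and the generated rows are made up, and SQLite syntax (`LIMIT`, `RANDOM()`) differs from SQL Server's (`TOP`, `NEWID()`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Profile (userID INTEGER PRIMARY KEY, Gender TEXT, Age TEXT, Income TEXT)"
)
# Made-up data: 200 young, high-income males for the query to draw from.
conn.executemany(
    "INSERT INTO Profile VALUES (?, 'Male', 'Young', 'High')",
    [(n,) for n in range(200)],
)
# Empty temp table with the same columns as Profile.
conn.execute("CREATE TEMP TABLE Pool AS SELECT * FROM Profile WHERE 0")

# One of the 18 fill queries: random rows from one group, skipping IDs already pooled.
fill = """
    INSERT INTO Pool
    SELECT * FROM Profile
    WHERE Gender = ? AND Age = ? AND Income = ?
      AND userID NOT IN (SELECT userID FROM Pool)
    ORDER BY RANDOM()
    LIMIT ?
"""
conn.execute(fill, ("Male", "Young", "High", 72))  # first pass
conn.execute(fill, ("Male", "Young", "High", 72))  # repeated for a larger pool

pool_size = conn.execute("SELECT COUNT(*) FROM Pool").fetchone()[0]
distinct_ids = conn.execute("SELECT COUNT(DISTINCT userID) FROM Pool").fetchone()[0]
```

The `NOT IN` check only matters on the repeat passes: the 18 groups are disjoint, so within a single pass no query can pick a row another query already took.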

    CREATE TEMP TABLE
    480 high income
        144 young
            72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
            72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
        192 middle age
            96 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96]
            96 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96]
        144 old
            72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
            72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
    
    960 middle income
        288 young
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        384 middle age 
            192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
            192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
        288 old
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
    
    960 low income
        288 young
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        384 middle age
            192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
            192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
        288 old
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
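
The 18 LIMIT values in the listing above can be derived rather than typed by hand. This is a minimal Python sketch, assuming the corrected 30/40/30 age split and a pool of 2400; the variable names are illustrative, and `Fraction` is used so the check for whole-number quotas is exact.

```python
from fractions import Fraction as F
from itertools import product

income = {"high": F(20, 100), "middle": F(40, 100), "low": F(40, 100)}
age = {"young": F(30, 100), "middle age": F(40, 100), "old": F(30, 100)}
gender = {"male": F(1, 2), "female": F(1, 2)}
pool_size = 2400

# Quota for each cell = product of the three marginals * pool size.
quotas = {
    (i, a, g): income[i] * age[a] * gender[g] * pool_size
    for i, a, g in product(income, age, gender)
}

# Every quota must be a whole number of rows for the pool to be exactly balanced.
assert all(q.denominator == 1 for q in quotas.values())

for (i, a, g), q in quotas.items():
    print(f"{i} income / {a} / {g}: LIMIT {int(q)}")
```

For these marginals any pool size that is a multiple of 100 also gives whole-number quotas, so the same script can be used to size a smaller or larger pool.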
    

    【Discussion】:
