SQL: select data before the first occurrence of a certain value

[Question title]: SQL: select data before first occurrence of a certain value
[Posted]: 2018-11-25 08:13:14
[Question description]:

For example, I have order data from customers, like this:

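from pyspark.sql import SparkSession

# In the pyspark shell or a notebook `spark` already exists; otherwise create it.
spark = SparkSession.builder.getOrCreate()

test = spark.createDataFrame([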
    (0, 1, 1, "2018-06-03"),
    (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"),
    (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"),
    (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05")
])\
  .toDF("order_id", "customer_id", "order_status", "created_at")
test.show()

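Each order has its own status: 1 means newly created but not finished, 3 means paid and finished.

Now I want to analyze where the orders come from:

new customers (no previous purchase)
old customers (have completed a purchase before)

So I want to add a feature to the data above that marks each order as coming from a new or an old customer.

The logic is applied per customer: every order created up to the first order with status 3 (including that order itself) counts as coming from a new customer, and every order after it counts as coming from an old customer.

Put another way: for each customer's orders, sorted by date in ascending order, select the data before the first occurrence of the value 3.

How can I do this in SQL?

I searched around but did not find a good solution. In Python, I suppose I would simply write some loops to get the values.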
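The loop the question has in mind could look roughly like the following plain-Python sketch. The helper name `label_orders` and the use of `order_id` to break ties between orders created on the same date are assumptions, not part of the question.

```python
# Sample rows from the question: (order_id, customer_id, order_status, created_at)
orders = [
    (0, 1, 1, "2018-06-03"),
    (1, 1, 1, "2018-06-04"),
    (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"),
    (4, 1, 1, "2018-06-06"),
    (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"),
    (7, 3, 1, "2018-06-02"),
    (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]


def label_orders(rows):
    """Return {order_id: 'new' or 'old'} (hypothetical helper)."""
    completed = set()  # customers whose first status-3 order was already seen
    labels = {}
    # Walk each customer's orders by date; order_id breaks same-day ties
    # (an assumption, the question only says "sorted by date").
    for order_id, customer_id, status, created_at in sorted(
        rows, key=lambda r: (r[1], r[3], r[0])
    ):
        labels[order_id] = "old" if customer_id in completed else "new"
        if status == 3:
            completed.add(customer_id)
    return labels


labels = label_orders(orders)
for order_id in sorted(labels):
    print(order_id, labels[order_id])
```

With this tie-break, orders 3, 4 and 6 come out as `old` and everything else as `new`; customer 3 never completes an order, so all of that customer's orders stay `new`.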

[Comments]:

You could try using where exists or not exists on the matching customer id and a theta join (order_id

Tags: sql pyspark

[Solution 1]:

This was tested against SQLite:

SELECT order_id, customer_id, order_status, created_at, 
CASE
     WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
     ELSE 'new'  
END AS customer_status
FROM orders o
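To try the query on the question's sample rows, a minimal sketch using Python's built-in `sqlite3` module might look like this. The table name `orders` comes from the query; the column types and the added `ORDER BY` (only there to make the output order deterministic) are assumptions.

```python
import sqlite3

# Sample rows from the question: (order_id, customer_id, order_status, created_at)
rows = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
    "order_status INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

query = """
SELECT order_id, customer_id, order_status, created_at,
CASE
     WHEN order_id > (SELECT MIN(order_id) FROM orders
                      WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
     ELSE 'new'
END AS customer_status
FROM orders o
ORDER BY order_id  -- added here for a stable output order
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

For a customer with no status-3 order (customer 3 here) the subquery returns NULL, the comparison is not true, and the `ELSE` branch labels the row `new`.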

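[Discussion]:

Why use order_id?
Take the two rows with customer_id=2 as an example: both have created_at=2018-06-01. If we compare these two rows by created_at, there is no before or after. If I replace order_id with created_at in the code, both rows come out as new. I think using order_id is safer, unless it has nothing to do with the date the order was placed.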
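The difference described in this reply can be checked with a small sketch that computes both comparisons side by side for customer 2. It uses Python's `sqlite3` module; the column aliases `by_order_id` and `by_created_at` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
    "order_status INTEGER, created_at TEXT)"
)
# Only customer 2's rows from the question: same date, different order_id.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(5, 2, 3, "2018-06-01"), (6, 2, 1, "2018-06-01")],
)

comparison = conn.execute(
    """
    SELECT order_id,
           CASE WHEN order_id > (SELECT MIN(order_id) FROM orders
                                 WHERE customer_id = o.customer_id AND order_status = 3)
                THEN 'old' ELSE 'new' END AS by_order_id,
           CASE WHEN created_at > (SELECT MIN(created_at) FROM orders
                                   WHERE customer_id = o.customer_id AND order_status = 3)
                THEN 'old' ELSE 'new' END AS by_created_at
    FROM orders o
    ORDER BY order_id
    """
).fetchall()

for row in comparison:
    print(row)
# (5, 'new', 'new')
# (6, 'old', 'new')
```

Order 6 is `old` when compared by `order_id` but `new` when compared by `created_at`, because the date alone cannot order two rows from the same day.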

[Solution 2]:

You can do this in Spark with window functions:

select t.*,
       (case when created_at > min(case when order_status = 3 then created_at end) over (partition by customer_id)
             then 'old'
             else 'new'
        end) as customer_status
from test t;

Note that this assigns 'new' to customers who have no order with status 3.
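As a rough check of the window-function query without a Spark session, the same SQL can be run through Python's `sqlite3` module as a stand-in. This assumes the bundled SQLite is version 3.25 or later (the first release with window functions) and uses the column name `order_status` from the question's DataFrame; Spark itself is not exercised here.

```python
import sqlite3

# Sample rows from the question: (order_id, customer_id, order_status, created_at)
rows = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (order_id INTEGER, customer_id INTEGER, "
    "order_status INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)

window_result = conn.execute(
    """
    select t.order_id, t.customer_id,
           (case when created_at > min(case when order_status = 3 then created_at end)
                                   over (partition by customer_id)
                 then 'old'
                 else 'new'
            end) as customer_status
    from test t
    order by t.order_id  -- added here for a stable output order
    """
).fetchall()

for row in window_result:
    print(row)
```

On the sample data only orders 3 and 4 come out as `old`. Order 6 stays `new` because it shares its date with customer 2's first status-3 order, and customer 3 stays `new` because the windowed minimum is NULL.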

You can also write this using join and group by:

select t.*,
       coalesce(t3.customer_status, 'old') as customer_status
from test t left join
     (select t.customer_id, min(created_at) as min_created_at,
             'new' as customer_status
      from test t
      where order_status = 3
      group by t.customer_id
     ) t3
     on t.customer_id = t3.customer_id and
        t.created_at <= t3.min_created_at;
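One detail worth checking in the join form: a customer with no status-3 order never matches the subquery, so `coalesce` labels every one of that customer's orders 'old', whereas the window version labels them 'new'. The sketch below runs the join through Python's `sqlite3` module as a stand-in for Spark, using the `test` table and `order_status` column from the question. The second query is a hypothetical variant, not from the answer, that joins on `customer_id` only and treats a missing match as 'new'.

```python
import sqlite3

# Sample rows from the question: (order_id, customer_id, order_status, created_at)
rows = [
    (0, 1, 1, "2018-06-03"), (1, 1, 1, "2018-06-04"), (2, 1, 3, "2018-06-04"),
    (3, 1, 2, "2018-06-05"), (4, 1, 1, "2018-06-06"), (5, 2, 3, "2018-06-01"),
    (6, 2, 1, "2018-06-01"), (7, 3, 1, "2018-06-02"), (8, 3, 1, "2018-06-02"),
    (9, 3, 1, "2018-06-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (order_id INTEGER, customer_id INTEGER, "
    "order_status INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)

# First status-3 date per customer, shared by both queries below.
first_paid = """
    (select customer_id, min(created_at) as min_created_at
     from test
     where order_status = 3
     group by customer_id) t3
"""

# Join as in the answer: unmatched rows fall through to 'old'.
join_result = conn.execute(
    f"""
    select t.order_id,
           coalesce(case when t3.customer_id is not null then 'new' end, 'old')
    from test t left join {first_paid}
         on t.customer_id = t3.customer_id and t.created_at <= t3.min_created_at
    order by t.order_id
    """
).fetchall()

# Hypothetical variant: join on customer only, NULL minimum means 'new'.
variant_result = conn.execute(
    f"""
    select t.order_id,
           case when t3.min_created_at is null
                     or t.created_at <= t3.min_created_at
                then 'new' else 'old' end
    from test t left join {first_paid}
         on t.customer_id = t3.customer_id
    order by t.order_id
    """
).fetchall()

print([label for _, label in join_result])
print([label for _, label in variant_result])
```

On the sample data the first query marks orders 7, 8 and 9 (customer 3) as `old`, while the variant marks them `new` and otherwise agrees with it. Which behavior is wanted depends on how customers without any completed order should be counted.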
