【Question title】: How to write a BigQuery query to find rows where a specific column changes in a table
【Posted】: 2020-05-02 13:55:33
【Question】:

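I need to write a query for a table that records the dates on which the value in a column changes. The table is such that the following query produces the result shown below.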

  SELECT
    employeeId,
    date,
    location,
  FROM
    MY_TABLE 
  ORDER BY
    employeeId, date, location

Result:

+----+--------------+------------+------------------+
|    |   employeeId | date       | location         |
+====+==============+============+==================+
|  0 |         2467 | 2016-04-31 | COUNTRY A        |
+----+--------------+------------+------------------+
|  1 |         2467 | 2016-05-31 | COUNTRY A        |
+----+--------------+------------+------------------+
|  2 |         2467 | 2016-06-31 | COUNTRY A        |
+----+--------------+------------+------------------+
|  3 |         2467 | 2016-07-31 | COUNTRY A        |
+----+--------------+------------+------------------+
|  4 |         2467 | 2016-08-31 | COUNTRY B        |
+----+--------------+------------+------------------+
|  5 |         2467 | 2017-09-31 | COUNTRY A        |
+----+--------------+------------+------------------+

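For each employeeId, if the location changes between two dates, I want the old date, the old location, the new date, and the new location. This is the query I wrote: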

WITH
  cte AS (
  SELECT
    employeeId,
    date,
    location,
  FROM
    MY_TABLE),
  movements AS (
  SELECT
    a.employeeId AS EMPLOYEEID,
    b.employeeId AS EMPLOYEEID_NEW,
    a.date AS OLD_DATE,
    b.date AS NEW_DATE,
    a.location AS OLD_LOCATION,
    b.location AS NEW_LOCATION
  FROM
    cte a
  INNER JOIN
    cte b
  ON
    a.employeeId = b.employeeId
  WHERE
    b.date > a.date
    AND DATE_DIFF(b.date, a.date, MONTH) = 1
    AND a.location <> b.location
)
SELECT
  NEW_DATE,
  OLD_DATE, 
  COUNT(EMPLOYEEID) AS MOVED,
  OLD_LOCATION,
  NEW_LOCATION
FROM
  movements
GROUP BY
  NEW_DATE,
  OLD_DATE,
  EMPLOYEEID,
  OLD_LOCATION,
  NEW_LOCATION
ORDER BY
  MOVED,
  NEW_DATE,
  OLD_LOCATION,
  NEW_LOCATION

I get the following result:

+----+------------+------------+---------+----------------+----------------+
|    | NEW_DATE   | OLD_DATE   |   MOVED | OLD_LOCATION   | NEW_LOCATION   |
+====+============+============+=========+================+================+
|  0 | 2016-07-01 | 2016-06-01 |       1 | COUNTRY A      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  1 | 2016-07-01 | 2016-06-30 |       1 | COUNTRY A      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  2 | 2016-07-31 | 2016-06-30 |       1 | COUNTRY A      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  3 | 2016-07-31 | 2016-06-01 |       1 | COUNTRY A      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  4 | 2016-08-01 | 2016-07-01 |       1 | COUNTRY C      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  5 | 2016-08-01 | 2016-07-31 |       1 | COUNTRY C      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  6 | 2016-08-31 | 2016-07-01 |       1 | COUNTRY C      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+
|  7 | 2016-08-31 | 2016-07-31 |       1 | COUNTRY C      | COUNTRY B      |
+----+------------+------------+---------+----------------+----------------+

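The result does not look correct. I very much doubt that the number of moves between two countries is always 1. Could you take a look at this query and let me know where I went wrong? Also, FYI, I have obfuscated the data provided here: basically, I changed the country names and the dates.

【Comments】:

    Tags: mysql sql postgresql google-bigquery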


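    【Solution 1】:

    I could not determine from your query how you were selecting the previous date and location. However, I was able to simplify it and achieve your goal.

    First, in order to check a few cases, I changed your dummy data slightly. I used the following: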

    employeeId|date|location
    2467|2016-04-30|COUNTRY A
    2467|2016-05-31|COUNTRY A
    2467|2016-06-30|COUNTRY B
    2467|2016-07-31|COUNTRY A
    2467|2016-08-31|COUNTRY B
    2467|2017-09-30|COUNTRY A
    2468|2017-09-30|COUNTRY A
    2468|2017-09-30|COUNTRY A
    

    Note that employeeId 2467 changed country 4 times.

    I created the following script:

    -- Pair each row with the previous date and location of the same employee.
    WITH data AS (
      SELECT
        employeeid,
        date,
        location,
        LAG(date) OVER (PARTITION BY employeeid ORDER BY date ASC) AS prev_date,
        LAG(location) OVER (PARTITION BY employeeid ORDER BY date ASC) AS prev_loc
      FROM `test-proj-261014.bq_load_codelab.employee`
    )
    -- Keep only the rows whose location differs from the previous one.
    SELECT * FROM data
    WHERE DATE_DIFF(date, prev_date, MONTH) >= 1
      AND prev_loc IS NOT NULL  -- the first row of each employee has no previous row
      AND location <> prev_loc
    ORDER BY date
    

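    As you can see, I used the LAG() function to select the previous date and location for each row. Note that LAG() returns NULL for the first row of each partition, which is why the filter prev_loc IS NOT NULL is used.

    The output is as follows:

    As you can see, I selected employeeid so the results are easy to check. You can, however, remove this field from the final SELECT statement and retrieve only the fields you want.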
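To make the LAG() logic above easier to follow outside BigQuery, here is a minimal Python sketch that mimics it on the sample data from this answer. It is an illustration, not the query itself: the names `rows`, `month_diff`, and `find_moves` are hypothetical, and `month_diff` is only an assumed approximation of DATE_DIFF(..., MONTH) that compares calendar months and ignores the day of the month.

```python
from datetime import date
from itertools import groupby

# Sample rows from the answer: (employeeid, date, location).
rows = [
    (2467, date(2016, 4, 30), "COUNTRY A"),
    (2467, date(2016, 5, 31), "COUNTRY A"),
    (2467, date(2016, 6, 30), "COUNTRY B"),
    (2467, date(2016, 7, 31), "COUNTRY A"),
    (2467, date(2016, 8, 31), "COUNTRY B"),
    (2467, date(2017, 9, 30), "COUNTRY A"),
    (2468, date(2017, 9, 30), "COUNTRY A"),
    (2468, date(2017, 9, 30), "COUNTRY A"),
]


def month_diff(new, old):
    # Assumption: calendar-month difference that ignores the day of the month.
    return (new.year - old.year) * 12 + (new.month - old.month)


def find_moves(rows):
    """Return (employeeid, prev_date, prev_loc, date, location) per location change."""
    moves = []
    # Equivalent of PARTITION BY employeeid ORDER BY date.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for emp, group in groupby(ordered, key=lambda r: r[0]):
        prev = None  # like LAG(): the first row of a partition has no previous row
        for _, d, loc in group:
            if prev is not None:
                prev_d, prev_loc = prev
                if month_diff(d, prev_d) >= 1 and loc != prev_loc:
                    moves.append((emp, prev_d, prev_loc, d, loc))
            prev = (d, loc)
    return moves


moves = find_moves(rows)
for move in moves:
    print(move)
```

On this sample data the sketch reports moves only for employee 2467; the two identical rows for employee 2468 produce none.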

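    Finally, if you want to check how many times each employee moved, you need another piece of code that queries the table above. When you use COUNT(), you cannot also retrieve the old and new dates as above, because you are counting the number of moves per employee and grouping by employeeid. This means you get one number (the count) per employeeid. So in this case, I saved the result above in a temporary table called final_output and queried it as follows: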

    SELECT employeeid, count(employeeid) as MOVED FROM final_output
     GROUP BY employeeid
    

    And the output:

    The employee with ID 2467 moved 4 times within the analyzed time frame.
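The effect of the grouping columns on the count can be sketched in Python with `collections.Counter`. The `final_output` list below is a hypothetical stand-in for the temporary table, filled with the four moves of employee 2467 from the sample data. Counting per employee gives 4, while grouping by the full row (employee, dates, and locations, as the query in the question does) leaves one row per group, so every count is 1.

```python
from collections import Counter

# Hypothetical stand-in for final_output:
# (employeeid, old_date, new_date, old_location, new_location)
final_output = [
    (2467, "2016-05-31", "2016-06-30", "COUNTRY A", "COUNTRY B"),
    (2467, "2016-06-30", "2016-07-31", "COUNTRY B", "COUNTRY A"),
    (2467, "2016-07-31", "2016-08-31", "COUNTRY A", "COUNTRY B"),
    (2467, "2016-08-31", "2017-09-30", "COUNTRY B", "COUNTRY A"),
]

# Like GROUP BY employeeid: one count per employee.
per_employee = Counter(row[0] for row in final_output)

# Like GROUP BY on every selected column: each group holds a single row.
per_full_row = Counter(final_output)

print(per_employee)
print(sorted(per_full_row.values()))
```

Dropping the dates and the employee id from the grouping is what lets the counts grow beyond 1.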

    【Comments】:
