【Question title】: Finding the time spent by id in each location
【Posted】: 2017-03-16 03:35:29
【Question】:

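I am trying to find out how long each id spent at its starting location.

For example, in the dataset below, the starting Geohash for id 286 is "abcdef". Geohash "abcdef" appears in 3 rows for id 286. So the total time spent by id 286 is (2017-02-13 12:33:02.063 UTC - 2017-02-13 12:24:36 UTC) plus (2017-02-13 12:34:29 UTC - 2017-02-13 12:33:08 UTC).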

        Id         DateTime                      Latitude     Longitude   Geohash
      0 286        2017-02-13 12:24:36 UTC       40.769230  -73.01205     abcdef
      1 286        2017-02-13 12:33:02.063 UTC   40.769230  -73.01202     abcdef
      2 286        2017-02-13 12:33:05.063 UTC   40.769230  -73.01202     cvzvvv
      3 286        2017-02-13 12:33:08 UTC       40.769280  -73.01212     abcdef
      4 286        2017-02-13 12:34:29 UTC       40.769306  -73.01207     hsffds
      5 368        2017-02-13 00:23:07.063 UTC   33.392820  -111.8262     weruio
      6 141        2017-02-13 00:00:41 UTC       33.287117  -111.84150    oqruqq

I don't know whether the pandas DataFrame has a function that does this.

Any help would be greatly appreciated!

【Comments】:

    Tags: pandas google-bigquery


    【Solution 1】:

    The following is for BigQuery Standard SQL:

    #standardSQL
    SELECT 
      Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
    FROM (
      SELECT 
        Id, Geohash, DateTime, 
        TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
        FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
      FROM yourTable
    )
    WHERE Geohash = FirstGeohash
    GROUP BY Id, Geohash  
    

    You can test it with the dummy data from your example:

    #standardSQL
    WITH yourTable AS (
      SELECT 286 AS Id, TIMESTAMP '2017-02-13 12:24:36 UTC' AS DateTime, 40.769230 AS Latitude, -73.01205 AS Longitude, 'abcdef' AS Geohash UNION ALL
      SELECT 286, TIMESTAMP '2017-02-13 12:33:02.063 UTC', 40.769230, -73.01202, 'abcdef' UNION ALL
      SELECT 286, TIMESTAMP '2017-02-13 12:33:05.063 UTC', 40.769230, -73.01202, 'cvzvvv' UNION ALL
      SELECT 286, TIMESTAMP '2017-02-13 12:33:08 UTC', 40.769280, -73.01212, 'abcdef' UNION ALL
      SELECT 286, TIMESTAMP '2017-02-13 12:34:29 UTC', 40.769306, -73.01207, 'hsffds' UNION ALL
      SELECT 368, TIMESTAMP '2017-02-13 00:23:07.063 UTC', 33.392820, -111.8262, 'weruio' UNION ALL
      SELECT 141, TIMESTAMP '2017-02-13 00:00:41 UTC', 33.287117, -111.84150, 'oqruqq'
    )
    SELECT 
      Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
    FROM (
      SELECT 
        Id, Geohash, DateTime, 
        TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
        FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
      FROM yourTable
    )
    WHERE Geohash = FirstGeohash
    GROUP BY Id, Geohash  
    

    The result is as follows:

    Id  Geohash     StartDateTime           TimeSpent    
    286  abcdef     2017-02-13 12:24:36 UTC       590    
    368  weruio     2017-02-13 00:23:07 UTC      null    
    141  oqruqq     2017-02-13 00:00:41 UTC      null    
    

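    Note: the 590 above is the sum, in seconds, of the time spent across three entries (506 + 3 + 81), not just the two stated in your question, which I assume is a typo on your side.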
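
Since the question asks about pandas, the same logic can be sketched there too. This is only a rough port of the query above, not part of the original answer: `groupby(...).shift(-1)` stands in for `LEAD`, and `transform("first")` stands in for `FIRST_VALUE`. The variable names (`next_seen`, `at_start`, `result`) are illustrative, and the timestamps are written with explicit milliseconds on every row so that all strings share one format.

```python
import pandas as pd

# Sample rows from the question (Latitude/Longitude omitted, they are not used).
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286, 368, 141],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000",
        "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063",
        "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000",
        "2017-02-13 00:23:07.063",
        "2017-02-13 00:00:41.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef", "hsffds", "weruio", "oqruqq"],
})

df = df.sort_values(["Id", "DateTime"])

# Counterpart of LEAD(DateTime) OVER (PARTITION BY Id ORDER BY DateTime):
# the timestamp of the same Id's next row, NaT for the last row of each Id.
next_seen = df.groupby("Id")["DateTime"].shift(-1)
df["TimeSpent"] = (next_seen - df["DateTime"]).dt.total_seconds()

# Counterpart of FIRST_VALUE(Geohash) OVER (PARTITION BY Id ORDER BY DateTime).
df["FirstGeohash"] = df.groupby("Id")["Geohash"].transform("first")

# Keep only rows located at the starting Geohash, then aggregate per Id.
at_start = df[df["Geohash"] == df["FirstGeohash"]]
result = at_start.groupby(["Id", "Geohash"], as_index=False).agg(
    StartDateTime=("DateTime", "min"),
    # min_count=1 leaves NaN for an Id with a single row, like NULL in the SQL result.
    TimeSpent=("TimeSpent", lambda s: s.sum(min_count=1)),
)
print(result)
```

On the sample data this yields about 590.063 seconds for id 286, because fractional seconds are kept, whereas the SQL result above reports whole seconds (590). Ids 368 and 141 have a single row each, so their `TimeSpent` stays NaN.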

    【Comments】:

      【Solution 2】:

      If I understand correctly, you want something like this:

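      def timedelta(df):
          # Span between the earliest and latest row of one (Id, Geohash) group
          df = df.sort_values(by='DateTime')
          return df.iloc[-1]['DateTime'] - df.iloc[0]['DateTime']
      
      # Assumes df['DateTime'] already holds datetime values, not strings
      df.groupby(['Id', 'Geohash']).apply(timedelta)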
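
One thing to keep in mind with this grouping is that the span for a (Id, Geohash) pair runs from its first to its last sighting, so it also covers any time spent elsewhere in between. The sketch below is not from the original answer; it uses the rows for id 286 to contrast that span with a hypothetical per-visit variant that labels each uninterrupted run of the same Geohash (the `Run` column is an illustrative name) and measures each run on its own.

```python
import pandas as pd

# Rows for id 286 from the question, with milliseconds written out on every row.
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000",
        "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063",
        "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef", "hsffds"],
}).sort_values(["Id", "DateTime"])

# First-to-last span per (Id, Geohash), equivalent to the groupby/apply above.
g = df.groupby(["Id", "Geohash"])["DateTime"]
span = g.max() - g.min()

# Hypothetical per-visit variant: a new run starts whenever the Geohash differs
# from the previous row of the same Id (the first row of an Id always starts one).
changed = df["Geohash"] != df.groupby("Id")["Geohash"].shift()
df["Run"] = changed.cumsum()

# Duration of each run, from its first to its last row, in seconds.
r = df.groupby(["Id", "Geohash", "Run"])["DateTime"]
per_run = (r.max() - r.min()).dt.total_seconds()

# Total per (Id, Geohash) across all of its runs.
visits = per_run.groupby(level=["Id", "Geohash"]).sum()

print(span)
print(visits)
```

For "abcdef" the first-to-last span is 512 seconds, which includes the stop at "cvzvvv", while the per-run total is about 506.063 seconds. A run with a single row counts as zero here; whether a visit should instead extend to the next row, as in the SQL answer, is a modelling choice.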
      

      【Comments】:
