Python: count the number of people by division (department) in a building in half-hour slots

【Question title】: python: count the number of people by division (department) that are in a building in half-hour slots
【Posted】: 2020-04-07 13:50:53
【Question description】:

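I have the following dataset, with one row for each time an employee clocks in and out of the company premises.

I want to create a matrix (summary) that shows how many people from each division are in the building in each half-hour slot, as shown below:

I have written code that counts how many people are in the building in each half-hour slot, but I cannot work out how to count how many people from each division are in the building in those slots. I have tried many different techniques without success. To count the total number of people in the building I wrote the following code: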

import pandas as pd
from pandas import Timestamp # import pandas date time 
# import a few rows of data.   our actual real data is much larger
sample_data = pd.DataFrame({'direction_in': {37196: Timestamp('2019-09-26 16:11:11'), 2364: Timestamp('2019-09-03 13:37:48'), 36266: Timestamp('2018-04-05 06:06:14'), 27159: Timestamp('2019-09-04 07:31:22'), 48518: Timestamp('2018-09-05 05:44:46')}, 'emp': {37196: 152.0, 2364: 10.0, 36266: 150.0, 27159: 115.0, 48518: 187.0}, 'direction_out': {37196: Timestamp('2019-09-26 16:32:20'), 2364: Timestamp('2019-09-03 22:21:04'), 36266: Timestamp('2018-04-05 18:15:21'), 27159: Timestamp('2019-09-04 15:58:02'), 48518: Timestamp('2018-09-05 15:51:51')}, 'time_difference': {37196: '0 days 00:21:09', 2364: '0 days 08:43:16', 36266: '0 days 12:09:07', 27159: '0 days 08:26:40', 48518: '0 days 10:07:05'}, 'complete_record': {37196: 'yes', 2364: 'yes', 36266: 'yes', 27159: 'yes', 48518: 'yes'}, 'terminal': {37196: 1.0, 2364: 1.0, 36266: 1.0, 27159: 1.0, 48518: 3.0}, 'job_title': {37196: 59.0, 2364: 14.0, 36266: 83.0, 27159: 82.0, 48518: 4.0}, 'division': {37196: 2.0, 2364: 1.0, 36266: 2.0, 27159: 1.0, 48518: 4.0}})

# Create a new dataframe for the summarised data.
# The dataframe will contain 30 minute intervals from the first date to the last date in the above data

department_clocked_in_matrix = pd.DataFrame() # Creates new dataframe 
department_clocked_in_matrix["date_time_from"] = pd.date_range(start="2018-02-12 00:00:00",end="2019-12-09 23:30:00",freq='30min') # Create from column 
department_clocked_in_matrix['date_time_to'] = (department_clocked_in_matrix['date_time_from'].shift(-1)).fillna(0) # Creates to_column, 30 minutes distance from the from column

# chop off the last value as it shows a zero value 
department_clocked_in_matrix = department_clocked_in_matrix.iloc[0:-1]
department_clocked_in_matrix


def sum_function(temp_df): 
    temp_sample = sample_data.loc[(temp_df.date_time_from >= sample_data.direction_in ) & (temp_df.date_time_to <= sample_data.direction_out),["division"] ].count()
    return temp_sample 

department_clocked_in_matrix2 = department_clocked_in_matrix.apply(sum_function, axis=1)  # axis=1 applies the function to each row (each 30 minute slot)
department_clocked_in_matrix["count"] = department_clocked_in_matrix2["division"]

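【Question comments】:

Tags: python pandas pandas-groupby

【Solution 1】:

I would first build, from each row of the original dataframe (sample_data), a dataframe with one row per 30-minute slot, keeping the division, and then concatenate all of those dataframes. Counting the rows per slot and division and pivoting the result gives the number of people present for each division.

From there, merging it into department_clocked_in_matrix and adding a total column is enough to get the expected data.

The code could be: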

# Expand every clock-in/clock-out row into one row per 30-minute slot it touches
# ('30min' replaces the deprecated '30T' alias)
tmp = pd.concat([
    pd.DataFrame({'count': 1, 'division': row.division},
                 index=pd.date_range(row['direction_in'].floor('30min'),
                                     row['direction_out'].ceil('30min'),
                                     freq='30min', name='date_time_from'))
    for _, row in sample_data.iterrows()], sort=True).reset_index()

# Sum the rows per slot and division, then pivot the divisions into columns
resul = tmp.groupby(['date_time_from', 'division']).sum().unstack()

# Flatten the ('count', division) column labels to '1', '2', ... and keep divisions 1 to 8
resul.columns = [str(int(i[1])) for i in resul.columns]
resul = resul.reindex(columns=[str(i) for i in range(1, 9)])

# Attach the counts to the full slot table; slots with nobody present become 0
resul = department_clocked_in_matrix.merge(resul, how='left', on='date_time_from'
                                           ).fillna(0)
resul = resul.set_index(['date_time_from', 'date_time_to']).astype('int')
resul = resul.assign(total=resul.sum(axis=1)).reset_index()

It displays as:

           date_time_from        date_time_to  1  2  3  4  5  6  7  8  total
0     2018-02-12 00:00:00 2018-02-12 00:30:00  0  0  0  0  0  0  0  0      0
1     2018-02-12 00:30:00 2018-02-12 01:00:00  0  0  0  0  0  0  0  0      0
2     2018-02-12 01:00:00 2018-02-12 01:30:00  0  0  0  0  0  0  0  0      0
3     2018-02-12 01:30:00 2018-02-12 02:00:00  0  0  0  0  0  0  0  0      0
4     2018-02-12 02:00:00 2018-02-12 02:30:00  0  0  0  0  0  0  0  0      0
5     2018-02-12 02:30:00 2018-02-12 03:00:00  0  0  0  0  0  0  0  0      0
6     2018-02-12 03:00:00 2018-02-12 03:30:00  0  0  0  0  0  0  0  0      0
7     2018-02-12 03:30:00 2018-02-12 04:00:00  0  0  0  0  0  0  0  0      0
8     2018-02-12 04:00:00 2018-02-12 04:30:00  0  0  0  0  0  0  0  0      0
9     2018-02-12 04:30:00 2018-02-12 05:00:00  0  0  0  0  0  0  0  0      0
...                   ...                 ... .. .. .. .. .. .. .. ..    ...
31957 2019-12-09 18:30:00 2019-12-09 19:00:00  0  0  0  0  0  0  0  0      0
31958 2019-12-09 19:00:00 2019-12-09 19:30:00  0  0  0  0  0  0  0  0      0
31959 2019-12-09 19:30:00 2019-12-09 20:00:00  0  0  0  0  0  0  0  0      0
31960 2019-12-09 20:00:00 2019-12-09 20:30:00  0  0  0  0  0  0  0  0      0
31961 2019-12-09 20:30:00 2019-12-09 21:00:00  0  0  0  0  0  0  0  0      0
31962 2019-12-09 21:00:00 2019-12-09 21:30:00  0  0  0  0  0  0  0  0      0
31963 2019-12-09 21:30:00 2019-12-09 22:00:00  0  0  0  0  0  0  0  0      0
31964 2019-12-09 22:00:00 2019-12-09 22:30:00  0  0  0  0  0  0  0  0      0
31965 2019-12-09 22:30:00 2019-12-09 23:00:00  0  0  0  0  0  0  0  0      0
31966 2019-12-09 23:00:00 2019-12-09 23:30:00  0  0  0  0  0  0  0  0      0

[31967 rows x 11 columns]
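
The rows displayed above are all zero, so the counting and pivoting step is easier to follow on a tiny table. The following is only a sketch: the `long` table is hand-written, hypothetical data standing in for the expanded frame (`tmp`), and it uses `groupby(...).size()` rather than summing a `count` column, which should give the same numbers.

```python
import pandas as pd

# Hypothetical expanded table: one row per (slot, person present in that slot)
long = pd.DataFrame({
    "date_time_from": pd.to_datetime([
        "2019-09-03 08:00", "2019-09-03 08:30",
        "2019-09-03 08:30", "2019-09-03 09:00",
    ]),
    "division": [1.0, 1.0, 2.0, 2.0],
})

# Count rows per slot and division, then pivot divisions into columns
counts = long.groupby(["date_time_from", "division"]).size().unstack(fill_value=0)

# Use '1', '2', ... as column labels and add divisions that never appear (here: '3')
counts.columns = [str(int(c)) for c in counts.columns]
counts = counts.reindex(columns=[str(i) for i in range(1, 4)], fill_value=0)

# Row-wise total across all divisions
counts["total"] = counts.sum(axis=1)
print(counts)
```

Reindexing the columns is what keeps a column for a division even when nobody from it appears in the data.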

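【Comments】:

Genius. Thank you :-) At first I did not understand your answer. I realised my logic was all wrong, i.e. the line "temp_sample = sample_data.loc[(temp_df.date_time_from >= sample_data.direction_in ) & (temp_df.date_time_to <= sample_data.direction_out),["division"] ].count()" does not work. Once I had worked that out I still did not understand the code, but after looking at it for a while and pulling it apart I understand it now. I had to rewrite it with plain Python functions and less pandas to follow it, and then I saw why you did it this way and how quickly it can be written in pure pandas.
How can I turn something like this into more pandas-friendly code: index = pd.date_range("2020/04/13 08:00","2020/04/13 17:00",freq='30T', name='date_time_from') if min(index) != max(index): index = index.drop(max(index)) # drop index because person was not in the building for the final half hour else: index = index # the employee clocked in and out within half an hour so only value.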
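
One possible way to express the logic from the comment above is a small helper that slices off the last value instead of comparing `min` and `max`. This is a sketch, not part of the original answer: `slot_starts` is a hypothetical name, and it assumes the clock-in and clock-out values are `pd.Timestamp` objects.

```python
import pandas as pd

def slot_starts(t_in, t_out, freq="30min"):
    """Return the start times of the slots during which the person was present."""
    idx = pd.date_range(t_in.floor(freq), t_out.ceil(freq), freq=freq,
                        name="date_time_from")
    # idx[-1] is the end boundary of the last occupied slot, not a slot start,
    # so drop it unless it is the only value (in and out on the same boundary).
    return idx[:-1] if len(idx) > 1 else idx

full_day = slot_starts(pd.Timestamp("2020-04-13 08:00"), pd.Timestamp("2020-04-13 17:00"))
short_visit = slot_starts(pd.Timestamp("2020-04-13 08:10"), pd.Timestamp("2020-04-13 08:20"))
print(full_day[[0, -1]])
print(short_visit)
```

Used in place of the bare `pd.date_range(...)` call when building the per-row frames, it would stop a person from being counted in the slot that starts after they clocked out.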

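【Solution 2】:

It is not too pretty, but it should achieve the desired result.
I took your approach of filtering sample_data by each row (30-minute interval) of department_clocked_in_matrix and additionally used groupby().
Output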

Out[5]: 
          date_time_from        date_time_to    1    2    3    4    5
120  2019-09-03 12:00:00 2019-09-03 12:30:00  0.0  0.0  0.0  0.0  0.0
121  2019-09-03 12:30:00 2019-09-03 13:00:00  0.0  0.0  0.0  0.0  0.0
122  2019-09-03 13:00:00 2019-09-03 13:30:00  1.0  0.0  1.0  0.0  0.0
123  2019-09-03 13:30:00 2019-09-03 14:00:00  0.0  1.0  0.0  0.0  0.0
124  2019-09-03 14:00:00 2019-09-03 14:30:00  0.0  0.0  0.0  0.0  0.0

Code

from pandas import Timestamp # import pandas date time
import pandas as pd
import numpy as np

sample_data = pd.DataFrame({'direction_in': {37196: Timestamp('2019-09-03 13:11:11'), 2364: Timestamp('2019-09-03 13:37:48'), 36266: Timestamp('2019-09-03 13:06:14'), 27159: Timestamp('2019-09-04 07:31:22'), 48518: Timestamp('2018-09-05 05:44:46')}, 'emp': {37196: 152.0, 2364: 10.0, 36266: 150.0, 27159: 115.0, 48518: 187.0}, 'direction_out': {37196: Timestamp('2019-09-04 07:31:22'), 2364: Timestamp('2019-09-03 22:21:04'), 36266: Timestamp('2019-09-04 07:31:22'), 27159: Timestamp('2019-09-04 07:31:22'), 48518: Timestamp('2019-09-04 07:31:22')}, 'time_difference': {37196: '0 days 00:21:09', 2364: '0 days 08:43:16', 36266: '0 days 12:09:07', 27159: '0 days 08:26:40', 48518: '0 days 10:07:05'}, 'complete_record': {37196: 'yes', 2364: 'yes', 36266: 'yes', 27159: 'yes', 48518: 'yes'}, 'terminal': {37196: 1.0, 2364: 1.0, 36266: 1.0, 27159: 1.0, 48518: 3.0}, 'job_title': {37196: 59.0, 2364: 14.0, 36266: 83.0, 27159: 82.0, 48518: 4.0}, 'division': {37196: 1.0, 2364: 2.0, 36266: 3.0, 27159: 4.0, 48518: 5.0}})

department_clocked_in_matrix = pd.DataFrame() # Creates new dataframe
department_clocked_in_matrix["date_time_from"] = pd.date_range(start="2019-09-01 00:00:00",end="2019-09-30 23:30:00",freq='30min') # Create from column
department_clocked_in_matrix['date_time_to'] = (department_clocked_in_matrix['date_time_from'].shift(-1)).fillna(0) # Creates to_column, 30 minutes distance from the from column

# chop off the last value as it shows a zero value
department_clocked_in_matrix = department_clocked_in_matrix.iloc[0:-1]

possible_division = np.arange(1,6)

# for index, row in department_clocked_in_matrix[27290:27310].iterrows():
for index, row in department_clocked_in_matrix.iterrows():
    # rows whose clock-in time falls inside this 30-minute slot
    tmp_df = sample_data[(sample_data['direction_in'] >= row['date_time_from']) & (sample_data['direction_in'] <= row['date_time_to'])]
    count = tmp_df.groupby('division').count().iloc[:,1]
    if not count.empty:
        for div, number in count.items():  # Series.iteritems() was removed in pandas 2.0
            department_clocked_in_matrix.loc[index, str(int(div))] = number

department_clocked_in_matrix.fillna(0,inplace=True)

Note: I changed the sample data slightly to make the results easier to verify.
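
The filter in the loop above selects rows whose direction_in falls inside the slot, so it counts clock-ins per slot. If presence in the building is what is wanted, one option is to test whether each visit overlaps each slot. The sketch below does that with NumPy broadcasting. The `visits` and `slots` frames are made-up data, and the overlap rule (clock-in before the slot ends, clock-out after the slot starts) is an assumption about what "in the building" should mean.

```python
import pandas as pd

# Hypothetical visits: two people from division 1, one from division 2
visits = pd.DataFrame({
    "direction_in": pd.to_datetime(["2019-09-03 08:10", "2019-09-03 08:40", "2019-09-03 09:05"]),
    "direction_out": pd.to_datetime(["2019-09-03 09:20", "2019-09-03 09:10", "2019-09-03 09:25"]),
    "division": [1, 1, 2],
})

# Three 30-minute slots: 08:00, 08:30 and 09:00
slots = pd.DataFrame({
    "date_time_from": pd.date_range("2019-09-03 08:00", "2019-09-03 09:00", freq="30min"),
})
slots["date_time_to"] = slots["date_time_from"] + pd.Timedelta(minutes=30)

# Compare every slot (rows) with every visit (columns) in one step
t_in = visits["direction_in"].to_numpy()
t_out = visits["direction_out"].to_numpy()
s_from = slots["date_time_from"].to_numpy()[:, None]
s_to = slots["date_time_to"].to_numpy()[:, None]
present = (t_in < s_to) & (t_out > s_from)  # shape: (n_slots, n_visits)

# One column per division, plus a total
for div in sorted(visits["division"].unique()):
    in_div = (visits["division"] == div).to_numpy()
    slots[str(div)] = present[:, in_div].sum(axis=1)
slots["total"] = present.sum(axis=1)
print(slots)
```

The boolean matrix has one cell per slot and visit pair, so for a long date range and a large dataset it may need to be built in chunks.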

【Comments】: