Group by time, and then count unique entries only if these existed in a list [Pandas]: Answers

【Question title】: Group by time, and then count unique entries only if these existed in a list [Pandas]
【Posted】: 2018-04-25 06:38:57
【Question】:

Consider the following pandas DataFrame "df" and Python list "my_list".

df =

timestamp  address    type
1           1          A
2           9          B
3           3          A
4           6          B
5           6          B
6           2          B
7           3          A
8           2          B
9           1          B
10          3          A
11          3          A
12          3          A

my_list =

[1, 2, 3]
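
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample data above. It assumes plain integer timestamps (the real data is said to be in timestamp format), and the dtypes are my guess, not something the question states.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: timestamps are plain integer seconds for illustration.
df = pd.DataFrame({
    "timestamp": list(range(1, 13)),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": ["A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "A", "A"],
})
my_list = [1, 2, 3]

print(df)
```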

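Now I want to group the DataFrame by the timestamp column in 3-second bins and, within each bin, count the unique addresses for each "type", considering only addresses that exist in "my_list".

The expected output should look like this: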

timestamp   A    B    
1           2    0 #One "B" ignored, because address=9 is not in my_list
4           0    1 #Two "B" ignored, because address=6 is not in my_list
7           1    2 #Two "B" with unique addresses, and one "A"
10          1    0 #Three rows with type="A", but the addresses are the same

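Note that the timestamp values are originally in timestamp format, so df.groupby and pd.TimeGrouper can be used to group the rows into 3-second bins.

Only Pandas (Python) based answers are appreciated.

Apologies for any confusion. I tried to keep it as simple as possible.

-- Khan

【Question comments】:

Tags: python pandas dataframe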

【Solution 1】:

Use pd.get_dummies

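# Label each row with the first timestamp of its 3-second bin: 1, 4, 7, 10, ...
grps = df.timestamp.sub(1).floordiv(3).mul(3).add(1)
# Flag repeated (address, type) pairs within the same bin
dups = df[['address', 'type']].assign(grps=grps).duplicated().values
# Keep only rows whose address is in my_list
inmy = df.address.isin(my_list).values

# One-hot encode type and sum per bin; groupby(level=0).sum() is used because sum(level=0) was removed in pandas 2.0
pd.get_dummies(df.set_index(grps)[inmy & ~dups].type).groupby(level=0).sum().reset_index()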

   timestamp  A  B
0          1  2  0
1          4  0  1
2          7  1  2
3         10  1  0

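【Comments】:

Sir, I need an explanation of this .sub(1).floordiv(3).mul(3).add(1) :)
I subtract 1 to get zero-based seconds. Then I floor-divide by 3 to get groups of three. I multiply by 3 to scale back up, which does not change the integrity of the grouping. I add 1 to switch back to one-based seconds.
So it is equivalent to (((df['timestamp'] - 1) // 3) * 3) + 1, right? Now I get it. Great as always, sir.
Thanks @piRSquared for your kind help.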
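
The bin arithmetic discussed in these comments can be checked on its own. This is a small sketch assuming the integer timestamps 1 to 12 from the question; the variable names are only for illustration.

```python
import pandas as pd

# Assumption: integer timestamps 1..12, as in the sample data.
ts = pd.Series(range(1, 13), name="timestamp")

# Method-chain form from the answer: shift to zero-based, floor-divide
# into groups of three, scale back up, shift back to one-based.
bins = ts.sub(1).floordiv(3).mul(3).add(1)

# Operator form from the comment; it should give identical labels.
same = ((ts - 1) // 3) * 3 + 1

print(bins.tolist())
print(bins.equals(same))
```

Each row ends up labelled with the first timestamp of its bin, so rows 1 to 3 share label 1, rows 4 to 6 share label 4, and so on.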

【Solution 2】:

Use:

#convert index to triples
df.index = df.index // 3
#filter rows by condition
df1 = df[df['address'].isin(my_list)]
#get unique numbers and reshape
df1 = df1['address'].groupby([df1.index, df1['type']]).nunique().unstack(fill_value=0)
#add timestamps
df1.index = df['timestamp'].groupby(df.index).first()
print (df1)
type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0

Setup:

print (df)
    timestamp  address type
0           1        1    A
1           2        9    B
2           3        3    A
3           4        6    B
4           5        6    B
5           6        2    B
6           7        3    A
7           8        2    B
8           9        1    B
9          10        3    A
10         11        3    A
11         12        3    A

The solution is simpler with datetimes:

#sample datetimes 
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
                   origin=pd.Timestamp('2017-01-01'))

print (df)
    timestamp  address type
0  2017-01-02        1    A
1  2017-01-03        9    B
2  2017-01-04        3    A
3  2017-01-05        6    B
4  2017-01-06        6    B
5  2017-01-07        2    B
6  2017-01-08        3    A
7  2017-01-09        2    B
8  2017-01-10        1    B
9  2017-01-11        3    A
10 2017-01-12        3    A
11 2017-01-13        3    A

df1 = df[df['address'].isin(my_list)]
df1 = (df1.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
          .nunique()
          .unstack(fill_value=0) )
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0

And a one-line solution:

df1 = (df.query("address in @my_list")
         .groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
         .nunique()
         .unstack(fill_value=0))
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0

【Comments】:

Thanks @jezrael ... I believe this is exactly what I wanted. One small issue to sort out, though. I have a float64 datetime index, so I get the following error: "TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Float64Index'". If I convert the index from Float64Index to a normal DatetimeIndex, I lose precision.
What is the problem?
> data.index returns ('1970-01-01 00:09:59.998773', '1970-01-01 00:09:59.998786', '1970-01-01 00:09:59.999896999'], dtype='datetime64[ns]', name='Timestamp', length=816578, freq=None)
Are you using the parameter key='timestamp'?
If you use a DatetimeIndex, just change pd.Grouper(freq='3D', key='timestamp') to pd.Grouper(freq='3D').
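
To illustrate that last comment, here is a rough sketch of grouping on a DatetimeIndex without the key argument. It assumes the sample data from the question with the timestamps converted to seconds since the epoch, and it uses freq='3s' with origin='start' (available in pandas 1.1+) so that the bins begin at the first timestamp. Both of those choices are my assumptions, not part of the answer.

```python
import pandas as pd

# Assumption: the question's sample data, indexed by a DatetimeIndex in seconds.
idx = pd.to_datetime(list(range(1, 13)), unit="s")
idx.name = "timestamp"
df = pd.DataFrame({
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": ["A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "A", "A"],
}, index=idx)
my_list = [1, 2, 3]

# No key= here: pd.Grouper bins on the DatetimeIndex itself.
# origin="start" anchors the 3-second bins at the first timestamp.
out = (df[df["address"].isin(my_list)]
         .groupby([pd.Grouper(freq="3s", origin="start"), "type"])["address"]
         .nunique()
         .unstack(fill_value=0))
print(out)
```

Without origin='start' the bins would be aligned to midnight instead of to the first row, so the groups would be shifted by one second for this data.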

【Solution 3】:

Here is one approach: create a reference column and use pivot_table, i.e.

import numpy as np

# Number the rows within each (timestamp % 3) group; this gives the index of each 3-row bin
df['temp'] = df.groupby([df['timestamp']%3]).cumcount()

# Replace the addresses absent from my_list with NaN
df['add'] = df['address'].where(df['address'].isin(my_list),np.nan)

# Build the index from the timestamps whose mod value is 1 (the first timestamp of each bin)
idx = df['timestamp'][df['timestamp']%3==1]

# Pivot table counting the unique values of the newly created column; fill NaN with 0
ndf = df.pivot_table('add','type','temp',aggfunc='nunique',fill_value=0).T.set_index(idx)

Output:

type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0

【Comments】: