【Posted】: 2021-09-18 12:55:39
【Question】:
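I am processing some raw data and want to count the instances of certain metrics (column 'Stat') within each chain (identified by the unique identifier 'c' recorded in column 'chain_id'). The counts are saved to a dictionary and then mapped to a new column (not shown below).
However, I would like to:
- Improve the speed of the loop; on 34k rows I have had to take it from an initial ~3 it/s to ~10 it/s.
- Improve the structure of the various try/except statements, noting that a chain will not always have e.g. 'Kick' or 'Mark' in its value_counts() output, so those must be 0.
I have searched SO for other approaches, but none of the existing answers fit. Please ignore the indentation of the for loop; it would not let me correct it.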
import pandas as pd
from tqdm.notebook import tqdm
s = ['Hitout', 'Kick', 'Disposal', 'Centre Clearance', 'Tackle', 'Hitout',
'Hitout To Advantage', 'Free Against', 'Contested Possession', 'Free For',
'Handball', 'Disposal', 'Effective Disposal', 'Stoppage Clearance',
'Uncontested Possession', 'Kick', 'Effective Kick', 'Disposal', 'Effective Disposal',
'Mark', 'Uncontested Possession', 'F 50 Mark', 'Mark On Lead', 'Kick', 'Disposal',
'Shot At Goal', 'Behind', 'Kick In', 'One Percenter', 'Kick', 'Effective Kick',
'Disposal', 'Effective Disposal', 'Rebound 50', 'Spoil', 'One Percenter']
x = ['Hitout', 'RI-1', 'RI-1', 'RI-1', 'RI-1', 'Hitout', 'Hitout', 'RI-7', 'RI-7',
'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7',
'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'RI-7', 'CA-27', 'CA-27',
'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27', 'CA-27']
df = pd.DataFrame({'chain_id':x,'Stat':s})
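# Assumed setup (not shown in the original post): the unique chain ids
# and one dictionary per metric, keyed by chain_id.
chains = df['chain_id'].unique()
chain_count, hb_count, ki_count, m_count, goal_count = {}, {}, {}, {}, {}
behind_count, cp_count, up_count, t_count, chain_time = {}, {}, {}, {}, {}
for c in tqdm(chains):
    if c == 'Hitout':
        # 'Hitout' is not a real chain, so every metric is 0.
        chain_count[c] = 0
        hb_count[c] = 0
        ki_count[c] = 0
        m_count[c] = 0
        goal_count[c] = 0
        behind_count[c] = 0
        cp_count[c] = 0
        up_count[c] = 0
        t_count[c] = 0
        chain_time[c] = 0
    else:
        # Frequency of each Stat within this chain; absent stats raise KeyError.
        temp = df[df['chain_id']==c]['Stat'].value_counts()
        try:
            chain_count[c] = temp['Disposal']
        except KeyError:
            chain_count[c] = 0
        try:
            ki_count[c] = temp['Kick']
        except KeyError:
            ki_count[c] = 0
        try:
            hb_count[c] = temp['Handball']
        except KeyError:
            hb_count[c] = 0
        try:
            m_count[c] = temp['Mark']
        except KeyError:
            m_count[c] = 0
        try:
            goal_count[c] = temp['Goal']
        except KeyError:
            goal_count[c] = 0
        try:
            behind_count[c] = temp['Behind']
        except KeyError:
            behind_count[c] = 0
        try:
            cp_count[c] = temp['Contested Possession']
        except KeyError:
            cp_count[c] = 0
        try:
            up_count[c] = temp['Uncontested Possession']
        except KeyError:
            up_count[c] = 0
        try:
            t_count[c] = temp['Tackle']
        except KeyError:
            t_count[c] = 0
        # time() is the asker's own helper; its definition is not shown.
        chain_time[c] = time(c)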
df['chain_length'] = df['chain_id'].map(chain_count)
df['chain_hb'] = df['chain_id'].map(hb_count)
df['chain_ki'] = df['chain_id'].map(ki_count)
df['chain_m'] = df['chain_id'].map(m_count)
df['chain_goal'] = df['chain_id'].map(goal_count)
df['chain_behind'] = df['chain_id'].map(behind_count)
df['chain_cp'] = df['chain_id'].map(cp_count)
df['chain_up'] = df['chain_id'].map(up_count)
df['chain_t'] = df['chain_id'].map(t_count)
df['chain_time'] = df['chain_id'].map(chain_time)
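One possible way to tidy the try/except ladder while keeping the loop: reindex the value_counts() result against the list of stats of interest, so that absent stats come back as 0. This is only a sketch on a made-up mini-sample; the `wanted` list and the variable names are assumptions for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical mini-sample; column names follow the question.
df = pd.DataFrame({
    'chain_id': ['RI-1', 'RI-1', 'RI-1', 'CA-27', 'CA-27'],
    'Stat': ['Kick', 'Disposal', 'Tackle', 'Kick', 'Disposal'],
})

# Stats of interest (assumed subset; the order is arbitrary).
wanted = ['Disposal', 'Kick', 'Handball', 'Mark', 'Tackle']

# value_counts() only lists stats that actually occur in the chain.
temp = df.loc[df['chain_id'] == 'RI-1', 'Stat'].value_counts()

# reindex() aligns to `wanted` and fills the absent stats with 0,
# so no try/except is needed.
counts = temp.reindex(wanted, fill_value=0)

ki = int(counts['Kick'])       # occurs once in chain RI-1
hb = int(counts['Handball'])   # does not occur, so filled with 0

# Series.get() is an alternative for a single lookup with a default.
m = int(temp.get('Mark', 0))
```

Inside the loop, each `try`/`except` pair would then reduce to a single lookup such as `ki_count[c] = counts['Kick']`.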
Edited: included an example, along with output showing how it currently works below.
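For the speed concern, one option worth trying is to drop the per-chain loop and count everything in a single pass with `pd.crosstab`, then join the counts back onto the rows. The sketch below runs on a made-up mini-sample; the `stat_cols` mapping is an assumption assembled from the column and stat names in the question, and `chain_time` is left out because the `time()` helper is not shown.

```python
import pandas as pd

# Hypothetical mini-sample; column names follow the question.
df = pd.DataFrame({
    'chain_id': ['Hitout', 'RI-1', 'RI-1', 'RI-1', 'RI-7', 'RI-7', 'RI-7'],
    'Stat': ['Hitout', 'Kick', 'Disposal', 'Tackle',
             'Handball', 'Disposal', 'Disposal'],
})

# Output column -> stat name, copied from the loop in the question.
stat_cols = {
    'chain_length': 'Disposal',
    'chain_hb': 'Handball',
    'chain_ki': 'Kick',
    'chain_m': 'Mark',
    'chain_goal': 'Goal',
    'chain_behind': 'Behind',
    'chain_cp': 'Contested Possession',
    'chain_up': 'Uncontested Possession',
    'chain_t': 'Tackle',
}

# One pass over the data: rows are chains, columns are stats, cells are counts.
counts = pd.crosstab(df['chain_id'], df['Stat'])

# Keep only the stats of interest; stats that never occur become all-zero columns.
counts = counts.reindex(columns=list(stat_cols.values()), fill_value=0)

# The question sets every count to 0 for the 'Hitout' pseudo-chain.
if 'Hitout' in counts.index:
    counts.loc['Hitout'] = 0

# Rename to the output column names and broadcast back to rows via chain_id.
counts.columns = list(stat_cols.keys())
df = df.join(counts, on='chain_id')
```

Because the counting happens once for the whole frame instead of filtering `df` once per chain, this should scale better than the loop, though the actual gain on 34k rows would need to be measured.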
【Discussion】:
-
Can you give us a sample dataframe and the desired output?
-
@wwnde I have updated the above to include a data sample and what each dictionary looks like when mapped.
-
In simple terms, what is the relationship between chain_id, Stat, chain_length and chain_b?
-
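@wwnde 'chain_id' represents a unique chain of possession, and 'Stat' is a statistic captured within that chain, e.g. kick, handball, score. 'chain_length' is the count/frequency of 'disposals' in the 'Stat' column for each unique chain_id, and 'chain_hb' is the frequency of 'handball' in the 'Stat' column for each unique chain id. I hope this helps.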
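The relationship described in the comment above can be written down directly: flag the rows whose 'Stat' matches, then sum the flags within each chain_id. A minimal sketch on made-up rows, assuming the stat labels are exactly 'Disposal' and 'Handball' as in the sample data:

```python
import pandas as pd

# Hypothetical rows; column names follow the question.
df = pd.DataFrame({
    'chain_id': ['RI-1', 'RI-1', 'RI-7', 'RI-7', 'RI-7'],
    'Stat': ['Kick', 'Disposal', 'Handball', 'Disposal', 'Disposal'],
})

# True where the row is a disposal; summing booleans per chain counts them.
is_disposal = df['Stat'].eq('Disposal')
df['chain_length'] = is_disposal.groupby(df['chain_id']).transform('sum')

# Same idea for handballs.
is_handball = df['Stat'].eq('Handball')
df['chain_hb'] = is_handball.groupby(df['chain_id']).transform('sum')
```

`transform('sum')` returns one value per original row, so the result can be assigned straight to a new column without an intermediate dictionary and `map`.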
Tags: python-3.x pandas performance dictionary try-catch