【Question title】: Dict in loop for pd.DataFrame
【Posted】: 2016-12-29 10:00:19
【Question description】:

My dataset has many columns, and I need to change the values in some of them. I currently do it as follows:

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5})

Select the columns:

df1 = df[['one', 'two']]

Define the dictionary:

map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}

Then loop over the rows:

df2=[]
for i in df1.values:
    # renamed from `np` so the numpy import is not shadowed
    mapped_row = [ map[x] for x in i]
    df2.append(mapped_row)

Then I overwrite the columns:

df['one'] = [row[0] for row in df2]
df['two'] = [row[1] for row in df2]

It works, but it is a long way round. How can I make it shorter?

【Question comments】:

Tags: python pandas for-loop dictionary


【Solution 1】:

You can loop over the columns and use Series.map() on each one:

cols = ['one', 'two']
mapd = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}

for col in cols:
    df[col] = df[col].map(mapd).fillna(df[col])


df
Out: 
  one three two
0   d     a   b
1   c     d   a
2   d     a   b
3   c     d   a
4   d     a   b
5   c     d   a
6   d     a   b
7   c     d   a
8   d     a   b
9   c     d   a
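The `.fillna(df[col])` part matters only when the dictionary does not cover every value in a column. A minimal sketch of that case follows; the Series `s` and the dictionary `partial` are made up for illustration and are not part of the question's data:

```python
import pandas as pd

# Hypothetical data: 'z' deliberately has no entry in the mapping.
s = pd.Series(['a', 'b', 'z'])
partial = {'a': 'd', 'b': 'c'}

# Series.map() turns values without a matching key into NaN ...
mapped = s.map(partial)

# ... so fillna() with the original Series restores them.
kept = s.map(partial).fillna(s)

print(mapped.isna().tolist())
print(kept.tolist())
```

With the full four-key dictionary from the question every value has a key, so `fillna` changes nothing there; it is only a safety net.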

Timings:

df = pd.DataFrame({'one':['a' , 'b']*5000000, 
                   'two':['c' , 'd']*5000000, 
                   'three':['a' , 'd']*5000000})

%%timeit
for col in cols:
    df[col].map(mapd).fillna(df[col])
1 loop, best of 3: 1.71 s per loop

%%timeit
for col in cols:
    colSet = set(df[col].values)
    colMap = {k: v for k, v in mapd.items() if k in colSet}
    df.replace(to_replace={col: colMap})
1 loop, best of 3: 3.35 s per loop


%timeit df[cols].stack().map(mapd).unstack()
1 loop, best of 3: 9.18 s per loop
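The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch, a similar comparison can be made in a plain script with the standard `timeit` module. The frame `df_small`, the size `n`, and the helper names `with_map` and `with_stack` are assumptions made here, and the absolute numbers will differ from those quoted above:

```python
import timeit
import pandas as pd

n = 10000  # much smaller than the 5,000,000 repeats used above
df_small = pd.DataFrame({'one': ['a', 'b'] * n,
                         'two': ['c', 'd'] * n,
                         'three': ['a', 'd'] * n})
cols = ['one', 'two']
mapd = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}

def with_map():
    # per-column Series.map(), as in the first timing
    for col in cols:
        df_small[col].map(mapd).fillna(df_small[col])

def with_stack():
    # stack / map / unstack, as in the last timing
    df_small[cols].stack().map(mapd).unstack()

# best of 3 runs, one call per run
t_map = min(timeit.repeat(with_map, number=1, repeat=3))
t_stack = min(timeit.repeat(with_stack, number=1, repeat=3))
print("map:   %.4f s" % t_map)
print("stack: %.4f s" % t_stack)
```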

【Comments】:

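    【Solution 2】:

    Passing the whole map to a column that only contains the values 'a' and 'b' is not efficient. First check which values actually occur in the df column, then map only those, like this: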

    >>> cols = ['one', 'two'];
    >>> map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'};
    
    >>> for col in cols:
    ...  colSet = set(df[col].values);
    ...  colMap = {k:v for k,v in map.items() if k in colSet};
    ...  df.replace(to_replace={col:colMap},inplace=True);#not efficient like rly
    ...  
    >>> df
      one three two
    0   d     a   b
    1   c     d   a
    2   d     a   b
    3   c     d   a
    4   d     a   b
    5   c     d   a
    6   d     a   b
    7   c     d   a
    8   d     a   b
    9   c     d   a
    >>>
    #OR
    In [12]: %%timeit
    ...: for col in cols:
    ...:  colSet = set(df[col].values);
    ...:  colMap = {k:v for k,v in map.items() if k in colSet};
    ...:  df[col].map(colMap)
    ...:
    ...:
    1 loop, best of 3: 1.93 s per loop 
    #OR WHEN INPLACE
    In [8]: %%timeit
       ...: for col in cols:
       ...:  colSet = set(df[col].values);
       ...:  colMap = {k:v for k,v in map.items() if k in colSet};
       ...:  df[col]=df[col].map(colMap)
       ...:
       ...:
    1 loop, best of 3: 2.18 s per loop
    

    This is also possible:

    df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5})
    map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}
    cols = ['one','two']
    
    def func(s):
        if s.name in cols:
            s=s.map(map)
        return s
    
    print(df.apply(func))
    

    Also watch out for overlapping keys (i.e. when you want the changes applied in parallel, say a to b and b to c, rather than chained like a->b->c)...

    >>> cols = ['one', 'two'];
    >>> map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'};
    >>> mapCols = {k:map for k in cols};
    >>> df.replace(to_replace=mapCols,inplace=True);
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "Q:\Miniconda3\envs\py27a\lib\site-packages\pandas\core\generic.py", line 3352, in replace
        raise ValueError("Replacement not allowed with "
    ValueError: Replacement not allowed with overlapping keys and values
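To illustrate the parallel-versus-chained distinction, here is a small sketch. `Series.map()` looks each value up exactly once, so a swap dictionary is safe; the hand-written loop below is a hypothetical stand-in for a chained replacement and is not how `DataFrame.replace` is implemented:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'])
swap = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}

# Parallel: every value is looked up once in the dictionary.
parallel = s.map(swap)
print(parallel.tolist())  # ['d', 'c', 'b', 'a']

# Chained (illustrative only): apply one substitution after another,
# so values replaced early are replaced again by later steps.
chained = s.copy()
for old, new in swap.items():
    chained = chained.where(chained != old, new)
print(chained.tolist())
```

The chained result no longer matches the parallel one, which is the ambiguity the `ValueError` above guards against.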
    

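    【Comments】:

    • This one is twice as slow as the "inefficient" one.
    • ... I did not double-check (it was only a guess; the reasoning should hold, but maybe my implementation is just not that fast ;/). Is it df.replace that is inefficient?
    • replace is generally slower (even when it is applied to the whole DataFrame at once, compared with looping over the columns with map), because map is more specific and limited. I don't think the difference comes from your implementation. What makes you think the actual implementation of Series.map() wastes time on keys that are not present?
    【Solution 3】: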
    df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5})
    m = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}
    
    cols = ['one', 'two']
    df[cols] = df[cols].stack().map(m).unstack()
    df
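A sketch of what each step of the stack / map / unstack chain produces, using a shortened four-row version of the frame (the intermediate names `stacked`, `mapped` and `wide` are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'one': ['a', 'b'] * 2,
                   'two': ['c', 'd'] * 2,
                   'three': ['a', 'd'] * 2})
m = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}
cols = ['one', 'two']

# stack(): the selected columns become one long Series
# indexed by (row label, column name)
stacked = df[cols].stack()

# map(): a single dictionary lookup pass over all values
mapped = stacked.map(m)

# unstack(): back to one column per original column
wide = mapped.unstack()

df[cols] = wide
print(df)
```

Only one `map` call is needed this way, at the cost of reshaping the data twice, which is consistent with it being the slowest variant in the timings of Solution 1.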
    

    【Comments】:
