【Question title】: Python - Speed up converting a categorical variable to its numerical index
【Posted】: 2016-10-06 23:06:38
【Question description】:

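I need to convert a column of categorical variables in a Pandas data frame into numerical values, where each value is the index of the entry in the array of unique categorical values in that column (long story!). Here is a code snippet that does this: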

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

It converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

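Just as required. My problem is that when I run similar code on big data files, my dumb little for loop (the only way I could think of to do this) is as slow as molasses. Does anyone have ideas on how to do this more efficiently? Thanks in advance for any thoughts.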

【Question comments】:

    Tags: python performance numpy pandas dataframe


    【Solution 1】:

    Use factorize:

    df['col'] = pd.factorize(df.col)[0]
    print (df)
       col
    0    0
    1    1
    2    0
    3    0
    4    1
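As a small sketch of what factorize returns (the column name `col_idx` and the `None` entry are assumptions added for illustration): it gives back both the integer codes and the array of unique labels, in order of first appearance. Missing values receive the code -1 by default and are left out of the uniques. If the float output shown in the question is needed, the codes can be cast afterwards.

```python
import pandas as pd

# Sample data as in the question, plus one missing value for illustration
df = pd.DataFrame({'col': ["baked", "beans", "baked", None, "beans"]})

# factorize returns (codes, uniques); uniques are in order of first appearance
codes, uniques = pd.factorize(df['col'])

print(codes.tolist())   # [0, 1, 0, -1, 1]  (-1 marks the missing value)
print(list(uniques))    # ['baked', 'beans']

# Cast to float to match the 0.0 / 1.0 output from the question
df['col_idx'] = codes.astype(float)
```

Keeping `uniques` around makes it possible to map the codes back to labels later.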
    

    Docs

    Edit:

    As Jeff mentioned in the comments, it is better to convert the column to categorical, mainly because of lower memory usage:

    df['col'] = df['col'].astype("category")
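Converting to categorical keeps the labels in the column, so the numerical index asked for in the question has to be read out separately. A minimal sketch, assuming the same sample data, uses the `.cat.codes` and `.cat.categories` accessors:

```python
import pandas as pd

df = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})
df['col'] = df['col'].astype("category")

# Integer index of each row's category
codes = df['col'].cat.codes
# Lookup table: categories[i] is the label for code i
categories = df['col'].cat.categories

print(codes.tolist())      # [0, 1, 0, 0, 1]
print(list(categories))    # ['baked', 'beans']
```

The column itself still displays the original strings; only the underlying storage is integer codes plus the category table.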
    

    Timings

    Interestingly, on a large df pandas is faster than numpy. I can't believe it.

    len(df)=500k:

    In [29]: %timeit (a(df1))
    100 loops, best of 3: 9.27 ms per loop
    
    In [30]: %timeit (a1(df2))
    100 loops, best of 3: 9.32 ms per loop
    
    In [31]: %timeit (b(df3))
    10 loops, best of 3: 24.6 ms per loop
    
    In [32]: %timeit (b1(df4))
    10 loops, best of 3: 24.6 ms per loop  
    

    len(df)=5k:

    In [38]: %timeit (a(df1))
    1000 loops, best of 3: 274 µs per loop
    
    In [39]: %timeit (a1(df2))
    The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 273 µs per loop
    
    In [40]: %timeit (b(df3))
    The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 295 µs per loop
    
    In [41]: %timeit (b1(df4))
    1000 loops, best of 3: 294 µs per loop
    

    len(df)=5:

    In [46]: %timeit (a(df1))
    1000 loops, best of 3: 206 µs per loop
    
    In [47]: %timeit (a1(df2))
    1000 loops, best of 3: 204 µs per loop
    
    In [48]: %timeit (b(df3))
    The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 164 µs per loop
    
    In [49]: %timeit (b1(df4))
    The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 164 µs per loop
    

    Test code

    d = {'col': ["baked","beans","baked","baked","beans"]}
    df = pd.DataFrame(data=d)
    print (df)
    df = pd.concat([df]*100000).reset_index(drop=True)
    #test for 5k
    #df = pd.concat([df]*1000).reset_index(drop=True)
    
    
    df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()
    
    def a(df):
        df['col'] = pd.factorize(df.col)[0]
        return df
    
    def a1(df):
        idx,_ = pd.factorize(df.col)
        df['col'] = idx
        return df
    
    def b(df):
        df['col'] = np.unique(df['col'],return_inverse=True)[1]
        return df
    
    def b1(df):
        _,idx = np.unique(df['col'],return_inverse=True)
        df['col'] = idx    
        return df
    
    print (a(df1))    
    print (a1(df2))   
    print (b(df3))   
    print (b1(df4))  
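One thing worth noting about the test code above: each function overwrites `df['col']` in the frame it receives, so after the first call the repeated timing runs see integer codes rather than the original strings. A hedged sketch of a timing harness that leaves the input untouched is below; the function names `factorize_codes` and `unique_codes` are made up for illustration, and the numbers it prints will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows

def factorize_codes(col):
    # Returns the codes without modifying the input column
    return pd.factorize(col)[0]

def unique_codes(col):
    # Same idea with NumPy; the labels are sorted before IDs are assigned
    return np.unique(col.to_numpy(), return_inverse=True)[1]

for fn in (factorize_codes, unique_codes):
    # Every run sees the original string column
    seconds = timeit.timeit(lambda: fn(df['col']), number=20)
    print(fn.__name__, seconds / 20)
```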
    

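    【Comments】:

    • If I knew pandas better, maybe I would appreciate it more, but this works too! Perhaps something like idx,_ = pd.factorize(df.col) might be a bit faster? Again, that's just a gut feeling :)
    • I wish I had started learning numpy at some point; it has a lot of nice functions and they are faster. Thank you. Yes, right, I'm going to do some tests.
    • Hmm, interesting: on a large df pandas is faster than numpy.
    • np.unique sorts; pd.factorize does not, and it guarantees that the uniques are returned in order of appearance
    • Even better is to use categoricals, which do exactly this
    【Solution 2】:

    You can use np.unique's optional argument return_inverse to give each string an ID based on its position among the unique strings, and set those IDs in the input data frame, like so -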

    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx
    

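    Note that the IDs correspond to the alphabetically sorted array of unique strings. If you need that unique array as well, you can replace _ with a name for it, like so -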

    uniq_lab,idx = np.unique(df['col'],return_inverse=True)
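To illustrate the note about sorted IDs, here is a small sketch using a hypothetical series whose first label is not the alphabetically first one. Indexing `uniq_lab` with `idx` recovers the original labels, and pd.factorize is shown next to it for contrast, since it numbers labels in order of first appearance instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data where "beans" appears before "baked"
s = pd.Series(["beans", "baked", "beans", "baked"])

uniq_lab, idx = np.unique(s.to_numpy(), return_inverse=True)
print(uniq_lab.tolist())       # ['baked', 'beans']  (sorted)
print(idx.tolist())            # [1, 0, 1, 0]
print(uniq_lab[idx].tolist())  # ['beans', 'baked', 'beans', 'baked']

# pd.factorize assigns IDs in order of first appearance instead
codes, uniques = pd.factorize(s)
print(codes.tolist())          # [0, 1, 0, 1]
print(list(uniques))           # ['beans', 'baked']
```

So the two approaches can assign different IDs to the same label; which one is preferable depends on whether a sorted lookup table matters.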
    

    Sample run -

    >>> d = {'col': ["baked","beans","baked","baked","beans"]}
    >>> df = pd.DataFrame(data=d)
    >>> df
         col
    0  baked
    1  beans
    2  baked
    3  baked
    4  beans
    >>> _,idx = np.unique(df['col'],return_inverse=True)
    >>> df['col'] = idx
    >>> df
       col
    0    0
    1    1
    2    0
    3    0
    4    1
    

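    【Comments】:

    • @jezrael Well, I was just hoping that categorical variables wouldn't have those Nones or NaNs :)
    • Yes, but it is possible in real data. :) By the way, maybe df['col'] = np.unique(df['col'],return_inverse=True)[1] is better
    • @jezrael Well, df['col'] = np.unique(df['col'],return_inverse=True) would compute both the unique labels and the IDs and then select the second element with [1], which I think might cost some performance. With _,idx, I would think it doesn't bother computing the unique labels and so might be a bit faster. There's some gut feeling in there though :)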