【Question title】: Data processing with adding columns dynamically in Python Pandas Dataframe
【Posted】: 2014-05-10 10:40:27
【Question】:

I have the following problem. Suppose this is my CSV:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

So I have rows that can be grouped by id. I want to create a csv like the one below as output.

f1  f2  f3  f1_n  f2_n  f3_n  f1_n_n  f2_n_n  f3_n_n  f1_t  f2_t  f3_t
4   5   5   3     1     0     7       4       4       1     4     6

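So I want to be able to choose the number of rows to convert into columns (always starting from the first row of an id). In this case I am grabbing 3 rows. I then also skip one or more rows (only one is skipped in this case) and take the final columns from the last row of the same id group. For certain reasons, I want to use a data frame.

After struggling for 3-4 hours, I found the solution given below. But my solution is very slow. I have about 700,000 rows and maybe around 70,000 groups of ids. The code below with model=3 takes almost an hour on my 4GB 4 Core Lenovo. I need to go to model = maybe 10 or 15. I am still a Python novice and I am sure several changes could make it faster. Can someone explain in depth how I can improve the code?

Thanks a lot.

model: the number of rows to grab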

import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)

# Get groups of rows with same id
csv_by_id = train.groupby('id')

# Lists (not sets) so that the column order is deterministic
modelTarget = ['f1_t', 'f2_t', 'f3_t']

# modelFeatures is the list of features I am interested in from the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']

coreFeatures = list(modelFeatures) # cloning 


selectedFeatures = list(modelFeatures) # cloning

newFeatures = list(selectedFeatures) # cloning

finalFeatures = list(selectedFeatures) # cloning

# Now create the column list depending on the number of rows I will grab from
for x in range(2,model+1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget) 

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    #id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1] 

    # hmm - can this be better? I am picking up the rows which I need from the first row onwards
    df = group_data[coreFeatures].iloc[:model]

    # initialize a list
    tmp = [] 

    # now keep adding the column values into the list
    for index, row in df.iterrows(): 
        tmp = tmp + list(row)


    # Wow, this one below surely should have something better.
    # So I am picking up the feature column values from the last row of the group of rows for a particular id
    targetValues = group_data[coreFeatures].iloc[-1].values

    # Think this can be done easier too ? . Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten()) 

    # converting the list to a dict.
    tmpDict = dict(zip(selectedFeatures, tmp))

    # then the dict to a one-row dataframe (the index must be an ordered container, not a set).
    tmpDf = pd.DataFrame(tmpDict, index=[1])

    # I just could not find a better way of adding a dict or list directly into a dataframe. 
    # And I went through lots and lots of blogs on this topic, including some in StackOverflow.

    # finally I add the frame to my main frame
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call)
    model_data = pd.concat([model_data, tmpDf])

# and write it
model_data.to_csv(wd+'model_data' + str(model) + '.csv',index=False) 

【Comments】:

    Tags: python pandas dataframe data-processing


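    【Solution 1】:

    Groupby is your friend.

    This will scale very well; the number of features only contributes a small constant factor. The cost is roughly O(number of groups).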

    In [28]: features = ['f1','f2','f3']
    

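    Create some test data: 70k groups, each with 7 to 11 rows (np.random.randint excludes the upper bound). The session assumes `import numpy as np` and `from pandas import DataFrame, Series, concat`.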

    In [29]: def create_df(i):
       ....:     l = np.random.randint(7,12)
       ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
       ....:     df['A'] = i
       ....:     return df
       ....: 
    
    In [30]: df = concat([ create_df(i) for i in range(70000) ])  # xrange on Python 2
    
    In [39]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 629885 entries, 0 to 9
    Data columns (total 4 columns):
    f1    629885 non-null int64
    f2    629885 non-null int64
    f3    629885 non-null int64
    A     629885 non-null int64
    dtypes: int64(4)
    

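    Create a frame in which you select the first 3 rows and the last row from each group (note that this also handles groups with fewer than 4 rows, but in that case the last row overlaps one of the first three; a groupby.filter can be used to address this)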

    In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
    
    # This step is necessary in pandas < master/0.14, as the returned fields
    # will include the grouping field (the A); this is a bug/API issue
    In [33]: groups = groups[features]
    
    In [34]: groups.head(20)
    Out[34]: 
         f1  f2  f3
    A              
    0 0   0   0   0
      1   1   1   1
      2   2   2   2
      7   7   7   7
    1 0   0   0   0
      1   1   1   1
      2   2   2   2
      9   9   9   9
    2 0   0   0   0
      1   1   1   1
      2   2   2   2
      8   8   8   8
    3 0   0   0   0
      1   1   1   1
      2   2   2   2
      8   8   8   8
    4 0   0   0   0
      1   1   1   1
      2   2   2   2
      9   9   9   9
    
    [20 rows x 3 columns]
    
    In [38]: groups.info()
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 280000 entries, (0, 0) to (69999, 9)
    Data columns (total 3 columns):
    f1    280000 non-null int64
    f2    280000 non-null int64
    f3    280000 non-null int64
    dtypes: int64(3)
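
The head/tail selection above, together with the groupby.filter guard mentioned earlier, can be sketched on the question's own sample data. This is only an illustration under assumptions: the rows for id 2 are invented, `n_rows` is a hypothetical name standing in for the question's `model`, and the code targets a recent pandas release, where `head` and `tail` keep the original row index instead of the MultiIndex shown in the transcript.

```python
import pandas as pd

# Toy data modelled on the question's CSV; the rows for id 2 are made up.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 9],
    "f2": [5, 1, 4, 3, 4, 6, 8],
    "f3": [5, 0, 4, 1, 6, 0, 7],
})
n_rows = 3  # hypothetical name for the question's `model`

# Drop groups so small that their last row would also be one of the first n_rows.
big = df.groupby("id").filter(lambda g: len(g) > n_rows)

# First n_rows rows and the last row of every remaining group.
first = big.groupby("id").head(n_rows)
last = big.groupby("id").tail(1)
selected = pd.concat([first, last]).sort_index()

print(selected)
```

Here id 2 has only two rows, so the filter removes it; whether to drop such groups or keep them with an overlapping last row depends on what the output is used for.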
    

    And it is pretty fast

    In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
    1 loops, best of 3: 1.16 s per loop
    

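    For further manipulation you should generally stop here and work with this frame (it is in a nice grouped format that is easy to deal with).

    If you want to translate this to a wide format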

    In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
    
    In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
    1 loops, best of 3: 14.5 s per loop
    
    In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]
    
    In [41]: dfg.head()
    Out[41]: 
       f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
    A                                                                        
    0     0     0     0     1     1     1     2     2     2     7     7     7
    1     0     0     0     1     1     1     2     2     2     9     9     9
    2     0     0     0     1     1     1     2     2     2     8     8     8
    3     0     0     0     1     1     1     2     2     2     8     8     8
    4     0     0     0     1     1     1     2     2     2     9     9     9
    
    [5 rows x 12 columns]
    
    In [42]: dfg.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 70000 entries, 0 to 69999
    Data columns (total 12 columns):
    f1_1    70000 non-null int64
    f2_1    70000 non-null int64
    f3_1    70000 non-null int64
    f1_2    70000 non-null int64
    f2_2    70000 non-null int64
    f3_2    70000 non-null int64
    f1_3    70000 non-null int64
    f2_3    70000 non-null int64
    f3_3    70000 non-null int64
    f1_4    70000 non-null int64
    f2_4    70000 non-null int64
    f3_4    70000 non-null int64
    dtypes: int64(12)
    

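    【Comments】:

    • Wow! This is why I like stackoverflow so much. Jeff, I will go through your answer slowly and get back to you soon. I had made a mistake: I left out the first line of my code, where I actually get csv_by_id using groupby. I am adding that line to my code.
    • Jeff, it worked. It reduced my code to 6 lines. Thanks. These two lines, dfg = groups.groupby(level=0).apply(lambda x: pd.Series(x.values.ravel())) and dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in coreFeatures ], are the killer. Python is an art.
    • gr8. The trick is to always vectorize, avoid loops at all costs, and do the work only once.
    • Sure. I got it. Think in vectors. Thanks.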
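
The "do the work only once" advice in the comments can be illustrated with a small sketch. It is not taken from the thread. Growing a DataFrame inside a loop copies the accumulated frame on every iteration, whereas collecting plain Python lists and constructing the frame a single time avoids that. The sample values for id 2, the name `n_rows` (the question's `model`), and the output column names are assumptions made for the example.

```python
import pandas as pd

features = ["f1", "f2", "f3"]
n_rows = 3  # hypothetical name for the question's `model`

# Sample data: id 1 is taken from the question, id 2 is made up.
df = pd.DataFrame({
    "id": [1] * 5 + [2] * 4,
    "f1": [4, 3, 7, 4, 1, 2, 0, 5, 8],
    "f2": [5, 1, 4, 3, 4, 6, 1, 2, 3],
    "f3": [5, 0, 4, 1, 6, 0, 9, 7, 2],
})

rows = []
for group_id, group in df.groupby("id"):
    values = group[features].to_numpy()
    # First n_rows rows flattened, followed by the last row of the group.
    rows.append(list(values[:n_rows].ravel()) + list(values[-1]))

columns = [f"{f}_{i}" for i in range(1, n_rows + 1) for f in features]
columns += [f"{f}_t" for f in features]

# Build the DataFrame once from the collected rows instead of growing it per group.
model_data = pd.DataFrame(rows, columns=columns)

print(model_data)
```

This still loops over the groups in Python, so the groupby head/tail approach from the answer should remain faster on 70,000 groups; the sketch only shows how to avoid repeated DataFrame concatenation when a loop is kept.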