【Question Title】: Sklearn preprocessing - PolynomialFeatures - How to keep column names/headers of the output array / dataframe
【Posted】: 2016-08-12 05:11:12
【Question】:

TLDR: How do I get headers for the output numpy array returned by sklearn.preprocessing.PolynomialFeatures()?


Suppose I have the following code...

import pandas as pd
import numpy as np
from sklearn import preprocessing as pp

a = np.ones(3)
b = np.ones(3) * 2
c = np.ones(3) * 3

input_df = pd.DataFrame([a,b,c])
input_df = input_df.T
input_df.columns=['a', 'b', 'c']

input_df

    a   b   c
0   1   2   3
1   1   2   3
2   1   2   3

poly = pp.PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)
print(output_nparray)

[[ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]]

How can I make the 3x10 matrix / output_nparray inherit the a, b, c labels, so that each column shows how it relates to the data above?

【Comments】:

    Tags: python python-2.7 validation scikit-learn cross-validation


    【Solution 1】:

    This works:

    def PolynomialFeatures_labeled(input_df,power):
        '''Basically this is a wrapper around the sklearn preprocessing transformer.
        The problem with that transformer is that if you give it a labeled dataframe, it outputs an unlabeled
        array with potentially a whole bunch of unlabeled columns.
    
        Inputs:
        input_df = Your labeled pandas dataframe (list of x's not raised to any power)
        power = the highest polynomial order you want (the same value you would pass to pp.PolynomialFeatures(power) directly)
    
        Output:
        A labeled pandas dataframe. The labels are built from the powers_ matrix, which is one of the
        fitted transformer's attributes.
        '''
        poly = pp.PolynomialFeatures(power)
        output_nparray = poly.fit_transform(input_df)
        powers_nparray = poly.powers_
    
        input_feature_names = list(input_df.columns)
        target_feature_names = ["Constant Term"]
        for feature_distillation in powers_nparray[1:]:
            intermediary_label = ""
            final_label = ""
            for i in range(len(input_feature_names)):
                if feature_distillation[i] == 0:
                    continue
                else:
                    variable = input_feature_names[i]
                    # Use a separate name so the `power` argument is not overwritten
                    exponent = feature_distillation[i]
                    intermediary_label = "%s^%d" % (variable, exponent)
                    if final_label == "":         #If the final label isn't yet specified
                        final_label = intermediary_label
                    else:
                        final_label = final_label + " x " + intermediary_label
            target_feature_names.append(final_label)
        output_df = pd.DataFrame(output_nparray, columns = target_feature_names)
        return output_df
    
    output_df = PolynomialFeatures_labeled(input_df,2)
    output_df
    
        Constant Term   a^1 b^1 c^1 a^2 a^1 x b^1   a^1 x c^1   b^2 b^1 x c^1   c^2
    0               1   1   2   3   1           2           3   4           6   9
    1               1   1   2   3   1           2           3   4           6   9
    2               1   1   2   3   1           2           3   4           6   9
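The core of the answer above is the mapping from each row of the exponent matrix to a label. Here is a minimal sketch of just that step, written without sklearn so it can be run on its own. The function name `labels_from_powers` is hypothetical, and the `powers` list is a hand-written stand-in for what `poly.powers_` holds for three features and degree 2 (an assumption that matches the column order of the output shown above). Unlike the answer's function, this sketch omits the `^1` suffix and labels the constant column `1`.

```python
def labels_from_powers(feature_names, powers):
    """Build one label per row of an exponent matrix shaped like poly.powers_."""
    labels = []
    for row in powers:
        # Keep only the features that actually appear in this term.
        parts = [
            name if exp == 1 else "%s^%d" % (name, exp)
            for name, exp in zip(feature_names, row)
            if exp != 0
        ]
        # An all-zero row corresponds to the constant (bias) column.
        labels.append(" x ".join(parts) if parts else "1")
    return labels


# Hand-written stand-in for poly.powers_ (3 features, degree 2).
powers = [
    [0, 0, 0],
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [2, 0, 0], [1, 1, 0], [1, 0, 1],
    [0, 2, 0], [0, 1, 1], [0, 0, 2],
]

print(labels_from_powers(["a", "b", "c"], powers))
# ['1', 'a', 'b', 'c', 'a^2', 'a x b', 'a x c', 'b^2', 'b x c', 'c^2']
```

With a fitted transformer, the same function could presumably be called as `labels_from_powers(input_df.columns, poly.powers_)`.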
    

    【Comments】:

      【Solution 2】:

      A working example, all on one line (I assume "readability" is not the goal here):

      target_feature_names = ['x'.join(['{}^{}'.format(pair[0],pair[1]) for pair in pairs if pair[1]!=0]) for pairs in [zip(input_df.columns,p) for p in poly.powers_]]
      output_df = pd.DataFrame(output_nparray, columns = target_feature_names)
      

      Update: as @OmerB pointed out, you can now use the get_feature_names method:

      >> poly.get_feature_names(input_df.columns)
      ['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']
      

      【Comments】:

        【Solution 3】:

        scikit-learn 0.18 added a nifty get_feature_names() method!

        >> input_df.columns
        Index(['a', 'b', 'c'], dtype='object')
        
        >> poly.fit_transform(input_df)
        array([[ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
               [ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
               [ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.]])
        
        >> poly.get_feature_names(input_df.columns)
        ['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']
        

        Note that you have to pass it the column names, because sklearn does not read them from the DataFrame on its own.
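The answers in this thread call `get_feature_names()`, which was deprecated in scikit-learn 1.0 and removed in 1.2 in favor of `get_feature_names_out()`. The sketch below wraps the transform and the naming step in one helper that works with either method. The helper name `polynomial_features_df` is hypothetical, and the version check via `hasattr` is just one possible way to support both APIs. In releases that provide `get_feature_names_out()`, the transformer also records the DataFrame's column names at fit time, so passing them explicitly should no longer be required there.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures


def polynomial_features_df(input_df, degree=2):
    """Return the polynomial expansion of input_df as a labeled DataFrame."""
    poly = PolynomialFeatures(degree)
    values = poly.fit_transform(input_df)
    if hasattr(poly, "get_feature_names_out"):
        # scikit-learn 1.0 and later
        names = list(poly.get_feature_names_out(input_df.columns))
    else:
        # older releases (0.18 up to the 0.24 series)
        names = poly.get_feature_names(list(input_df.columns))
    return pd.DataFrame(values, columns=names, index=input_df.index)


input_df = pd.DataFrame({"a": np.ones(3), "b": np.ones(3) * 2, "c": np.ones(3) * 3})
output_df = polynomial_features_df(input_df, 2)
print(output_df.columns.tolist())
# ['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']
```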

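        【Comments】:

          【Solution 4】:

          The get_feature_names() method is good, but when it is called without input names it returns all the variables under generic names such as 'x0', 'x1', 'x0 x1', ... etc. Below is a function that quickly converts the get_feature_names() output to a list of column names formatted as 'Col_1', 'Col_2', 'Col_1 x Col_2':

          In: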

          import re

          import pandas as pd
          from sklearn.preprocessing import PolynomialFeatures

          def PolynomialFeatureNames(sklearn_feature_name_output, df):
              """
              This function takes the output from the .get_feature_names() method on the PolynomialFeatures 
              instance and replaces values with df column names to return output such as 'Col_1 x Col_2'
              
              sklearn_feature_name_output: The list object returned when calling .get_feature_names() on the PolynomialFeatures object
              df: Pandas dataframe with correct column names
              """
              cols = df.columns.tolist()
              # Map sklearn's generic names ('x0', 'x1', ...) to the dataframe's column names
              feat_map = {'x'+str(num):cat for num, cat in enumerate(cols)}
              feat_string = ','.join(sklearn_feature_name_output)
              for k,v in feat_map.items():
                  # \b keeps 'x1' from matching inside 'x10'
                  feat_string = re.sub(fr"\b{k}\b",v,feat_string)
              return feat_string.replace(" "," x ").split(',')
          
          interaction = PolynomialFeatures(degree=2)
          X_inter = interaction.fit_transform(input_df)
          
          # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
          names = PolynomialFeatureNames(interaction.get_feature_names(),input_df)
          
          print(pd.DataFrame(X_inter, columns= names))
          

          Output:

                      1       a       b       c     a^2   a x b   a x c     b^2   b x c  \
          0 1.00000 1.00000 2.00000 3.00000 1.00000 2.00000 3.00000 4.00000 6.00000   
          1 1.00000 1.00000 2.00000 3.00000 1.00000 2.00000 3.00000 4.00000 6.00000   
          2 1.00000 1.00000 2.00000 3.00000 1.00000 2.00000 3.00000 4.00000 6.00000   
          
                c^2  
          0 9.00000  
          1 9.00000  
          2 9.00000
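The renaming step above substitutes one generic name at a time over a comma-joined string, which can go wrong if a column name contains a comma or a space, or itself looks like `x1`. Below is a minimal sketch of an alternative that handles each term token by token in a single pass. The function name `rename_generic_features` is hypothetical, and the sketch assumes the generic names follow the `x<index>` pattern with an optional `^<power>` suffix, with a single space between factors.

```python
import re


def rename_generic_features(generic_names, columns):
    """Replace generic 'x<index>' factors with column names, joined by ' x '."""
    token_pattern = re.compile(r"^x(\d+)")

    def rename_token(token):
        # 'x0^2' -> '<columns[0]>^2'; tokens such as '1' are left unchanged.
        return token_pattern.sub(lambda m: str(columns[int(m.group(1))]), token)

    return [
        " x ".join(rename_token(token) for token in name.split(" "))
        for name in generic_names
    ]


generic = ['1', 'x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
print(rename_generic_features(generic, ['a', 'b', 'c']))
# ['1', 'a', 'b', 'c', 'a^2', 'a x b', 'a x c', 'b^2', 'b x c', 'c^2']
```

Because every token is rewritten exactly once, a column that happens to be named `x1` is not substituted a second time.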
          

          【Comments】:
