【Question title】: Unable to run logistic regression due to "perfect separation error"
【Posted】: 2016-04-12 15:40:35
【Question】:

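I'm a beginner at data analysis in Python and am stuck on this particular task. I have searched fairly extensively but can't pin down the problem.

I imported a file, loaded it into a DataFrame, and cleaned the data. However, when I try to fit my model to the data, I get:

Perfect separation detected, results not available

Here is the code: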

from scipy import stats
import numpy as np
import pandas as pd 
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

# to_csv returns None when given a path, so the result must not be assigned back to loansData
loansData.to_csv('loansData_clean.csv', header=True, index=False)

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x:  round(float(x.rstrip('%')) / 100, 4))
loanlength = loansData['Loan.Length'].map(lambda x: x.strip('months'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: x.split('-'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: int(x[0]))
loansData['FICO.Score'] = loansData['FICO.Range']

#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = pd.Series('', index=loansData.index)
loansData['IR_TF'] = loansData['Interest.Rate'].map(lambda x: True if x < 12 else False)

#create intercept column
loansData['Intercept'] = pd.Series(1.0, index=loansData.index)

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept'] 

#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

#fit the model
result = logit.fit()

#get fitted coef
coeff = result.params

print coeff

Any help would be much appreciated!

Thanks, A

【Question discussion】:

    Tags: python numpy pandas matplotlib logistic-regression


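    【Answer 1】:

    You get a PerfectSeparationError because loansData['IR_TF'] contains only one value, True (or 1). You first convert the interest rate from a percentage to a decimal fraction, so every rate is far below 12 and the test x < 12 is always true. IR_TF should instead be defined as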

    loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map 
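
    One way to check this diagnosis is to count the distinct values in the target column. The sketch below uses a few made-up percent strings in the same format as the raw file (they are not rows from loansData.csv) and compares the two thresholds:

```python
import pandas as pd

# Made-up rates in the raw file's format (percent strings); not real rows
raw = pd.Series(['8.90%', '12.12%', '21.98%', '9.99%'])

# Same kind of conversion as in the question: percent string -> decimal fraction
rate = raw.str.rstrip('%').astype(float) / 100

wrong = rate < 12      # decimals are all far below 12, so every row is True
right = rate < 0.12    # threshold on the same scale as the converted data

print(wrong.nunique())  # distinct values in the target built with 12
print(right.nunique())  # distinct values in the target built with 0.12
print(right.tolist())
```

    If the target has only one distinct value, there is nothing for the model to separate, so it is worth running a check like this before calling fit().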
    

    With that change, the regression runs successfully:

    Optimization terminated successfully.
             Current function value: 0.319503
             Iterations 8
    FICO.Score           0.087423
    Amount.Requested    -0.000174
    Intercept          -60.125045
    dtype: float64
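
    To get a feel for what these coefficients mean, they can be plugged into the logistic function by hand. This is only a sketch: the coefficients are copied from the output above, while the function name and the example borrower (FICO score 720, 10,000 requested) are made up for illustration.

```python
import math

# Coefficients copied from the fitted output above
coef_fico = 0.087423
coef_amount = -0.000174
intercept = -60.125045

def prob_rate_below_12(fico_score, amount_requested):
    """Apply the logistic function to the linear predictor."""
    z = intercept + coef_fico * fico_score + coef_amount * amount_requested
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical borrower, chosen only for illustration
p = prob_rate_below_12(720, 10000)
print(round(p, 3))
```

    For this hypothetical borrower the model gives a probability of roughly 0.75 that the interest rate is below 12%. A lower FICO score or a larger requested amount pushes that probability down, which matches the signs of the coefficients.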
    

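    Also, I noticed several places where the code could be easier to read and/or somewhat faster. In particular, .map is likely to be slower than vectorized operations. Here are my suggestions: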

    from scipy import stats
    import numpy as np
    import pandas as pd 
    import collections
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    
    loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
    
    ## cleaning the file
    loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
    
    loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int)  --> loanlength not used below
    
    loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)
    
    #add interest rate less than column and populate
    ## we only care about interest rates less than 12%
    loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12
    
    #create intercept column
    loansData['Intercept'] = 1.0
    
    # create list of ind var col names
    ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept'] 
    
    #define logistic regression
    logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])
    
    #fit the model
    result = logit.fit()
    
    #get fitted coef
    coeff = result.params
    
    #print(coeff)
    print(result.summary())  # the summary has more information than the coefficients alone
    
    
    Logit Regression Results                           
    ==============================================================================
    Dep. Variable:                  IR_TF   No. Observations:                 2500
    Model:                          Logit   Df Residuals:                     2497
    Method:                           MLE   Df Model:                            2
    Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
    Time:                        23:15:54   Log-Likelihood:                -798.76
    converged:                       True   LL-Null:                       -1679.2
                                            LLR p-value:                     0.000
    ====================================================================================
                           coef    std err          z      P>|z|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------------
    FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
    Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
    Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
    ====================================================================================
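
    The vectorized string methods suggested above should give the same cleaned values as the original .map calls. A small check on made-up rows (same string formats as the file, but not real data) might look like this:

```python
import pandas as pd

# Made-up rows in the same string formats as loansData.csv
df = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})

# Element-wise version, as in the question
rate_map = df['Interest.Rate'].map(lambda x: round(float(x.rstrip('%')) / 100, 4))
fico_map = df['FICO.Range'].map(lambda x: int(x.split('-')[0]))

# Vectorized version, as in the answer
rate_vec = df['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
fico_vec = df['FICO.Range'].str.split('-', expand=True)[0].astype(int)

# Rates are compared with a tolerance because rounding before or after the
# division can differ in the last floating-point bit
same_rates = ((rate_map - rate_vec).abs() < 1e-9).all()
same_fico = (fico_map == fico_vec).all()
print(same_rates, same_fico)
```

    Both flags should come out true. On a frame of this size the speed difference is negligible, so the main gain here is readability.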
    

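    By the way, is this P2P lending data?

    【Discussion】:

    • This is great, thank you! And thanks for the suggestions. I used .map because it's what I'm most familiar with, but I know I should also be using list comprehensions. Yes, it is P2P data, but it is public data and not the most recent. I'm just using it to teach myself some basic data analysis.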