【Question title】: IndexError: positional indexers are out-of-bounds when using stratify with sklearn train_test_split
【Posted】: 2017-03-31 10:36:10
【Question】:

I am using a pandas DataFrame with train_test_split from the sklearn cross_validation module.

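import numpy as np
import pandas
import sklearn.cross_validation  # old module path; removed in scikit-learn 0.20

# 300 rows: feature 'a' is random noise, label 'c' is 100 ones followed by 200 zeros
d = pandas.DataFrame({'a': np.random.randn(300),
                      'c': np.concatenate([np.ones(100), np.zeros(200)])})
(X, y) = (d['a'], d['c'])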

This works:

X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0)
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0)

Why doesn't this work?

X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0,stratify=y)

in _is_valid_list_like(self, key, axis)
   1536         l = len(ax)
   1537         if len(arr) and (arr.max() >= l or arr.min() < -l):
-> 1538             raise IndexError("positional indexers are out-of-bounds")
   1539 
   1540         return True

IndexError: positional indexers are out-of-bounds

【Comments】:

    Tags: python pandas scikit-learn


    【Answer 1】:

    TL;DR: in your second call to train_test_split, the array passed to stratify has a different length from the y you are splitting. Use stratify=y_train_and_cv.


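    First, a side note: cross_validation (see the 0.17.1 docs) was deprecated in scikit-learn 0.18 and removed in 0.20, so you should use model_selection.train_test_split (0.18.1 and later) instead. I import train_test_split itself to keep the snippets below short: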

    # Same as this in older versions:
    # from sklearn.cross_validation import train_test_split
    from sklearn.model_selection import train_test_split 
    

    This is fine:

    X_train_and_cv, X_test,y_train_and_cv,y_test = train_test_split(X,y,
                                                                    test_size=0.2,
                                                                    random_state=0,
                                                                    stratify=y)
    

    This is not fine, because the labels being split are y_train_and_cv (length 240) while stratify=y has length 300:

    X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv,
                                                  y_train_and_cv,
                                                  test_size=0.2,
                                                  random_state=0,
                                                  stratify=y)
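
To see the mismatch concretely, here is a minimal sketch that rebuilds the question's data and reproduces the failing call. It assumes scikit-learn 0.18 or later (for sklearn.model_selection). Which exception is raised is an assumption that depends on the installed versions: the question's setup reports an IndexError from pandas, while newer scikit-learn releases are expected to reject the inconsistent lengths with a ValueError, so the sketch catches both.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same shape of data as in the question: 100 ones followed by 200 zeros.
d = pd.DataFrame({'a': np.random.randn(300),
                  'c': np.concatenate([np.ones(100), np.zeros(200)])})
X, y = d['a'], d['c']

X_train_and_cv, X_test, y_train_and_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The arrays being split now have 240 rows, but y still has 300.
print(len(X_train_and_cv), len(y_train_and_cv), len(y))  # 240 240 300

caught = None
try:
    # stratify=y describes 300 samples, yet only 240 are being split.
    train_test_split(X_train_and_cv, y_train_and_cv, test_size=0.2,
                     random_state=0, stratify=y)
except (ValueError, IndexError) as exc:
    # Assumption: the exact exception type varies between library versions.
    caught = exc

print(type(caught).__name__, caught)
```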
    

    Replace it with:

    X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv,
                                                  y_train_and_cv,
                                                  test_size=0.2,
                                                  random_state=0,
                                                  stratify=y_train_and_cv)
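
One way to avoid this kind of slip is to wrap both calls in a small function so that each split always stratifies on the labels it is actually splitting. The following is only a sketch: three_way_split and its parameter names are hypothetical, not part of scikit-learn, and it assumes scikit-learn 0.18 or later.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_size=0.2, cv_size=0.2, random_state=0):
    """Hypothetical helper: stratified train / cv / test split."""
    # First split: hold out the test set, stratifying on the full labels.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)
    # Second split: stratify on y_rest, the labels that are being split now.
    X_train, X_cv, y_train, y_cv = train_test_split(
        X_rest, y_rest, test_size=cv_size, random_state=random_state,
        stratify=y_rest)
    return X_train, X_cv, X_test, y_train, y_cv, y_test


# Same shape of data as in the question: 100 ones followed by 200 zeros.
d = pd.DataFrame({'a': np.random.randn(300),
                  'c': np.concatenate([np.ones(100), np.zeros(200)])})

X_train, X_cv, X_test, y_train, y_cv, y_test = three_way_split(d['a'], d['c'])

# Report the size and the share of ones in each part.
for name, part in [('train', y_train), ('cv', y_cv), ('test', y_test)]:
    print(name, len(part), round(part.mean(), 3))
```

With 300 rows and both fractions at 0.2, the parts have 192, 48 and 60 rows, and each should keep roughly the original one-third share of ones.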
    

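    【Comments】:

    • Wow, I now realize I was reading y as a string rather than a variable argument (for example, stratify='yes') and assuming it would infer the array to stratify on from the second argument.
    • Ah! That would be stratify=True :)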