【Question title】: Python machine learning trained classifier error: index is out of bounds
【Posted】: 2017-12-22 09:32:36
【Question】:

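I have a trained classifier that has been working well.

I tried to modify it to process multiple .csv files in a loop, but this has broken it, to the point where the original code (which worked fine) now returns the same error on .csv files that it previously processed without any problems.

I am very confused and cannot see why this error would suddenly appear when everything worked before. The original (working) code is: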

    # -*- coding: utf-8 -*-

    import csv
    import pandas
    import numpy as np
    import sklearn.ensemble as ske
    import re
    import os
    import collections
    import pickle
    from sklearn.externals import joblib
    from sklearn import model_selection, tree, linear_model, svm


    # Load dataset
    url = 'test_6_During_100.csv'
    dataset = pandas.read_csv(url)
    dataset.set_index('Name', inplace = True)
    ##dataset = dataset[['ProcessorAffinity','ProductVersion','Handle','Company',
    ##            'UserProcessorTime','Path','Product','Description',]]

    # Open file to output everything to
    new_url = re.sub(r'\.csv$', '', url)
    f = open(new_url + " output report", 'w')
    f.write(new_url + " output report\n")
    f.write("\n")


    # shape
    print(dataset.shape)
    print("\n")
    f.write("Dataset shape " + str(dataset.shape) + "\n")
    f.write("\n")

    clf = joblib.load(os.path.join(
            os.path.dirname(os.path.realpath(__file__)),
            'classifier/classifier.pkl'))


    Class_0 = []
    Class_1 = []
    prob = []

    for index, row in dataset.iterrows():
        res = clf.predict([row])
        if res == 0:
            if index in Class_0:
                Class_0.append(index)
            elif index in Class_1:
                Class_1.append(index)           
            else:
                print "Is ", index, " recognised?"
                designation = raw_input()

                if designation == "No":
                    Class_0.append(index)
                else:
                    Class_1.append(index)

    dataset['Type']  = 1                    
    dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

    print "\n"

    results = []

    results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
    print (results)

    X = dataset.drop(['Type'], axis=1).values
    Y = dataset['Type'].values


    clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
    clf.fit(X, Y)
    joblib.dump(clf, 'classifier/classifier.pkl')

    output = collections.Counter(Class_0)

    print "Class_0; \n"
    f.write ("Class_0; \n")

    for key, value in output.items():    
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"
    f.write ("\n") 

    output_1 = collections.Counter(Class_1)

    print "Class_1; \n"
    f.write ("Class_1; \n")

    for key, value in output_1.items():    
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n" 

    f.close()

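My new code is the same, but wrapped in a couple of nested loops to keep the script running while there are files in the folder to process. The new code (the one that triggers the error) is as follows: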

    # -*- coding: utf-8 -*-

    import csv
    import pandas
    import numpy as np
    import sklearn.ensemble as ske
    import re
    import os
    import time
    import collections
    import pickle
    from sklearn.externals import joblib
    from sklearn import model_selection, tree, linear_model, svm

    # Our arrays which we'll store our process details in and then later print out data for
    Class_0 = []
    Class_1 = []
    prob = []
    results = []

    # Open file to output our report to
    timestr = time.strftime("%Y%m%d%H%M%S")

    f = open(timestr + " output report.txt", 'w')
    f.write(timestr + " output report\n")
    f.write("\n")

    count = len(os.listdir('.'))

    while (count > 0):
        # Load dataset
        for filename in os.listdir('.'):
            if filename.endswith('.csv') and filename.startswith("processes_"):

                url = filename

                dataset = pandas.read_csv(url)
                dataset.set_index('Name', inplace = True)

                clf = joblib.load(os.path.join(
                        os.path.dirname(os.path.realpath(__file__)),
                        'classifier/classifier.pkl'))               

                for index, row in dataset.iterrows():
                    res = clf.predict([row])
                    if res == 0:
                        if index in Class_0:
                            Class_0.append(index)
                        elif index in Class_1:
                            Class_1.append(index)           
                        else:
                            print "Is ", index, " recognised?"
                            designation = raw_input()

                            if designation == "No":
                                Class_0.append(index)
                            else:
                                Class_1.append(index)

                dataset['Type']  = 1                    
                dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

                print "\n"

                results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
                print (results)

                X = dataset.drop(['Type'], axis=1).values
                Y = dataset['Type'].values


                clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
                clf.fit(X, Y)
                joblib.dump(clf, 'classifier/classifier.pkl')

                os.remove(filename) 


    output = collections.Counter(Class_0)

    print "Class_0; \n"
    f.write ("Class_0; \n")

    for key, value in output.items():    
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"
    f.write ("\n") 

    output_1 = collections.Counter(Class_1)

    print "Class_1; \n"
    f.write ("Class_1; \n")

    for key, value in output_1.items():    
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n" 

    f.close()
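
A side note on the loop structure, separate from the IndexError: as posted, `count` is computed once before the `while` and is not updated inside it, so the outer loop does not appear to have a way to end. The following is a minimal sketch of one alternative, in which the list of matching files is recomputed on every pass and the loop stops once none are left. The names `pending_csv_files`, `drain_folder` and the `handle_file` callback are hypothetical and not part of the original script; the sketch is written for Python 3.

```python
import os


def pending_csv_files(folder, prefix="processes_"):
    """Return the sorted names of matching .csv files currently in `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        if name.startswith(prefix) and name.endswith(".csv")
    )


def drain_folder(folder, handle_file):
    """Process and delete matching files until none are left.

    `handle_file` is a hypothetical callback that takes the full file path.
    Returns the list of file names that were processed.
    """
    processed = []
    while True:
        batch = pending_csv_files(folder)
        if not batch:
            break  # nothing left to process, so the loop can end
        for name in batch:
            path = os.path.join(folder, name)
            handle_file(path)
            os.remove(path)
            processed.append(name)
    return processed
```

Here the per-file work (loading the CSV, predicting, updating the report) would live in whatever function is passed as `handle_file`, which keeps the folder-watching logic apart from the classification logic.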

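The error (IndexError: index 1 is out of bounds for size 1) points to the prediction line res = clf.predict([row]). As far as I can tell, the problem is that the data does not have enough "classes" or label types (I am using a binary classifier)? But I had been using this exact approach (outside the nested loops) without any problems.

https://codeshare.io/Gkpb44 - Codeshare link containing my .csv data for the .csv files mentioned above.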

【Comments】:

    Tags: python machine-learning classification svm


    【Answer 1】:

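    The problem is that [row] is an array of length 1. Your program tries to access index 1, which does not exist (indices start at 0). It looks like you may want res = clf.predict(row), or you may want to take another look at the row variable. Hope this helps.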
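
For context when inspecting the `row` variable: scikit-learn estimators generally expect a 2-D array of shape (n_samples, n_features) as input to `predict`. The sketch below uses only NumPy to show the shapes involved; the feature values are made up for illustration and do not come from the question's data.

```python
import numpy as np

# Hypothetical feature values for one process (one row of the CSV).
row = np.array([3.0, 1.0, 42.0])

# Wrapping the row in a list gives one sample with three features.
single_sample = np.asarray([row])
print(single_sample.shape)

# reshape(1, -1) builds the same 2-D layout explicitly.
reshaped = row.reshape(1, -1)
print(reshaped.shape)

# The bare row is 1-D: three values and no separate sample axis.
print(row.shape)
```

Printing the shape of whatever is passed to `clf.predict` is a quick way to check whether the input has the layout the classifier was trained on.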

    【Comments】:

      【Answer 2】:

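      So I have worked out what the problem was.

      I had set things up so that the classifier is loaded and then, using warm_start, refitted on the new data to update it, in an attempt to simulate incremental/online learning. This worked well when the data I processed contained both classes. However, if the data contains only positives, refitting the classifier breaks it.

      For now I have commented out the following: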

      clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
      clf.fit(X, Y)
      joblib.dump(clf, 'classifier/classifier.pkl')
      

      which solved the problem. Going forward, I will probably add (yet another!) conditional statement to check whether I should refit the data.
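
The conditional mentioned above could look roughly like the following sketch: refit only when the new labels contain both classes, and skip the refit otherwise. The helper names `has_both_classes` and `refit_if_safe` are hypothetical, the labels 0 and 1 are assumed from the `Type` column in the question, and `clf` is assumed to be a fitted scikit-learn ensemble exposing `estimators_`, `set_params` and `fit`. This is one possible guard, not a verified fix for every case.

```python
def has_both_classes(labels, expected=(0, 1)):
    """Return True only if every expected class label occurs in `labels`."""
    present = set(int(v) for v in labels)
    return all(c in present for c in expected)


def refit_if_safe(clf, X, Y, extra_trees=40):
    """Grow the warm-started ensemble only when Y contains both classes.

    `clf` is assumed to be a fitted scikit-learn ensemble, as in the question.
    Returns True if the classifier was refitted, False if the refit was skipped.
    """
    if not has_both_classes(Y):
        return False  # single-class batch: leave the classifier untouched
    clf.set_params(n_estimators=len(clf.estimators_) + extra_trees,
                   warm_start=True)
    clf.fit(X, Y)
    return True
```

In the processing loop, the `joblib.dump` call would then run only when `refit_if_safe(clf, X, Y)` returns True, so a single-class file never overwrites the saved classifier.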

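      I was tempted to delete this question, but since I did not find anything covering this during my searches, I thought I would leave the answer here in case someone else runs into the same problem.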

      【Comments】:
