【Question title】: sklearn oneclass svm KeyError
【Posted】: 2020-05-14 22:05:15
【Question】:

My dataset is a set of malware and benign system-call traces. After preprocessing, each trace looks like this:

NtQueryPerformanceCounter
NtProtectVirtualMemory
NtProtectVirtualMemory
NtQuerySystemInformation
NtQueryVirtualMemory
NtQueryVirtualMemory
NtProtectVirtualMemory
NtOpenKey
NtOpenKey
NtOpenKey
NtQuerySecurityAttributesToken
NtQuerySecurityAttributesToken
NtQuerySystemInformation
NtQuerySystemInformation
NtAllocateVirtualMemory
NtFreeVirtualMemory

I now use TF-IDF to extract features, with n-grams to build sequences of calls:

from __future__ import print_function

import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.svm import OneClassSVM

nGRAM1 = 8
nGRAM2 = 10
weight = 4

main_corpus_MAL = []
main_corpus_target_MAL = []
main_corpus_BEN = []
main_corpus_target_BEN = []

my_categories = ['benign', 'malware']

# feeding corpus the testing data

print("Loading system call database for categories:")
print(my_categories if my_categories else "all")

import glob
import os

malCOUNT = 0
benCOUNT = 0
for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysMAL', '*.txt')):
    fMAL = open(filename, "r")
    aggregate = ""
    for line in fMAL:
        linea = line[:(len(line)-1)]
        aggregate += " " + linea
    main_corpus_MAL.append(aggregate)
    main_corpus_target_MAL.append(1)
    malCOUNT += 1

for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysBEN', '*.txt')):
    fBEN = open(filename, "r")
    aggregate = ""
    for line in fBEN:
        linea = line[:(len(line) - 1)]
        aggregate += " " + linea
    main_corpus_BEN.append(aggregate)
    main_corpus_target_BEN.append(0)
    benCOUNT += 1

# weight as determined in the top of the code
train_corpus = main_corpus_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
train_corpus_target = main_corpus_target_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]

def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

# size of datasets
train_corpus_size_mb = size_mb(train_corpus)
test_corpus_size_mb = size_mb(test_corpus)

print("%d documents - %0.3fMB (training set)" % (
    len(train_corpus_target), train_corpus_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(test_corpus_target), test_corpus_size_mb))
print("%d categories" % len(my_categories))
print()
print("Benign Traces: "+str(benCOUNT)+" traces")
print("Malicious Traces: "+str(malCOUNT)+" traces")
print()

print("Extracting features from the training data using a sparse vectorizer...")
t0 = time()

vectorizer = TfidfVectorizer(ngram_range=(nGRAM1, nGRAM2), min_df=1, use_idf=True, smooth_idf=True) ##############

analyze = vectorizer.build_analyzer()

X_train = vectorizer.fit_transform(train_corpus)

duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, train_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer...")
t0 = time()
X_test = vectorizer.transform(test_corpus)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, test_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

The output is:

Loading system call database for categories:
['benign', 'malware']
177 documents - 45.926MB (training set)
44 documents - 12.982MB (test set)
2 categories

Benign Traces: 72 traces
Malicious Traces: 150 traces

Extracting features from the training data using a sparse vectorizer...
done in 7.831695s at 5.864MB/s
n_samples: 177, n_features: 603170

Extracting features from the test data using the same vectorizer...
done in 1.624100s at 7.993MB/s
n_samples: 44, n_features: 603170

For the learning part, I tried sklearn's OneClassSVM:

print("==================\n")
print("Training: ")
classifier = OneClassSVM(kernel='linear', gamma='auto')
classifier.fit(X_test)

fraud_pred = classifier.predict(X_test)

unique, counts = np.unique(fraud_pred, return_counts=True)
print (np.asarray((unique, counts)).T)

fraud_pred = pd.DataFrame(fraud_pred)
fraud_pred= fraud_pred.rename(columns={0: 'prediction'})
main_corpus_target = pd.DataFrame(main_corpus_target)
main_corpus_target= main_corpus_target.rename(columns={0: 'Category'})

This is the output of fraud_pred and main_corpus_target:

prediction
0   1
1  -1
2   1
3   1
4   1
5  -1
6   1
7  -1
...
30 rows * 1 column
====================
Category
0   1
1   1
2   1
3   1
4   1
...
217 0
218 0
219 0
220 0
221 0
222 rows * 1 column

But when I try to compute TP, TN, FP, and FN:

##Performance check of the model

TP = FN = FP = TN = 0
for j in range(len(main_corpus_target)):
    if main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == 1:
        TP = TP+1
    elif main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == -1:
        FN = FN+1
    elif main_corpus_target['Category'][j]== 1 and fraud_pred['prediction'][j] == 1:
        FP = FP+1
    else:
        TN = TN +1
print (TP,  FN,  FP,  TN)

I get this error:

KeyError                                  Traceback (most recent call last)
<ipython-input-32-1046cc75ba83> in <module>
      7     elif main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == -1:
      8         FN = FN+1
----> 9     elif main_corpus_target['Category'][j]== 1 and fraud_pred['prediction'][j] == 1:
     10         FP = FP+1
     11     else:

c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
   1069         key = com.apply_if_callable(key, self)
   1070         try:
-> 1071             result = self.index.get_value(self, key)
   1072 
   1073             if not is_scalar(result):

c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4728         k = self._convert_scalar_indexer(k, kind="getitem")
   4729         try:
-> 4730             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4731         except KeyError as e1:
   4732             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 30

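1) I know the error occurs because the code tries to access a key that is not in the dictionary, but I can't just insert some numbers into fraud_pred to work around it. Any suggestions?
2) What am I doing wrong that makes them not match?
3) I'd like to compare the results with other one-class classification algorithms. Given my approach, which algorithm would be the best one to use?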

【Question discussion】:

    Tags: machine-learning scikit-learn tf-idf n-gram one-class-classification


    【Solution 1】:

    Edit: before computing the metrics:

    You can replace the separate fit and predict calls with:

    fraud_pred = classifier.fit_predict(X_test)
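
As a rough illustration of the suggestion above, the sketch below compares the two call styles on toy Gaussian data (the array `X` and its shape are assumptions, not the asker's TF-IDF matrix). It also shows that the prediction array always has one entry per row of the matrix passed in, which is why it has to be scored against labels of that same length.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy data standing in for X_test (assumption: 40 samples, 5 features).
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 5))

# Two-step version, as in the question: fit, then predict on the same matrix.
separate = OneClassSVM(kernel="linear").fit(X).predict(X)

# One-step version suggested in the answer.
combined = OneClassSVM(kernel="linear").fit_predict(X)

# Both return one label per input row: 1 for inliers, -1 for outliers.
print(np.array_equal(separate, combined))
print(len(combined) == X.shape[0])
```

With kernel='linear', the gamma argument in the question's code has no effect, since gamma only applies to the 'rbf', 'poly' and 'sigmoid' kernels.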
    

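    Also, main_corpus_target and X_test should have the same length. Could you post the code where you create main_corpus_target?

    It is created right after benCOUNT += 1: main_corpus_target = main_corpus_target_MAL, then main_corpus_target.extend(main_corpus_target_BEN)

    This means you are building a main_corpus_target that contains both MAL and BEN, and the error you get is: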

    ValueError: Found input variables with inconsistent numbers of samples: [30, 222]
    

    fraud_pred has 30 samples, so you should evaluate it against an array of length 30. main_corpus_target contains 222.
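
The length mismatch described above can be reproduced in isolation. This is a minimal sketch with made-up values: the two DataFrames only mimic the shapes reported in the question (30 predictions, 222 labels), and `first_missing_label` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-ins with the shapes from the question: 30 predictions, 222 labels.
fraud_pred = pd.DataFrame({"prediction": [1, -1] * 15})
main_corpus_target = pd.DataFrame({"Category": [1] * 150 + [0] * 72})

def first_missing_label(pred, target):
    """Return the first index label found in `target` but not in `pred`, or None."""
    missing = target.index.difference(pred.index)
    return int(missing[0]) if len(missing) else None

print(len(fraud_pred), len(main_corpus_target))
print(first_missing_label(fraud_pred, main_corpus_target))

# Looping over range(len(main_corpus_target)) eventually asks fraud_pred for
# label 30, which its index (0..29) does not contain.
try:
    fraud_pred["prediction"][30]
except KeyError as err:
    print("KeyError:", err)
```

Comparing the two lengths before the loop would surface the problem earlier than the KeyError does.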

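    Looking at your code, I see that you want to evaluate X_test, which comes from test_corpus (X_test = vectorizer.transform(test_corpus)). It is better to compare your results with test_corpus_target, the target variable for that dataset, which also has length 30. These two lines of yours should produce outputs of the same length: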

    test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
    test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]
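
To check that claim, the slice arithmetic can be run on placeholder data. The sketch below assumes 150 malware traces (the count printed in the question) and weight = 4; the document strings are dummies, and `held_out_tail` is a hypothetical helper that restates the two slices above.

```python
weight = 4

# Placeholder corpus: 150 dummy malware documents, each labelled 1.
main_corpus_MAL = ["trace %d" % i for i in range(150)]
main_corpus_target_MAL = [1] * len(main_corpus_MAL)

def held_out_tail(items, weight):
    """Keep the last len(items) // (weight + 1) items, as the two slices above do."""
    n_test = len(items) // (weight + 1)
    return items[len(items) - n_test:]

test_corpus = held_out_tail(main_corpus_MAL, weight)
test_corpus_target = held_out_tail(main_corpus_target_MAL, weight)

# Both lists are cut at the same position, so their lengths agree.
print(len(test_corpus), len(test_corpus_target))
```

Under these assumptions both lengths come out as 150 // 5 = 30, matching the 30 rows of fraud_pred.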
    

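    May I ask why you are computing TP, TN, ... by hand?

    There is a faster option:

    1. Convert the fraud_pred series, replacing -1 with 0.
    2. Use the confusion matrix function that sklearn offers.
    3. Use ravel to extract the values of the confusion matrix.

    An example, after converting -1 to 0: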

    from sklearn.metrics import confusion_matrix
    tn, fp, fn, tp = confusion_matrix(fraud_pred, main_corpus_target['Category'].values).ravel()
    

    Alternatively, if you are using a recent pandas version:

    from sklearn.metrics import confusion_matrix
    tn, fp, fn, tp = confusion_matrix(fraud_pred, main_corpus_target['Category'].to_numpy()).ravel()
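
The three steps can be put together as a small self-contained sketch. The arrays here are toy values (assumptions), standing in for test_corpus_target and the classifier's predictions. Note that scikit-learn documents the signature as confusion_matrix(y_true, y_pred), so the true labels go first; passing them in the other order transposes the matrix and swaps fp with fn. Whether an inlier prediction should correspond to label 1 depends on which class the model was trained on, so treat the mapping below as illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy data (assumption): true labels and raw OneClassSVM-style predictions
# of the same length, with 1 = inlier and -1 = outlier.
y_true = np.array([1, 1, 1, 0, 0, 1])
raw_pred = np.array([1, -1, 1, -1, 1, 1])

# Step 1: replace -1 with 0 so predictions use the same 0/1 coding as the labels.
y_pred = np.where(raw_pred == -1, 0, raw_pred)

# Steps 2 and 3: true labels first, predictions second. labels=[0, 1] keeps the
# matrix 2x2 even if one class is absent, so ravel() always yields four values.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 1 1 1 3
```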
    

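    【Discussion】:

    • It is created right after benCOUNT += 1: main_corpus_target = main_corpus_target_MAL, then main_corpus_target.extend(main_corpus_target_BEN) @Noki
    • @AliCross Looking at your code, I see that you want to evaluate X_test, which comes from test_corpus (you vectorize test_corpus to get X_test). Wouldn't it be better to compare your results with test_corpus_target, the target variable for that dataset? The two have the same length, so you can compare the algorithms fairly.
    • I took your suggestion and made the correction, thank you. But I still get the error, just worded differently: ValueError: Found input variables with inconsistent numbers of samples: [30, 222]. I don't know how to fix this.
    • @AliCross fraud_pred has 30 samples, so you should evaluate it against an array of length 30. main_corpus_target contains 222. Could you use the data in test_corpus_target instead?
    • Well, I think you'll be as confused as I am: when I run test_corpus_target = pd.DataFrame(test_corpus_target) and then test_corpus_target = test_corpus_target.rename(columns={0: 'Category'}), I get 102 rows! And when I run train_corpus_target = pd.DataFrame(train_corpus_target) and then train_corpus_target = train_corpus_target.rename(columns={0: 'Category'}), I get 57 rows! @Noki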