【Posted】: 2021-04-27 19:57:01
【Question】:
I'm having trouble reading the columns of a .csv file. I have this code:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Importing the dataset
dataset = pd.read_csv('D:/CTU/ateroskleroza/development/results_output6.csv')
print(dataset.head())
X = dataset.iloc[:, 2:16].values
y = dataset.iloc[:, 0].values
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
classifier = make_pipeline(StandardScaler(), SVC(gamma='auto'))
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Generating accuracy, precision, recall and f1-score
target_names = ['Progressive','Stable']
print(classification_report(y_test, y_pred, target_names=target_names))
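The .csv looks like this:
Depending on the image name, some columns hold values and the others hold NaN. The problem is that when I try to run this code I get this error: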
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
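So how can I ignore the NaN values and use only the numbers? (I don't want to delete the empty columns, just ignore the NaN values when the code runs.)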
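Before choosing how to handle the gaps, it may help to see where they are. The sketch below is only an illustration: the small DataFrame and its column names (label, image, f1, f2, f3) are made up to stand in for the real CSV, which is not shown here. It counts the missing cells per feature column and the rows that contain at least one NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: a label column, an image-name
# column, and a few feature columns with gaps.
dataset = pd.DataFrame({
    "label": [0, 1, 0, 1],
    "image": ["a.png", "b.png", "c.png", "d.png"],
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [np.nan, np.nan, 0.5, 0.7],
    "f3": [2.0, 2.5, 3.0, 3.5],
})

# Same idea as dataset.iloc[:, 2:16] in the question: keep only the features.
features = dataset.iloc[:, 2:]

# Number of missing cells in each feature column.
nan_per_column = features.isna().sum()

# Number of rows that have at least one missing feature.
rows_with_nan = int(features.isna().any(axis=1).sum())

print(nan_per_column)
print("rows with at least one NaN:", rows_with_nan)
```

If only a few rows are affected, dropping them is cheap. If most rows have gaps, replacing the missing values is usually the more practical route.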
【Comments】:
-
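You need a strategy for handling it, for example something like df.fillna(0.0).
-
@simpleApp But then my results will change, right? I would be assigning values to the NaN cells, and if I train on that data they will affect the final result.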
-
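Yes and no. As long as there are NaN values, you need some plan of attack. If the data is too sparse, you either drop it or replace it with some value.
Tags: python scikit-learn missing-data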
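The two options mentioned above (drop, or replace with some value) could look roughly like the sketch below. It is not the asker's data: X and y are synthetic, with about 20% of the cells set to NaN, and the median is just one possible fill value. The replacement is done with scikit-learn's SimpleImputer placed in front of the StandardScaler and SVC steps from the question, so the fill values are learned from the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 40 samples, 4 features,
# roughly 20% of the cells replaced by NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan

# Option A: drop every row that has at least one NaN.
complete = ~np.isnan(X).any(axis=1)
X_complete, y_complete = X[complete], y[complete]
print("rows kept after dropping:", len(X_complete), "of", len(X))

# Option B: keep all rows and fill the gaps inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
classifier = make_pipeline(
    SimpleImputer(strategy="median"),  # per-column median of the training rows
    StandardScaler(),
    SVC(gamma="auto"),
)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("predictions for the test split:", y_pred)
```

Either choice does change what the model sees, which is the concern raised in the comment above: dropping rows shrinks the training set, and filled-in values are treated as real measurements. Comparing both on a held-out split is one way to judge which costs less for a given dataset.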