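【Posted】: 2020-10-19 03:28:27
【Question】:
I did the following:
- Split the data into test and training sets.
- Made sure the test data and training data have no rows in common.
- Upsampled so that the training data has the same number of "Yes" and "No" labels.
However, I always get a perfect score of 1.0. Why is that?
Here is the complete code: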
from sklearn.tree import DecisionTreeClassifier
from random import randrange
import numpy as np
import seaborn as sns
import pandas as pd
# import pandas.util.testing as tm  # unused below; this module was removed in pandas 2.0
import matplotlib.pyplot as plt
from sklearn import preprocessing
Reference (for converting text to numbers):
https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/discussion/86957
url = "https://raw.githubusercontent.com/furkan-ozbudak/machine-learning/master/input.csv"
# Import data
dataFrame = pd.read_csv(url)
# Drop non-priority features/columns
dataFrame = dataFrame.drop(columns=['Education', 'EmployeeCount', 'NumCompaniesWorked', 'Over18'])
features = [
    'Attrition',
    'BusinessTravel',
    'Department',
    'EducationField',
    'Gender',
    'JobRole',
    'MaritalStatus',
    'OverTime'
]
stringToNumericDict = {
    "Yes": 1, "No": 0, "Y": 1, "N": 0,
    "Non-Travel": 0, "Travel_Frequently": 2, "Travel_Rarely": 3,
    "Research & Development": 2, "Human Resources": 1, "Sales": 3,
    "Life Sciences": 2, "Medical": 4, "Other": 5, "Marketing": 3, "Technical Degree": 6,
    "Male": 2, "Female": 1,
    "Laboratory Technician": 3, "Healthcare Representative": 1, "Manufacturing Director": 5,
    "Sales Executive": 8, "Research Scientist": 7, "Research Director": 6, "Sales Representative": 9,
    "Manager": 4,
    "Married": 2, "Divorced": 1, "Single": 3,
}
# Convert string categories to numeric codes
for feature in features:
    dataFrame[feature] = dataFrame[feature].replace(stringToNumericDict)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
y = dataFrame['Attrition']
Here I split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(dataFrame, y, test_size=0.3, random_state=1)
Here I upsample the training data because the majority of its labels are "No". After this step, "Yes" and "No" each make up 50%.
from sklearn.utils import resample
df_majority = X_train[X_train['Attrition']==0] # 0 = No
df_minority = X_train[X_train['Attrition']==1]
print("Count of 'No': %d(majority), Count of Yes: %d(minority)" % (len(df_majority), len(df_minority)))
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=869,    # to match majority class
                                 random_state=50)  # reproducible results
# Combine majority class with upsampled minority class
X_train = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
X_train['Attrition'].value_counts()
# Update y_train because X_train changed
y_train = X_train['Attrition'].values
sns.countplot(x=X_train['Attrition'])
all_cols = list(X_train.columns)
X_train.merge(X_test.drop_duplicates(subset=all_cols), how='inner')
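The merge above returns the rows that appear in both frames, but its result is never stored or inspected. A minimal sketch of turning that check into an explicit count is shown here; `train_demo` and `test_demo` are hypothetical toy frames standing in for `X_train` and `X_test`, not the real data. Both sides are de-duplicated first, since an upsampled training set contains repeated rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test.
train_demo = pd.DataFrame({"a": [1, 2, 2, 3], "b": [10, 20, 20, 30]})
test_demo = pd.DataFrame({"a": [3, 4], "b": [30, 40]})

# With no `on` argument, merge joins on all columns the frames share,
# so an inner merge keeps only rows that are identical in both frames.
overlap = train_demo.drop_duplicates().merge(test_demo.drop_duplicates(), how="inner")

print("rows shared by train and test:", len(overlap))
```

A count of zero would support the claim that the two sets have no rows in common.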
# Train once
Reference: https://scikit-learn.org/stable/modules/tree.html
Train the model on the training data and evaluate it on the test data:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
# predict the class of samples
y_predict = clf.predict(X_test)
#clf.score(X_test, y_test)
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_test, y_predict)
accuracy_score(y_test,y_predict)*100
classification_report(y_test, y_predict)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
from sklearn.metrics import precision_score
precision_score(y_test, y_predict)
from sklearn.metrics import f1_score
f1_score(y_test, y_predict)
【Discussion】:
-
This question really belongs on ai.stackexchange.com
-
Thanks @navule, I appreciate your analysis. This question has now been answered.
Tags: python scikit-learn decision-tree