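1. About this article
The articles in this series are simply the notes I took while studying the book 《python机器学习及实战》. They are intended only as a study reference.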
2. Dataset location
The data comes from the download address given in the book:
https://pan.baidu.com/s/1dENAUTr#list/path=%2F&parentPath=%2FPython%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E5%8F%8A%E5%AE%9E%E8%B7%B5
3. Predicting benign vs. malignant breast cancer tumors
3.1 Load the data
import pandas as pd

# Load the training data
df_train = pd.read_csv(r"../Datasets/Breast-Cancer/breast-cancer-train.csv")
# print(df_train.head(5))  # show the first 5 rows of the training data
"""
   Unnamed: 0  Clump Thickness  Cell Size  Type
0         163                1          1     0
1         286               10         10     1
2         612               10         10     1
3         517                1          1     0
4         464                1          1     0
"""

# Load the test data
df_test = pd.read_csv(r"../Datasets/Breast-Cancer/breast-cancer-test.csv")
# print(df_test.head(5))  # show the first 5 rows of the test data
"""
   Unnamed: 0  Clump Thickness  Cell Size  Type
0         158                1          2     0
1         499                1          1     0
2         396                1          1     0
3         155                5          5     1
4         321                1          1     0
"""

3.2 Select features and build the positive and negative samples from the test set
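The selection in this step filters rows with a boolean mask and then keeps two feature columns. Here is a minimal sketch of that idea on a small made-up frame. The toy values and the names `toy`, `features` and `is_negative` are my own assumptions for illustration, and the one-call form `.loc[mask, columns]` is used here as what I believe is an equivalent way to write the same selection.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the book's dataset).
toy = pd.DataFrame({
    "Clump Thickness": [1, 5, 1, 10, 3],
    "Cell Size": [2, 5, 1, 3, 3],
    "Type": [0, 1, 0, 1, 1],
})

features = ["Clump Thickness", "Cell Size"]

# Boolean mask: True for rows labelled 0 (benign).
is_negative = toy["Type"] == 0

# .loc[rows, columns] picks rows and columns in a single call.
toy_negative = toy.loc[is_negative, features]
toy_positive = toy.loc[~is_negative, features]

print(toy_negative)
print(toy_positive)
```

Note that the original row index is kept after filtering, which is why the printed frames in this section skip some index values.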
# Use 'Clump Thickness' and 'Cell Size' as features and split the test set
# into negative (Type == 0) and positive (Type == 1) samples
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
# print(df_test_negative.head(5))
"""
   Clump Thickness  Cell Size
0                1          2
1                1          1
2                1          1
4                1          1
5                1          1
"""
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]
# print(df_test_positive.head(5))
"""
    Clump Thickness  Cell Size
3                 5          5
7                 6          6
8                 4         10
9                 3          3
11               10          3
"""

3.3 Plot the samples
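import matplotlib.pyplot as plt

# Plot the benign tumor samples as red circles
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
# Plot the malignant tumor samples as black crosses
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
# Label the x axis
plt.xlabel('Clump Thickness')
# Label the y axis
plt.ylabel('Cell Size')
plt.show()

[Figure: test samples, benign as red circles and malignant as black crosses; x axis Clump Thickness, y axis Cell Size]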
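`plt.show()` needs a display, which is not always available (for example on a remote server). As a hedged alternative, the sketch below writes a comparable scatter plot to a PNG file instead. The sample points, the choice of the `Agg` backend and the output file name are assumptions for illustration and are not part of the book's code.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt

# Made-up points standing in for the benign and malignant test samples.
benign_x, benign_y = [1, 1, 2, 3], [1, 2, 1, 1]
malignant_x, malignant_y = [5, 6, 4, 10], [5, 6, 10, 3]

fig, ax = plt.subplots()
ax.scatter(benign_x, benign_y, marker="o", s=200, c="red", label="benign (Type 0)")
ax.scatter(malignant_x, malignant_y, marker="x", s=150, c="black", label="malignant (Type 1)")
ax.set_xlabel("Clump Thickness")
ax.set_ylabel("Cell Size")
ax.legend()

# Hypothetical output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "breast_cancer_scatter_sketch.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)
print("saved to", out_path)
```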
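3.4 Add a random line to the plot above

import numpy as np

# Randomly sample the intercept of the line
intercept = np.random.random([1])
# print(intercept)
"""
[0.61438353]
"""
# Randomly sample the coefficients of the line
coef = np.random.random([2])
# print(coef)
"""
[0.92360088 0.98457231]
"""
# x values
lx = np.arange(0, 12)
# Line equation: lx * coef[0] + ly * coef[1] + intercept = 0, solved for ly
ly = (-intercept - lx * coef[0]) / coef[1]
# Draw the random line generated above
plt.plot(lx, ly, c='yellow')
# Plot the benign tumor samples as red circles
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
# Plot the malignant tumor samples as black crosses
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
# Label the x axis
plt.xlabel('Clump Thickness')
# Label the y axis
plt.ylabel('Cell Size')
# Show the figure
plt.show()

[Figure: the same scatter plot with the randomly sampled line drawn in yellow]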
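A line `coef[0] * x + coef[1] * y + intercept = 0` can act as a classifier: a point is assigned to one class or the other depending on the sign of the left-hand side. The sketch below illustrates this with made-up points. The helper `side_of_line`, the sample points and the hand-picked line are hypothetical. It also shows why a line whose coefficients and intercept are all positive (as sampled by `np.random.random`) separates nothing when every feature value is positive: every point lands on the same side.

```python
import numpy as np

def side_of_line(points, coef, intercept):
    """Return 1 where coef[0]*x + coef[1]*y + intercept > 0, otherwise 0."""
    scores = points @ coef + intercept
    return (scores > 0).astype(int)

# Made-up samples: two "benign" (label 0) and two "malignant" (label 1).
points = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [10.0, 7.0]])
labels = np.array([0, 0, 1, 1])

# A hand-picked line x + y - 10 = 0 that separates the two groups.
good_pred = side_of_line(points, np.array([1.0, 1.0]), -10.0)
good_accuracy = (good_pred == labels).mean()

# A random line with positive coefficients and a positive intercept:
# every sample with positive features gets a positive score.
rng = np.random.default_rng(0)
random_pred = side_of_line(points, rng.random(2), rng.random())
random_accuracy = (random_pred == labels).mean()

print(good_pred, good_accuracy)
print(random_pred, random_accuracy)
```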
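3.5 Learning the model with sklearn's logistic regression classifier
How to use the model: https://blog.csdn.net/xiaoQL520/article/details/80426374
3.5.1 Learn the line's coefficients and intercept from the first 10 training samples and plot the result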
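For quick orientation, here is a minimal sketch of the `LogisticRegression` calls used in the rest of this section (`fit`, `predict`, `predict_proba`, `score`, plus the `coef_` and `intercept_` attributes). The eight training points and the two query points are made up for illustration and have nothing to do with the book's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, clearly separated two-feature samples.
X = np.array([[1, 1], [1, 2], [2, 1], [3, 3],
              [8, 7], [9, 9], [10, 8], [7, 10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# For a binary problem with two features, coef_ has shape (1, 2)
# and intercept_ has shape (1,).
print(clf.coef_.shape, clf.intercept_.shape)

queries = np.array([[2, 2], [9, 8]])
pred = clf.predict(queries)         # predicted class labels
proba = clf.predict_proba(queries)  # one row per query, one column per class
accuracy = clf.score(X, y)          # mean accuracy on the given data

print(pred)
print(proba)
print(accuracy)
```

The `(1, 2)` shape of `coef_` is why the code in this section indexes it as `coef[0][0]` or slices it with `lr.coef_[0, :]`.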
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
# Learn the line's coefficients and intercept from the first 10 training samples
lr.fit(df_train[['Clump Thickness', 'Cell Size']][:10], df_train['Type'][:10])
# print('Testing accuracy (10 training samples):', lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))
"""
Testing accuracy (10 training samples): 0.8685714285714285
"""
# Learned intercept
intercept = lr.intercept_
# print(intercept)
"""
[-1.51522787]
"""
# Learned coefficients (a 2-D array with one row)
coef = lr.coef_
# print(coef)
"""
[[-0.10721332  0.48314152]]
"""
# The decision boundary is lx * coef[0][0] + ly * coef[0][1] + intercept = 0;
# in the 2-D feature plane it is the following line
ly = (-intercept - lx * coef[0][0]) / coef[0][1]
plt.plot(lx, ly, c='green')
# Plot the benign tumor samples as red circles
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
# Plot the malignant tumor samples as black crosses
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
# Label the x axis
plt.xlabel('Clump Thickness')
# Label the y axis
plt.ylabel('Cell Size')
# Show the figure
plt.show()

[Figure: the same scatter plot with the line learned from the first 10 training samples drawn in green]
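To see why this line is the decision boundary, the sketch below plugs the intercept and coefficients printed in 3.5.1 into the line formula and checks that every point on the line has a linear score of 0, which corresponds to a predicted probability of 0.5 under the logistic (sigmoid) function. The `sigmoid` helper is my own addition, and the numbers are copied from the printed output rather than recomputed from the data.

```python
import numpy as np

# Values copied from the printed output of the 10-sample model in 3.5.1.
intercept = np.array([-1.51522787])
coef = np.array([[-0.10721332, 0.48314152]])

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Same construction as in the plotting code: solve the boundary equation for ly.
lx = np.arange(0, 12)
ly = (-intercept - lx * coef[0][0]) / coef[0][1]

# Linear score of each point (lx, ly) on the line; it should be 0 everywhere.
scores = lx * coef[0][0] + ly * coef[0][1] + intercept
probs = sigmoid(scores)

print(ly[:3])
print(scores)
print(probs)
```

Points with a positive score fall on the side predicted as Type 1, and points with a negative score fall on the side predicted as Type 0.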
3.5.2 Learn the line's coefficients and intercept from all training samples and plot the result

lr = LogisticRegression()
# Learn the line's coefficients and intercept from all training samples
lr.fit(df_train[['Clump Thickness', 'Cell Size']], df_train['Type'])
print('Testing accuracy (all training samples):', lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))
"""
Testing accuracy (all training samples): 0.9371428571428572
"""
# Learned intercept
intercept = lr.intercept_
print(intercept)
"""
[-4.67611309]
"""
# Learned coefficients (first row of the 2-D coef_ array)
coef = lr.coef_[0, :]
print(coef)
"""
[0.59071861 0.7498354 ]
"""
# The decision boundary is lx * coef[0] + ly * coef[1] + intercept = 0;
# in the 2-D feature plane it is the following line
ly = (-intercept - lx * coef[0]) / coef[1]
plt.plot(lx, ly, c='blue')
# Plot the benign tumor samples as red circles
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
# Plot the malignant tumor samples as black crosses
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
# Label the x axis
plt.xlabel('Clump Thickness')
# Label the y axis
plt.ylabel('Cell Size')
# Show the figure
plt.show()

[Figure: the same scatter plot with the line learned from all training samples drawn in blue]
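The test accuracy rose from about 0.869 with 10 training samples to about 0.937 with all of them. The sketch below repeats that kind of comparison on synthetic data and also checks that `score` is simply the fraction of correct predictions. The generated data, the helpers `make_blobs_2d` and `balanced_subset`, and the training sizes are assumptions for illustration, so the numbers it prints are not expected to match the book's results, and a larger training set is not guaranteed to score higher on every dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_blobs_2d(n_per_class):
    """Two overlapping Gaussian clusters standing in for the two tumor types."""
    neg = rng.normal(loc=2.0, scale=1.5, size=(n_per_class, 2))
    pos = rng.normal(loc=6.0, scale=2.5, size=(n_per_class, 2))
    X = np.vstack([neg, pos])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

def balanced_subset(X, y, n):
    """Take n // 2 samples of each class so both labels are always present."""
    neg_idx = np.flatnonzero(y == 0)[: n // 2]
    pos_idx = np.flatnonzero(y == 1)[: n // 2]
    idx = np.concatenate([neg_idx, pos_idx])
    return X[idx], y[idx]

X_train, y_train = make_blobs_2d(250)
X_test, y_test = make_blobs_2d(100)

results = {}
manual = {}
for n in (10, 50, len(y_train)):
    X_sub, y_sub = balanced_subset(X_train, y_train, n)
    clf = LogisticRegression()
    clf.fit(X_sub, y_sub)
    results[n] = clf.score(X_test, y_test)
    # score() should equal the fraction of correct predictions.
    manual[n] = float(np.mean(clf.predict(X_test) == y_test))
    print(n, "training samples -> test accuracy", results[n])
```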