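【Posted】: 2017-03-12 19:15:48
【Question】:
I'm new to Python and am trying to perform linear regression with sklearn on a pandas DataFrame. Here is what I did:
First, I labeled the columns of my DataFrame: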
# imports
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
col=['Id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion',
'Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
# read data into a DataFrame
data = pd.read_csv("breast_cancer.txt",header=None, prefix="V")
data.columns = col
d = pd.DataFrame(data,columns=col)
Second, I filled all missing values with the mean of the corresponding feature:
list_of_means = d.mean()
# filling missing values with mean
for i in range(2, 10):
    for j in range(699):
        if d.iloc[j, i] == "?":
            d.iloc[j, i] = round(list_of_means[i], 0)
d['Type'] = 'benign'
# map Type to 0 if class is 2 and 1 if class is 4
d['Type'] = d.Class.map({2:0, 4:1})
X = d[['Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion',
'Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses']]
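The fill step above can also be written without an explicit loop. The following is only a sketch: it assumes, as the loop does, that "?" is the marker for a missing value, and it uses a small hypothetical toy frame in place of breast_cancer.txt. Converting the columns to numeric first turns "?" into NaN, so each column's mean is computed from its own numeric values before filling.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data set.
d = pd.DataFrame({
    "Clump Thickness": [5, 3, 8, 1],
    "Bare Nuclei": ["1", "?", "10", "3"],  # "?" marks a missing value
})
feature_cols = ["Clump Thickness", "Bare Nuclei"]

# Turn every non-numeric entry (here "?") into NaN.
d[feature_cols] = d[feature_cols].apply(pd.to_numeric, errors="coerce")

# Fill each NaN with the rounded mean of its own column.
d[feature_cols] = d[feature_cols].fillna(d[feature_cols].mean().round())

print(d)
```

Because the means are looked up by column name rather than by position, this avoids any mismatch between column positions and the index of the means.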
Third, I created a new column named Type to map class 2 to type 0 and class 4 to type 1.
y=[['Type']]
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X.reshape(X.shape[0], 1), y)
# check the accuracy on the training set
score = model.score(X, y)
#calculate correlation matrix
corMat = DataFrame(data.iloc[:,2:10].corr())
print 'correlation matrix'
print(corMat)
print score
print X.head()
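But I got this error:
Logistic regression ValueError: Found input variables with inconsistent numbers of samples:
After some searching, I found that sklearn expects data of shape (number of rows, number of columns), so I changed the fit call to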
model = model.fit(X.reshape(X.shape[0], 1), y)
as you can see above, but then I got a new error:
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'reshape'
Dataset features:
#    Attribute                     Domain
--   ----------------------------  -------------------------------
1.   Sample code number            id number
2.   Clump Thickness               1 - 10
3.   Uniformity of Cell Size       1 - 10
4.   Uniformity of Cell Shape      1 - 10
5.   Marginal Adhesion             1 - 10
6.   Single Epithelial Cell Size   1 - 10
7.   Bare Nuclei                   1 - 10
8.   Bland Chromatin               1 - 10
9.   Normal Nucleoli               1 - 10
10.  Mitoses                       1 - 10
11.  Class                         (2 for benign, 4 for malignant)
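PS: I've noticed that a lot of beginner questions get downvoted on Stack Overflow. Please keep in mind that something that seems obvious to an expert user can take a beginner days to figure out. Please use discretion before pressing the down arrow; otherwise you will hurt the vitality of this discussion community.
【Comments】: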
-
(1) If Class has values other than 2 and 4, the mapping will produce NaNs, which are best removed; (2) y=[['Type']] --> y=d[['Type']]; (3) you don't need to reshape X.
-
@vpekar Class only has the values 2 and 4. Here is the link to the dataset: archive.ics.uci.edu/ml/machine-learning-databases/…
-
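If you correct the assignment of y (y=d[['Type']]) and fit the model like this, your code works fine: model = LogisticRegression().fit(X, y). Also, you shouldn't evaluate on the training set.
-
@vpekar Thanks for your help, I had missed the assignment of y.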
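The fixes suggested in the comments can be sketched as follows. This is an illustration, not the asker's final code: the DataFrame is hypothetical synthetic data with only two of the nine features, and the held-out split via train_test_split is one possible way to act on the advice not to score on the training set. The original y=[['Type']] is a plain Python list of length 1, which is why sklearn reported inconsistent sample counts against the 699 rows of X. Here y is taken as the 1-D column d["Type"] rather than the 2-D d[['Type']] from the comment; both select the same values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic stand-in for the breast cancer data.
rng = np.random.RandomState(0)
n = 200
d = pd.DataFrame({
    "Clump Thickness": rng.randint(1, 11, size=n),
    "Uniformity of Cell Size": rng.randint(1, 11, size=n),
})
d["Class"] = np.where(
    d["Clump Thickness"] + d["Uniformity of Cell Size"] > 11, 4, 2
)

# Map Class 2 -> 0 and 4 -> 1; drop rows that mapped to NaN, if any.
d["Type"] = d["Class"].map({2: 0, 4: 1})
d = d.dropna(subset=["Type"])

X = d[["Clump Thickness", "Uniformity of Cell Size"]]  # already 2-D, no reshape
y = d["Type"]  # select the column from d, not the literal list [['Type']]

# Hold out a quarter of the rows so the score is not a training-set score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
```

X and y now have the same number of rows, so fit accepts them directly.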
Tags: python pandas scikit-learn