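【Question title】: How to Prepare Data for DecisionTreeClassifier in Scikit
【Posted】: 2015-07-11 13:14:06
【Question】:

I have the following data in a CSV file. The top row holds the column headers, the rows are indexed, and all values have been discretized. I need to build a decision tree classifier model. Can someone guide me?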

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,"(16.927, 41.333]", State-gov,"(10806.885, 504990]", Bachelors,"(12, 16]", Never-married, Adm-clerical, Not-in-family, White, Male,"(0, 5000]",,"(30, 50]", United-States, <=50K
1,"(41.333, 65.667]", Self-emp-not-inc,"(10806.885, 504990]", Bachelors,"(12, 16]", Married-civ-spouse, Exec-managerial, Husband, White, Male,,,"(0, 30]", United-States, <=50K
2,"(16.927, 41.333]", Private,"(10806.885, 504990]", HS-grad,"(8, 12]", Divorced, Handlers-cleaners, Not-in-family, White, Male,,,"(30, 50]", United-States, <=50K
3,"(41.333, 65.667]", Private,"(10806.885, 504990]", 11th,"(-1, 8]", Married-civ-spouse, Handlers-cleaners, Husband, Black, Male,,,"(30, 50]", United-States, <=50K
4,"(16.927, 41.333]", Private,"(10806.885, 504990]", Bachelors,"(12, 16]", Married-civ-spouse, Prof-specialty, Wife, Black, Female,,,"(30, 50]", Cuba, <=50K

My approach so far:

df, filen = decision_tree.readCSVFile("../Data/discretized.csv")
print df[:3]
newdf = decision_tree.catToInt(df)
print newdf[:3]
model = DecisionTreeClassifier(random_state=0)
print cross_val_score(model, newdf, newdf[:,14], cv=10)

The catToInt function:

def catToInt(df):
    mapper={}
    categorical_list = list(df.columns.values)
    newdf = pd.DataFrame(columns=categorical_list)
    #Converting Categorical Data
    for x in categorical_list:
        mapper[x]=preprocessing.LabelEncoder()
    for x in categorical_list:
        someinput = df.__getattr__(x)
        newcol = mapper[x].fit_transform(someinput)
        newdf[x]= newcol
    return newdf

Error:

        print cross_val_score(model, newdf, newdf[:,14], cv=10)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1077, in _get_item_cache
    res = cache.get(item)
TypeError: unhashable type

So I was able to convert the categorical data to integers, but I feel I am missing something in the next step.
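The traceback points at `newdf[:,14]`: that is NumPy-style indexing, but `newdf` is a pandas DataFrame, where `[]` looks up a column label, and a key that contains a slice is not a valid label. A minimal sketch of the two usual alternatives follows; the toy frame and its values are made up for illustration and stand in for the encoded data. It also drops the class column from the features, since passing the full `newdf` as `X` would hand the label to the model.

```python
import pandas as pd

# Toy stand-in for the label-encoded frame (hypothetical values)
newdf = pd.DataFrame({
    "age": [0, 1, 0, 1],
    "sex": [1, 1, 0, 0],
    "class": [0, 0, 1, 1],
})

# Positional selection on a DataFrame goes through .iloc
y_by_position = newdf.iloc[:, 2]

# Selecting by column label is usually clearer
y = newdf["class"]

# Features: every column except the target
X = newdf.drop(columns="class")
```

With `X` and `y` separated like this, the call becomes `cross_val_score(model, X, y, cv=10)`.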

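【Comments】:

  • I assume
  • Oh, really? So when you say numeric values, you mean something like this: I have 4 countries, "USA", "England", "Canada", "India", and I convert them to 1, 2, 3, 4?
  • For scikit-learn, you should encode categorical variables with OneHotEncoder.
  • The comments were helpful, but I am still stuck. Could you take a look at my revised question? @AndreasMueller
  • I think your indexing of newdf is wrong.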
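To make the difference between the two encodings in the comments concrete, here is a small sketch using the four countries from the example above. The array and variable names are illustrative only. Integer codes impose an arbitrary order on the categories, while one-hot encoding gives each category its own 0/1 column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["USA", "England", "Canada", "India", "USA"])

# LabelEncoder numbers the categories in sorted order, starting at 0:
# Canada=0, England=1, India=2, USA=3
codes = LabelEncoder().fit_transform(countries)

# OneHotEncoder expects a 2-D array (rows x features), hence the reshape;
# it returns a sparse matrix by default, so convert it for display
encoder = OneHotEncoder()
onehot = encoder.fit_transform(countries.reshape(-1, 1)).toarray()

print(codes)
print(encoder.categories_[0])
print(onehot)
```

Note that scikit-learn documents `LabelEncoder` as a tool for target labels rather than input features; for features, `OneHotEncoder` (or `OrdinalEncoder` when integer codes are really wanted) is the intended choice.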

Tags: python numpy scikit-learn decision-tree


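【Solution 1】:

This is the solution I arrived at by following the comments above and doing some more searching. It gives the expected result, but I am sure there are more refined ways to do this.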

import os

import pandas as pd
from sklearn import preprocessing
# In scikit-learn older than 0.18 this lived in sklearn.cross_validation
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def main():
    df, _ = readCSVFile("../Data/discretized.csv")
    newdf, classl = catToInt(df)
    model = DecisionTreeClassifier()
    print(cross_val_score(model, newdf, classl, cv=10))


def readCSVFile(filepath):
    # The first CSV column holds the row index
    df = pd.read_csv(filepath, index_col=0)
    # File name without directory or extension, e.g. "discretized"
    prefix = os.path.splitext(os.path.basename(filepath))[0]
    print("csv read and converted to dataframe !!")
    return df, prefix


def catToInt(df):
    # Replace NaN with "NA", which then acts as a category of its own
    df = df.fillna("NA")
    mapper = {}

    # List all column headers, excluding the class column
    categorical_list = list(df.columns.values)
    categorical_list.remove('class')
    newdf = pd.DataFrame(index=df.index)

    # Convert each categorical column to integer labels
    for x in categorical_list:
        mapper[x] = preprocessing.LabelEncoder()
        newdf[x] = mapper[x].fit_transform(df[x])

    # Encode the class column separately
    le = preprocessing.LabelEncoder()
    myclass = le.fit_transform(df['class'])

    # newdf holds every column except the class column; myclass is the encoded class column
    return newdf, myclass


main()

Besides the comments above, a few links that helped me:

  1. http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
  2. http://biggyani.blogspot.com/2014/08/using-onehot-with-categorical.html

Output:

csv read and converted to dataframe !!
[ 0.83418628  0.83930399  0.83172979  0.82804504  0.83930399  0.84254709
  0.82985258  0.83022732  0.82428835  0.83678067]

This may help sklearn newcomers like me. Suggestions, edits, and better answers are welcome.
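One possibly more refined variant, following the OneHotEncoder suggestion from the comments, is to put the encoder and the tree into a single scikit-learn pipeline. This is only a sketch under assumptions: the miniature DataFrame below is invented for illustration (including the `>50K` label, which does not appear in the excerpt above), and it needs a scikit-learn version in which `OneHotEncoder` accepts string columns (0.20 or later).

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature of the discretized data
df = pd.DataFrame({
    "age": ["(16.927, 41.333]", "(41.333, 65.667]"] * 10,
    "sex": ["Male", "Male", "Female", "Female"] * 5,
    "capital-gain": ["(0, 5000]", None] * 10,
    "class": ["<=50K", ">50K"] * 10,
})

# Missing values become their own "NA" category, as in the answer above
X = df.drop(columns="class").fillna("NA")
y = df["class"]

# handle_unknown="ignore" keeps prediction from failing when a test fold
# contains a category that was absent from the training fold
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)

scores = cross_val_score(model, X, y, cv=5)
print(scores)
```

Because the encoder sits inside the pipeline, it is refitted on the training part of every fold, so nothing about the held-out rows leaks into the encoding. The classifier accepts the string class labels directly, so the target needs no separate encoding step.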

【Discussion】:
