【Question title】: How to create features_train for Sklearn from two types of data
【Posted】: 2017-12-13 06:54:07
【Question】:

I have a list of colors:

initialData = [u'black', u'black', u'white', u'powderblue',
               u'whitesmoke', u'black', u'cornflowerblue', u'powderblue', u'powderblue',
               u'goldenrod']

Each color stands for a user who chose that color. I also have a label for each color, which gives that user's gender:

labels_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

0 means the color was chosen by a female, 1 by a male. I need to use this knowledge to predict gender, and this is how I do it:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(initialData)
features_train = le.transform(initialData)
features_train = features_train.reshape(-1, 1)

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)

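But now I want to add more data to my initialData. What if I want to predict gender not only from the color but also from the user's bio? How should I change/normalize my features_train in that case? For example, suppose I have another list like this: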

initialData2 = [u'Hello, my name is Bob and I love to cook' u'happy mother' ...]

And a label for each element:

labels_train2 = [1, 0]

UPD

A sample of my data. It is a CSV, and I use the columns link_color and description.

_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,description,fav_number,gender_gold,link_color,name,profile_yn_gold,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
815719226,FALSE,finalized,3,10/26/15 23:24,male,1,yes,1,12/5/13 1:48,i sing my own rhythm.,0,,08C2C2,sheezy0,,pbs.twimg.com/profile_images/414342229096808449/fYvzqXN7_normal.png,0,FFFFFF,Robbie E Responds To Critics After Win Against Eddie Edwards In The #WorldTitleSeries t.co/NSybBmVjKZ,,110964,10/26/15 12:40,6.5873E+17,main; @Kan1shk3,Chennai
815719227,FALSE,finalized,3,10/26/15 23:30,male,1,yes,1,10/1/12 13:51,I'm the author of novels filled with family drama and romance.,68,,0084B4,DavdBurnett,,pbs.twimg.com/profile_images/539604221532700673/WW16tBbU_normal.jpeg,0,C0DEED,���It felt like they were my friends and I was living the story with them�۝ t.co/arngE0YHNO #retired #IAN1 t.co/CIzCANPQFz,,7471,10/26/15 12:40,6.5873E+17,,Eastern Time (US & Canada)
815719228,FALSE,finalized,3,10/26/15 23:33,male,0.6625,yes,1,11/28/14 11:30,louis whining and squealing and all,7696,,ABB8C2,lwtprettylaugh,,pbs.twimg.com/profile_images/657330418249658368/SBLCXdF7_normal.png,1,C0DEED,i absolutely adore when louis starts the songs it hits me hard but it feels good,,5617,10/26/15 12:40,6.5873E+17,clcncl,Belgrade
815719229,FALSE,finalized,3,10/26/15 23:10,male,1,yes,1,6/11/09 22:39,"Mobile guy.  49ers, Shazam, Google, Kleiner Perkins, Yahoo!, Sprint PCS, AirTouch, Air Force.  Stanford GSB, UVa.  Dad, Husband, Brother.  Golfer.",202,,0084B4,douggarland,,pbs.twimg.com/profile_images/259703936/IMG_8444_normal.JPG,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you use @IFTTT?!  Don't typically see an advanced user on the @PGATOUR! t.co/H68ou5PE9L,,1693,10/26/15 12:40,6.5873E+17,"Palo Alto, CA",Pacific Time (US & Canada)
815719230,FALSE,finalized,3,10/27/15 1:15,female,1,yes,1,4/16/14 13:23,Ricky Wilson The Best FRONTMAN/Kaiser Chiefs The Best BAND Xxxx Thank you Kaiser Chiefs for an incredible year of gigs and memories to cherish always :) Xxxxxxx,37318,,3B94D9,WilfordGemma,,pbs.twimg.com/profile_images/564094871032446976/AOfpk-mr_normal.jpeg,0,0,Watching Neighbours on Sky+ catching up with the Neighbs!! Xxx _���_���_���_���_ُ�_�ԍ_ُ� Xxx,,31462,10/26/15 12:40,6.5873E+17,,

【Question comments】:

Tags: python-2.7 machine-learning scikit-learn


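【Solution 1】:

I can't comment because my reputation is too low. I'd like to understand the structure of your dataset in more detail, so let me try to lay it out as a table. Does your data table look something like this?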

labels_train   color             bio         

    0         'black'            something
    0         'black'            something
    0         'black'            something
    0         'black'            something
    0         'whitesmoke'       something
    1         'black'            something
    1         'cornflowerblue'   something
    1         'powderblue'       something
    1         'powderblue'       something
    1         'powderblue'       something
    1             -             'Hello, my name is Bob and I love to..
    0             -             'happy mother'

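【Comments】:

  • Hi! I know both the color and the bio for every user, but yes, my data looks like your example.
  • Would you mind posting more of the data? I'd like 5 rows. Are there empty columns in the data?
  • There is more you should know. You can't fit a classifier on character data; every feature has to be numeric. You need a more thorough understanding of what you are doing. In this case, color is a category, so you need to create dummy variables for it. The bio is a string of characters, and you can't feed a column of raw text lines into an ML classifier.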
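
One way to act on the last comment is sketched below. It is only an illustration under stated assumptions, not the commenter's own method: the colors are one-hot encoded (the "dummy variables"), the bios are turned into TF-IDF word vectors, and the two blocks are stacked side by side so that each row describes one user. Most of the bios in the toy data are invented for the example, the helper build_features is hypothetical, and MultinomialNB is swapped in for the GaussianNB used in the question because the combined matrix is sparse and non-negative. It assumes a scikit-learn version whose OneHotEncoder accepts string categories (0.20 or later).

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one color and one bio per user.
# Only two of these bios come from the question; the rest are made up.
colors = ['black', 'white', 'powderblue', 'black', 'goldenrod', 'powderblue']
bios = [
    'happy mother',
    'mother of two and proud wife',
    'she loves baking and yoga',
    'Hello, my name is Bob and I love to cook',
    'father, husband, golfer',
    'he writes about football',
]
labels_train = [0, 0, 0, 1, 1, 1]  # 0 = female, 1 = male

# Color is a category: create one dummy (one-hot) column per distinct color.
# handle_unknown='ignore' encodes unseen colors as all zeros at predict time.
color_encoder = OneHotEncoder(handle_unknown='ignore')
color_features = color_encoder.fit_transform([[c] for c in colors])

# Bio is free text: convert each bio into a sparse TF-IDF word vector.
bio_vectorizer = TfidfVectorizer()
bio_features = bio_vectorizer.fit_transform(bios)

# Stack both feature blocks horizontally: still one row per user.
features_train = hstack([color_features, bio_features]).tocsr()

clf = MultinomialNB()
clf.fit(features_train, labels_train)


def build_features(new_colors, new_bios):
    """Hypothetical helper: encode new users with the already fitted encoders."""
    return hstack([
        color_encoder.transform([[c] for c in new_colors]),
        bio_vectorizer.transform(new_bios),
    ]).tocsr()


prediction = clf.predict(build_features(['black'], ['proud father and golfer']))
```

With real data, colors and bios would come from the link_color and description columns of the CSV, and rows with a missing bio would need a placeholder such as an empty string before vectorizing.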