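【Question title】: Taking columns of different type as training dataset
【Posted】: 2018-08-05 22:42:00
【Question description】:

Until now I was using only one column (string data) as my training set. I want to use another corresponding column (the Amount column, of float type) together with the Details column as the training set. In the Amount column, negative values denote debits and positive values denote credits. How should I proceed? I tried appending the two columns together, but then I have to convert the float amounts to strings, which doesn't make any sense for my dataset. I want to include the Amount column to check whether the model can learn the variations, which is very important in this case. Thanks in advance.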

Details                    |Amount               |Category
-------------------------------------------------------------                                
Tanishq Jwellery Bangalore |-990                 |jwellery
ODESK***BAL-28APR13        |240                  |Others
AEGON RELIGARE LIFE IN     |456                  |Others
INTERNET PAYMENT #999999   |-250                 |Transfer in for Card Payment
WWW.VISTAPRINT.IN          |245                  |Print
Khazana Jwellery           |-9000                |jwellery
INTERNET PAYMENT #999999   |785                  |Transfer in for Card Payment
Indian Oil                 |344                  |Fuel
Touch foot wear            |-782                 |Clothing

Part of my script:

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
import time
import matplotlib.pyplot as plt

# TRAIN DATA
data= pd.read_csv('ds1.csv', delimiter=',',usecols=['Details','Amount','Category'],encoding='utf-8')
data=data[data.Category !="Others"]

target_one=data['Category']
target_list=data['Category'].unique()

# TEST DATASET
test_data=pd.read_csv('ds2.csv', delimiter='\t',usecols=['Details','Amount','Category'],encoding='utf-8')

x_train, y_train = (data.Details, data.Category )
x_test, y_test = (test_data.Details, test_data.Category)

vect = CountVectorizer(ngram_range=(1,2))
X_train = vect.fit_transform(x_train)

X_test = vect.transform(x_test)
start = time.perf_counter()  # time.clock() was removed in Python 3.8

mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)

result = mnb.predict(X_test)
print(time.perf_counter() - start)

accuracy_score(y_test, result)

【Question discussion】:

    Tags: python machine-learning scikit-learn data-science text-classification


    【Solution 1】:

    If you just want to stack the Amount column onto the text feature matrix obtained with CountVectorizer, simply do this before fitting MultinomialNB:

    import numpy as np
    
    # Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
    X_amount = data["Amount"].to_numpy().reshape(-1, 1)
    X_train = X_train.toarray()
    X_train = np.hstack((X_train, X_amount))
    X_test_amount = test_data["Amount"].to_numpy().reshape(-1, 1)
    X_test = X_test.toarray()
    X_test = np.hstack((X_test, X_test_amount))
    

    Or, if you want to keep working with the sparse matrix for X_train:

    from scipy import sparse
    
    # Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
    X_amount = data["Amount"].to_numpy().reshape(-1, 1)
    X_train = sparse.hstack((X_train, X_amount), format="csr")
    X_test_amount = test_data["Amount"].to_numpy().reshape(-1, 1)
    X_test = sparse.hstack((X_test, X_test_amount), format="csr")
    

    However, I think you will end up with ValueError: Input X must be non-negative, because MultinomialNB is meant to be used with non-negative feature values...
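
One possible way around the non-negativity requirement, if you want to stay with MultinomialNB, is to re-encode the signed amount as non-negative columns before stacking. The sketch below is only an illustration under assumptions: the helper name amount_features, the choice of a debit flag, a credit flag and a log-scaled magnitude, and the tiny in-memory frame (a few rows copied from the question) are all mine, not part of the original answer. Whether this encoding actually helps accuracy is something you would have to check on your own data.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A few sample rows copied from the question, used only for illustration.
data = pd.DataFrame({
    "Details": ["Tanishq Jwellery Bangalore", "INTERNET PAYMENT #999999",
                "Khazana Jwellery", "Indian Oil", "Touch foot wear"],
    "Amount": [-990.0, -250.0, -9000.0, 344.0, -782.0],
    "Category": ["jwellery", "Transfer in for Card Payment",
                 "jwellery", "Fuel", "Clothing"],
})


def amount_features(amount):
    """Hypothetical helper: map signed amounts to three non-negative columns."""
    a = np.asarray(amount, dtype=float)
    columns = np.column_stack([
        (a < 0).astype(float),  # 1.0 when the row is a debit
        (a > 0).astype(float),  # 1.0 when the row is a credit
        np.log1p(np.abs(a)),    # magnitude, log-scaled so it does not swamp the word counts
    ])
    return sparse.csr_matrix(columns)


vect = CountVectorizer(ngram_range=(1, 2))
X_text = vect.fit_transform(data["Details"])

# Stack text counts and amount columns without leaving the sparse format.
X_train = sparse.hstack([X_text, amount_features(data["Amount"])], format="csr")

mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, data["Category"])
predictions = mnb.predict(X_train)
```

At prediction time the test set would go through vect.transform and the same amount_features call, so that the columns line up with the training matrix.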

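    【Discussion】:

    • Thanks @arthur, but I get the following error: ValueError: setting an array element with a sequence. (when trying to fit the classifier with it)
    • @Vichu Oh yes, sorry, that is because the X_train returned by the vectorizer is in sparse format; see my edit.
    • Thanks a lot @arthur, and sorry for one more question: I get a memory error when converting the CountVectorizer output to an array (X_train.toarray()).
    • @Vichu Then keep working with the sparse matrix (the second option in my answer).
    • Thanks @arthur, I will probably have to try a different classifier, since MNB cannot handle negative values.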
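
For the "try a different classifier" route mentioned in the last comment, here is a minimal sketch assuming scikit-learn's ColumnTransformer and a LogisticRegression, which accepts negative feature values. The classifier, the StandardScaler on Amount, and the small in-memory frames are assumptions chosen for illustration; the thread itself does not settle on a replacement model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Sample rows copied from the question, used only for illustration.
train = pd.DataFrame({
    "Details": ["Tanishq Jwellery Bangalore", "INTERNET PAYMENT #999999",
                "WWW.VISTAPRINT.IN", "Khazana Jwellery",
                "Indian Oil", "Touch foot wear"],
    "Amount": [-990.0, -250.0, 245.0, -9000.0, 344.0, -782.0],
    "Category": ["jwellery", "Transfer in for Card Payment", "Print",
                 "jwellery", "Fuel", "Clothing"],
})

features = ColumnTransformer([
    # A plain string selects a 1-D column, which is what CountVectorizer expects.
    ("text", CountVectorizer(ngram_range=(1, 2)), "Details"),
    # A list selects a 2-D block, which is what StandardScaler expects.
    ("amount", StandardScaler(), ["Amount"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),  # assumed replacement for MultinomialNB
])
model.fit(train[["Details", "Amount"]], train["Category"])

# Made-up unseen rows, just to show the call shape.
test = pd.DataFrame({
    "Details": ["Khazana Jwellery Chennai", "Indian Oil"],
    "Amount": [-1200.0, 500.0],
})
predicted = model.predict(test)
```

Keeping the vectorizer and the scaler inside one pipeline means the test frame is transformed with the vocabulary and scaling learned from the training frame, without stacking matrices by hand.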