[Posted]: 2016-01-15 01:32:55
[Question]:
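I want to classify 100,000 rows into labels (10 classes). The data is extremely imbalanced, though: two classes each account for about 30% of the data, while some classes make up only about 0.01%. I therefore used lr = LogisticRegression(class_weight="auto") instead of lr = LogisticRegression(). After adding class_weight="auto", my precision and recall got worse (from precision 78% / recall 64% to precision 62% / recall 57%). Is this common, or am I doing something wrong?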
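To make the effect of the weighting concrete, here is a minimal sketch of inverse-frequency class weights. It assumes the formula n_samples / (n_classes * count), which is what scikit-learn documents for class_weight="balanced" (the mode that replaced "auto" in later releases); the exact weights produced by "auto" in the version used here may differ. The function name and the toy labels are hypothetical, and the sketch targets Python 3 rather than the Python 2.7 used in the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weights inversely proportional to class frequency.

    Assumed formula: n_samples / (n_classes * count_of_label).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy data (hypothetical): two 30% classes, one large class, one very rare class.
labels = ["a"] * 3000 + ["b"] * 3000 + ["c"] * 3999 + ["rare"] * 1
weights = balanced_class_weights(labels)

for label in sorted(weights):
    print(label, round(weights[label], 4))
```

With these toy counts, each sample of the rare class is weighted 3000 times more heavily than a sample from a 30% class, so the loss pays far more attention to every rare row, which can pull predictions away from the frequent classes.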
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import csv
import dask.dataframe as dd
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import Imputer
lr = LogisticRegression(class_weight="auto")
dv = DictVectorizer()
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
# Get csv file into data frame
data = pd.read_csv("file.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Random sampling a smaller dataframe for debugging
rows = random.sample(df.index, 100000)
df = df.ix[rows] # Warning!!!! overwriting original df
# Assign X and y variables
X = df.raw_name.values
y = df.ethnicity2.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1: # not accept name with only 1 character
            return last_name
        else: return '?'
    except: return '?'
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
all_dict = my_dict
newX = dv.fit_transform(all_dict).toarray()
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
lr.fit(X_train, y_train)
# Making predictions using trained data
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)
print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print (y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
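The split above takes the last half of the sampled rows as the test set, so a class at about 0.01% may land almost entirely in one half. As one possible alternative, the following sketch splits each class separately so that both halves keep roughly the same label proportions. It is an illustration under stated assumptions, not part of the code in the question: the function name, the seed parameter, and the toy labels are hypothetical, and it is written for Python 3.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, seed=0):
    """Split row indices roughly 50/50 within each class.

    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Round up so a class with a single sample ends up in training.
        cut = (len(indices) + 1) // 2
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical) with one small class.
toy_labels = ["a"] * 6 + ["b"] * 4 + ["rare"] * 2
train_rows, test_rows = stratified_half_split(toy_labels)
print(len(train_rows), len(test_rows))
```

The returned index lists could then be used to slice newX and y in place of the half_cut slicing.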
Edit, output:
Frequent label
              w/ auto        w/o auto
Error rate    0.22866        0.186724
Accuracy      0.77134        0.813276
Precision     0.921246774    0.854109238
Recall        0.511857815    0.636206455

Infrequent label
              w/ auto        w/o auto
Error rate    0.098096       0.007652
Accuracy      0.901904       0.992348
Precision     0.995609966    0.992641816
Recall        0.047821338    0.780346821
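Per-label figures like the ones above can be obtained from a one-vs-rest count of true and false positives. The sketch below assumes that each label is scored against all other labels combined; the question does not say how its numbers were actually computed. The function name and the toy predictions are hypothetical, and the code is written for Python 3.

```python
def one_vs_rest_metrics(y_true, y_pred, label):
    """Compute accuracy, error rate, precision and recall for one label vs. the rest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == label and truth == label:
            tp += 1  # correctly predicted as this label
        elif pred == label:
            fp += 1  # predicted as this label, but it is another one
        elif truth == label:
            fn += 1  # this label was missed
        else:
            tn += 1  # correctly not predicted as this label

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy example (hypothetical labels and predictions).
y_true_toy = ["a", "a", "b", "b", "c"]
y_pred_toy = ["a", "b", "b", "b", "c"]
metrics_a = one_vs_rest_metrics(y_true_toy, y_pred_toy, "a")
metrics_b = one_vs_rest_metrics(y_true_toy, y_pred_toy, "b")
print(metrics_a)
print(metrics_b)
```

Because true negatives count toward accuracy, a rare label can show very high accuracy even when its recall is poor, so precision and recall are the more informative columns for the infrequent label.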
[Comments]:
Tags: python-2.7 machine-learning scikit-learn sampling logistic-regression