In Python sklearn, how do I retrieve the names of samples/variables in test/training data?

【Question title】: In Python sklearn, how do I retrieve the names of samples/variables in test/training data?
【Posted】: 2017-10-12 06:55:44
【Question】:

import numpy as np
import pandas as pd
from sklearn import model_selection

# Import the dataset with pandas
df = pd.read_csv(filename)
#### Preparing data for sklearn
# 1) Drop the column holding the name of each sample
df.drop(columns=['id'], inplace=True)
# 2) Separate the features (X: every column except 'class') from the class labels (y)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])
######
# Split data into testing/training datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4)

Question: if I want the names of the samples in the test/training data (after testing), how do I retrieve them?

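【Comments】:

Since the outputs of train_test_split are np.arrays, you could use them as indices into the column that holds the names, provided that column is also converted to an np.array.
Hi, since you dropped the id column containing the names, you will need to reload the csv and use the test/training datasets to retrieve the ids.
Obviously I can do this before dropping the names. IndexError: arrays used as indices must be of integer (or boolean) type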
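
One way to act on these comments, sketched under assumptions: save the `id` column as a NumPy array before dropping it and pass it to `train_test_split` as a third array. The function splits every array it receives with the same shuffle, so the names stay aligned with the rows. Indexing the name array with `X_test` itself would fail, presumably for the reason given in the IndexError quoted above (the feature values are not integers). The toy DataFrame below is made up for illustration and stands in for the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the CSV in the question
df = pd.DataFrame({
    'id': ['s1', 's2', 's3', 's4', 's5'],
    'feature': [0.1, 0.2, 0.3, 0.4, 0.5],
    'class': [0, 1, 0, 1, 0],
})

# Save the sample names before they are dropped
ids = df['id'].to_numpy()
X = df.drop(columns=['id', 'class']).to_numpy()
y = df['class'].to_numpy()

# All arrays passed in one call are split with the same shuffle
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, ids, test_size=0.4, random_state=0
)

print(ids_test)  # names of the samples that ended up in the test set
```

`ids_test[i]` then names the sample in row `i` of `X_test`, without reloading the file.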

Tags: python pandas scipy scikit-learn

【Solution 1】:

If you make id the index of df, the index values are preserved after running train_test_split. First, let's generate some sample data:

import numpy as np
import pandas as pd

N = 10
ids = ['a','b','c','d','e','f','g','h','i','j']
values = np.random.random(N)
classes = np.random.binomial(n=1,p=.5,size=N)
df = pd.DataFrame({'id':ids,'predictor':values,'label':classes})

Then explicitly set id as the index:

df.set_index('id', inplace=True)

Now df looks like this:

    label  predictor
id                  
a       1   0.214636
b       0   0.466477
c       1   0.300480
d       1   0.378645
e       0   0.755834
f       1   0.506719
g       0   0.948360
h       0   0.736498
i       1   0.058591
j       1   0.997003

Splitting Pandas objects into training/test sets preserves their original index values:

X = df.predictor
y = df.label

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

print(X_train)
id
a    0.214636
b    0.466477
d    0.378645
j    0.997003
i    0.058591
f    0.506719
Name: predictor, dtype: float64
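
To carry the names through to the results, a possible follow-up (not part of the original answer) is to fit a model and re-attach `X_test.index` to the predictions. `LogisticRegression` is an arbitrary choice here, and the data is regenerated with a fixed seed and fixed labels so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
ids = list('abcdefghij')
df = pd.DataFrame({
    'id': ids,
    'predictor': rng.random_sample(10),
    'label': [0, 1] * 5,  # fixed labels so both classes are present
}).set_index('id')

X = df[['predictor']]  # 2-D DataFrame, as estimators expect
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)

# predict() returns a plain array; the names come back from X_test.index
predictions = pd.Series(model.predict(X_test), index=X_test.index, name='predicted')
print(predictions)
print(list(X_test.index))
```

Because the ids live in the index rather than in a column, they are not passed to the estimator as a feature, yet they remain available on `X_test` after the split.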

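【Discussion】:

Although this preserves the identifiers for X_train, as far as I can tell you have to remove the id when actually using X_train, and that is the only point at which I need the identifiers. It is not an ideal solution, but I ended up just using an inverted dictionary, similar to what neox suggested in the comments. Sorry! I have edited the question slightly for clarity.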
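
The thread does not show the dictionary approach. One possible reading, with hypothetical data and names: build a dictionary from row position to sample name before dropping `id`, split an array of row positions together with `X` and `y`, and look the names up afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; the question loads this from a CSV
df = pd.DataFrame({
    'id': ['s1', 's2', 's3', 's4', 's5'],
    'feature': [0.1, 0.2, 0.3, 0.4, 0.5],
    'class': [0, 1, 0, 1, 0],
})

# Map row position -> sample name before the id column is dropped
position_to_id = dict(enumerate(df['id']))

X = df.drop(columns=['id', 'class']).to_numpy()
y = df['class'].to_numpy()
positions = np.arange(len(df))

# The positions are shuffled together with X and y
X_train, X_test, y_train, y_test, pos_train, pos_test = train_test_split(
    X, y, positions, test_size=0.4, random_state=0
)

# Look up the name of each test row by its original position
test_names = [position_to_id[p] for p in pos_test]
print(test_names)
```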