Generate vectors of a dataset based on a feature list, in Python

【Question title】: Generate vectors of dataset based on a feature list, in Python
【Posted】: 2021-09-21 02:48:27
【Question】:

I need to generate a vector for each sample in a dataset, based on the dataset's full set of features.

# Assume the dataset has 6 features
features = ['a', 'b', 'c', 'd', 'e', 'f']

# Examples:

s1 = ['a', 'b', 'c']
# For s1, I want to generate a vector to represent features 
r1 = [1, 1, 1, 0, 0, 0]

s2 = ['a', 'c', 'f']
# For s2 then the vector should be
r2 = [1, 0, 1, 0, 0, 1]

Is there a Python library that does this? If not, how should I go about it?

【Comments】:

Search Google for Python libraries, not SO.
I think this would help: pypi.org/project/vector

Tags: python data-processing

【Solution 1】:

This is simple, and not something you need a library for.

Pure Python solution

features = ['a', 'b', 'c', 'd', 'e', 'f']
features_lookup = {feature: i for i, feature in enumerate(features)}  # feature name -> vector index


s1 = ['a', 'b', 'c']
s2 = ['a', 'c', 'f']


def create_feature_vector(sample, lookup):
    vec = [0]*len(lookup)
    for value in sample:
        vec[lookup[value]] = 1
    return vec

Output:

>>> create_feature_vector(s1, features_lookup)
[1, 1, 1, 0, 0, 0]

>>> create_feature_vector(s2, features_lookup)
[1, 0, 1, 0, 0, 1]
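To vectorize a whole dataset rather than a single sample, the same lookup idea can be applied in a loop. The following is only a sketch: `create_feature_matrix` and its `ignore_unknown` flag are hypothetical names that do not appear in the answer, and the choice to raise `ValueError` on a feature that is missing from the lookup is an assumption (the function above would raise `KeyError` in that case).

```python
features = ['a', 'b', 'c', 'd', 'e', 'f']
# Map each feature name to its index in the output vector.
features_lookup = {name: i for i, name in enumerate(features)}


def create_feature_matrix(samples, lookup, ignore_unknown=False):
    """Return one 0/1 row per sample (hypothetical helper, not from the answer)."""
    matrix = []
    for sample in samples:
        row = [0] * len(lookup)
        for value in sample:
            if value not in lookup:
                if ignore_unknown:
                    continue  # assumption: silently skip features we don't know
                raise ValueError(f"unknown feature: {value!r}")
            row[lookup[value]] = 1
        matrix.append(row)
    return matrix


matrix = create_feature_matrix([['a', 'b', 'c'], ['a', 'c', 'f']], features_lookup)
print(matrix)  # [[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 1]]
```

Whether unknown features should be skipped or treated as an error depends on the data; the flag just makes that decision explicit.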

Numpy alternative for a single feature vector

If you happen to be using numpy already, this will be much more efficient when your feature set is large:

import numpy as np


features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
sample_size = 3


def feature_sample_and_vector(sample_size, features):
    n = features.size
    sample_indices = np.random.choice(range(n), sample_size, replace=False)
    sample = features[sample_indices]
    vector = np.zeros(n, dtype="uint8")
    vector[sample_indices] = 1
    return sample, vector
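The function above draws a random sample and builds its vector. If the sample is already given, as with `s1` and `s2` in the question, one possible numpy approach is `np.isin`, which tests each element of `features` for membership in the sample. This is a sketch, and `vector_from_sample` is a hypothetical name that is not part of the answer.

```python
import numpy as np

features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])


def vector_from_sample(sample, features):
    # np.isin returns a boolean array with the shape of `features`,
    # True wherever that feature occurs in `sample`.
    return np.isin(features, sample).astype("uint8")


print(vector_from_sample(['a', 'c', 'f'], features))  # [1 0 1 0 0 1]
```

Note that values in the sample that are not in `features` are ignored rather than reported, which may or may not be what you want.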

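Numpy alternative for a large number of samples and their feature vectors

Using numpy scales well to large feature sets and/or large sample sets. Note that this approach can produce duplicate samples: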

import random
import numpy as np


# Assumes features is already a numpy array
def generate_samples(features, num_samples, sample_size):
    n = features.size
    vectors = np.zeros((num_samples, n), dtype="uint8")
    idxs = [random.sample(range(n), k=sample_size) for _ in range(num_samples)]
    cols = np.sort(np.array(idxs), axis=1)  # You can remove the sort if having the features in order isn't important
    rows = np.repeat(np.arange(num_samples).reshape(-1, 1), sample_size, axis=1)
    vectors[rows, cols] = 1
    samples = features[cols]
    return samples, vectors

Demo:

>>> generate_samples(features, 10, 3)
(array([['d', 'e', 'f'],
        ['a', 'b', 'c'],
        ['c', 'd', 'e'],
        ['c', 'd', 'f'],
        ['a', 'b', 'f'],
        ['a', 'e', 'f'],
        ['c', 'd', 'f'],
        ['b', 'e', 'f'],
        ['b', 'd', 'f'],
        ['a', 'c', 'e']], dtype='<U1'),
 array([[0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 1, 1, 0, 1],
        [1, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 0, 1],
        [0, 1, 0, 0, 1, 1],
        [0, 1, 0, 1, 0, 1],
        [1, 0, 1, 0, 1, 0]], dtype=uint8))

A very simple timing benchmark, drawing 100,000 samples of size 12 from a feature set of 26 features:

In [2]: features = np.array(list("abcdefghijklmnopqrstuvwxyz"))

In [3]: num_samples = 100000

In [4]: sample_size = 12

In [5]: %timeit generate_samples(features, num_samples, sample_size)
645 ms ± 9.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

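The only real bottleneck is the list comprehension needed to generate the indices. Unfortunately, there is no 2-D variant of np.random.choice() for sampling without replacement, so you still have to fall back on a relatively slow method to generate the random sample indices.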
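One commonly used workaround for the index-generation step, which is not part of the answer, is to argsort a matrix of random numbers: each row of the argsort is a random permutation of the column indices, so its first `sample_size` entries form a sample without replacement. The sketch below assumes that trick; `generate_samples_argsort` and its `seed` parameter are hypothetical names. It has not been benchmarked here, and because it sorts all `n` columns of every row it may not be faster when the feature set is large and the sample size is small.

```python
import numpy as np


def generate_samples_argsort(features, num_samples, sample_size, seed=None):
    rng = np.random.default_rng(seed)
    n = features.size
    # Argsorting each row of uniform noise yields a random permutation of 0..n-1;
    # the first sample_size entries are a sample drawn without replacement.
    perms = np.argsort(rng.random((num_samples, n)), axis=1)
    cols = np.sort(perms[:, :sample_size], axis=1)  # keep features in order
    rows = np.arange(num_samples)[:, None]          # broadcasts against cols
    vectors = np.zeros((num_samples, n), dtype="uint8")
    vectors[rows, cols] = 1
    return features[cols], vectors


features = np.array(list("abcdef"))
samples, vectors = generate_samples_argsort(features, 5, 3, seed=0)
print(samples.shape, vectors.shape)  # (5, 3) (5, 6)
```

As with the original function, separate rows are drawn independently, so duplicate samples can still occur.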

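【Comments】:

【Solution 2】:

It may not be the most optimized approach, but if you want to create a vector for every possible sample of the dataset, you can simply create a binary array for each number from 0 up to, but not including, 2⁶ (2 raised to the number of features):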

features = ['a', 'b', 'c', 'd', 'e', 'f']
l = len(features)
vectors = [[int(y) for y in f'{x:0{l}b}'] for x in range(2 ** l)] 

print(vectors)

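【Comments】:

One thing to note here is that this produces the power set of the features, including an empty sample. To get all samples of size n, you would need to do something like [v for v in vectors if sum(v) == n].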
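Instead of building all 2⁶ vectors and filtering them, the size-n subsets can also be enumerated directly with `itertools.combinations`. This is a sketch of that alternative, not something proposed in the thread; `vectors_of_size` is a hypothetical name.

```python
from itertools import combinations

features = ['a', 'b', 'c', 'd', 'e', 'f']


def vectors_of_size(features, n):
    """Yield (sample, vector) for every subset of exactly n features."""
    for idxs in combinations(range(len(features)), n):
        chosen = set(idxs)
        vector = [1 if i in chosen else 0 for i in range(len(features))]
        yield [features[i] for i in idxs], vector


pairs = list(vectors_of_size(features, 3))
print(len(pairs))  # 20, i.e. C(6, 3)
print(pairs[0])    # (['a', 'b', 'c'], [1, 1, 1, 0, 0, 0])
```

This avoids materializing the full power set, which grows as 2 to the number of features.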