【Question title】: MultiLabelBinarizer with duplicated values
【Posted】: 2021-04-23 13:29:14
【Question】:

I have an expected array [1,1,3] and a predicted array [1,2,2,4], and I want to compute precision_recall_fscore_support, so I need a matrix in the following format:

>> mlb = MultiLabelBinarizerWithDuplicates()
>> transformed = mlb.fit_transform([(1, 1, 3), (1, 2, 2, 4)])
array([[1,1,0,0,1,0],
       [1,0,1,1,0,1]])
>> mlb.classes_
[1,1,2,2,3,4]

For duplicated values, I don't care which of the slots is turned on, which means this is also a valid result:

array([[1,1,0,0,1,0],
       [0,1,1,1,0,1]])

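The MultiLabelBinarizer documentation explicitly states "All entries should be unique (cannot contain duplicate classes)", so it does not support this use case.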

【Question comments】:

    Tags: python scikit-learn data-science


    【Solution 1】:

    A working initial implementation:

    import copy
    from collections import Counter, defaultdict
    
    import numpy as np
    
    
    class MultiLabelBinarizerWithDuplicates:
        """
        Similar to https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
        but with added support for duplicated values.
        """
    
        def __init__(self, mapping=None):
            # Maps each label to the column indices reserved for it; rebuilt by fit().
            self.mapping = mapping
    
        def fit(self, y):
            # For each label, track the largest number of times it occurs in any single row.
            max_counts = Counter()
            for labels in y:
                max_counts |= Counter(labels)  # Counter union keeps the per-label maximum
    
            # One column per occurrence, e.g. {1: 2, 3: 1} -> [1, 1, 3].
            self.classes_ = sorted(max_counts.elements())
            self.mapping = defaultdict(list)
            for idx, class_ in enumerate(self.classes_):
                self.mapping[class_].append(idx)
    
            return self
    
        def transform(self, y):
            result_matrix = []
            for labels in y:
                # Work on a copy so that consuming columns for one row does not affect the next.
                mapping_copy = copy.deepcopy(self.mapping)
                data = [0] * len(self.classes_)
                for label in labels:
                    # Labels unseen during fit, or repeated more often than in any fitted row, are ignored.
                    if label in mapping_copy and len(mapping_copy[label]) > 0:
                        # pop() takes the highest free column, so duplicates fill from the right.
                        relevant_idx = mapping_copy[label].pop()
                        data[relevant_idx] = 1
                result_matrix.append(data)
            return np.array(result_matrix)
    
        def fit_transform(self, y):
            return self.fit(y).transform(y)
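
The same idea can also be sketched without a class and without NumPy. This is only an illustrative variant, not part of the answer above: `binarize_with_duplicates` is a hypothetical helper name, it assumes labels are hashable and sortable, and it fits and transforms the same rows in one pass (so it makes no attempt to handle labels unseen during fitting). Unlike the class, it turns on the leftmost slots of a label first, which the question treats as equally valid.

```python
from collections import Counter
from functools import reduce
from operator import or_


def binarize_with_duplicates(rows):
    """Return (classes, matrix); each label gets as many columns as its max count in any row."""
    counts = [Counter(row) for row in rows]
    # Counter union (|) keeps the maximum count seen for each label.
    max_counts = reduce(or_, counts, Counter())
    classes = sorted(max_counts.elements())

    # First column index of every distinct label in the sorted class list.
    first_col = {}
    for col, label in enumerate(classes):
        first_col.setdefault(label, col)

    matrix = []
    for row_count in counts:
        bits = [0] * len(classes)
        for label, n in row_count.items():
            start = first_col[label]
            # Turn on the n leftmost slots reserved for this label.
            bits[start:start + n] = [1] * n
        matrix.append(bits)
    return classes, matrix


classes, matrix = binarize_with_duplicates([(1, 1, 3), (1, 2, 2, 4)])
print(classes)  # [1, 1, 2, 2, 3, 4]
print(matrix)   # [[1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 1]]
```

Because each row's count for a label can never exceed the maximum over all rows, the slice assignment always stays inside the columns reserved for that label.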
    
    

    Usage:

    >> mlb = MultiLabelBinarizerWithDuplicates()
    >> transformed = mlb.fit_transform([(1, 1, 3), (1, 2, 2, 4)])
    >> transformed
    array([[1, 1, 0, 0, 1, 0],
           [0, 1, 1, 1, 0, 1]])
    >> mlb.classes_
    [1, 1, 2, 2, 3, 4]
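
To see why it does not matter which duplicate slot is turned on, here is a small hand-rolled sketch that scores the expected row against both valid encodings of the predicted row. `prf_from_rows` is a hypothetical helper written for this illustration; it is not scikit-learn's `precision_recall_fscore_support`, only a plain computation of the same three quantities for a single pair of 0/1 rows.

```python
def prf_from_rows(expected_row, predicted_row):
    """Precision, recall and F1 for one pair of 0/1 indicator rows."""
    # True positives: columns that are on in both rows.
    tp = sum(1 for e, p in zip(expected_row, predicted_row) if e and p)
    n_true = sum(expected_row)
    n_pred = sum(predicted_row)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


expected = [1, 1, 0, 0, 1, 0]      # (1, 1, 3)
predicted_a = [1, 0, 1, 1, 0, 1]   # (1, 2, 2, 4), left slot of label 1 on
predicted_b = [0, 1, 1, 1, 0, 1]   # same labels, right slot of label 1 on

scores_a = prf_from_rows(expected, predicted_a)
scores_b = prf_from_rows(expected, predicted_b)
print([round(s, 3) for s in scores_a])  # [0.25, 0.333, 0.286]
print([round(s, 3) for s in scores_b])  # [0.25, 0.333, 0.286]
```

Both encodings share exactly one "on" column with the expected row, so precision (1/4), recall (1/3) and F1 (2/7) come out the same either way.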
    

    【Comments】:
