【Question title】: ML Decision Tree classifier is only splitting on the same tree / asking about the same attribute
【Posted】: 2023-03-28 20:26:01
【Question】:

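I am currently building a decision tree classifier using Gini and Information Gain, splitting the tree at each step on whichever attribute gives the largest gain. However, it sticks to the same attribute every time and only adjusts the value in its question. This results in very low accuracy, usually around 30%, because it only ever considers the first attribute.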
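The helpers `gini_index` and `gain` are called in the code but are not shown in the post. For readers following along, here is a minimal sketch of what they might look like, assuming rows are lists and the class label sits in column 3 (as the `c == 3` check suggests). `LABEL_COL` and both function bodies are assumptions for illustration, not the poster's code.

```python
from collections import Counter

LABEL_COL = 3  # assumption: label column index, matching the `c == 3` check in the post


def gini_index(rows):
    """Gini impurity of the labels in `rows`: 1 - sum(p_k ** 2)."""
    if not rows:
        return 0.0
    counts = Counter(row[LABEL_COL] for row in rows)
    total = len(rows)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def gain(true_rows, false_rows, parent_gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    total = len(true_rows) + len(false_rows)
    if total == 0:
        return 0.0
    p_true = len(true_rows) / total
    weighted = (p_true * gini_index(true_rows)
                + (1 - p_true) * gini_index(false_rows))
    return parent_gini - weighted
```

With this definition a split that leaves one side empty has a gain of 0, which is what lets a recursive builder stop when `ig == 0`.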

Finding the best split

# Used to find the best split for data among all attributes

def split(r):
    max_ig = 0
    max_att = 0
    max_att_val = 0
    i = 0

    curr_gini = gini_index(r)
    n_att = len(att)

    for c in range(n_att):
        if c == 3:
            continue

        c_vals = get_column(r, c)

        while i < len(c_vals):
            # Value of the current attribute that is being tested
            curr_att_val = r[i][c]
            true, false = fork(r, c, curr_att_val)
            ig = gain(true, false, curr_gini)

            if ig > max_ig:
                max_ig = ig
                max_att = c
                max_att_val = r[i][c]
            i += 1

    return max_ig, max_att, max_att_val

Comparing rows and splitting the data into true/false groups

# Used to compare and test if the current row is greater than or equal to the test value
# in order to split up the data

def compare(r, test_c, test_val):
    if r[test_c].isdigit():
        return r[test_c] == test_val

    elif float(r[test_c]) >= float(test_val):
        return True

    else:
        return False


# Splits the data into two lists for the true/false results of the compare test

def fork(r, c, test_val):
    true = []
    false = []

    for row in r:

        if compare(row, c, test_val):
            true.append(row)
        else:
            false.append(row)

    return true, false

Traversing the tree

def rec_tree(r):
    ig, att, curr_att_val = split(r)

    # No split improves purity, so stop and make a leaf
    if ig == 0:
        return Leaf(r)

    true_rows, false_rows = fork(r, att, curr_att_val)

    true_branch = rec_tree(true_rows)
    false_branch = rec_tree(false_rows)

    return Node(att, curr_att_val, true_branch, false_branch)
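`Leaf` and `Node` are used above but not defined in the post. The following is a rough sketch of how they could be structured and how a finished tree might be walked to classify a row. The class bodies, `LABEL_COL`, and `classify` are hypothetical, and the numeric test is a simplification of the post's `compare`.

```python
from collections import Counter

LABEL_COL = 3  # assumption: label column index, matching the `c == 3` check in the post


class Leaf:
    """Terminal node: counts the labels of the training rows that reached it."""

    def __init__(self, rows):
        self.counts = Counter(row[LABEL_COL] for row in rows)

    def predict(self):
        # Majority label among the rows at this leaf
        return self.counts.most_common(1)[0][0]


class Node:
    """Internal node: a question (attribute index and test value) plus two subtrees."""

    def __init__(self, att, att_val, true_branch, false_branch):
        self.att = att
        self.att_val = att_val
        self.true_branch = true_branch
        self.false_branch = false_branch


def classify(row, tree):
    """Walk from the root to a leaf, answering each node's question for `row`."""
    while isinstance(tree, Node):
        # Simplified numeric test; the post's `compare` also has an equality case
        if float(row[tree.att]) >= float(tree.att_val):
            tree = tree.true_branch
        else:
            tree = tree.false_branch
    return tree.predict()
```

Printing `att` for every `Node` in the built tree is a quick way to confirm whether the tree ever asks about more than one attribute.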

【Comments】:

    Tags: python machine-learning classification decision-tree c4.5


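    【Solution 1】:

    My working solution was to change the split function as follows. Honestly, I can't see what was wrong, but it is probably obvious. The working function is below.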

    def split(r):
        max_ig = 0
        max_att = 0
        max_att_val = 0

        # calculates gini for the rows provided
        curr_gini = gini_index(r)
        no_att = len(r[0])

        # Goes through the different attributes
        for c in range(no_att):

            # Skip the label column (beer style)
            if c == 3:
                continue
            column_vals = get_column(r, c)

            i = 0
            while i < len(column_vals):
                # value we want to check
                att_val = r[i][c]

                # Use the attribute value to fork the data to true and false streams
                true, false = fork(r, c, att_val)

                # Calculate the information gain
                ig = gain(true, false, curr_gini)

                # If this gain is the highest found then mark this as the best choice
                if ig > max_ig:
                    max_ig = ig
                    max_att = c
                    max_att_val = r[i][c]
                i += 1

        return max_ig, max_att, max_att_val
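Comparing the two versions of `split`, the change that matters appears to be where `i = 0` sits. In the question it is set once before the `for` loop, so after the first attribute has been scanned `i` already equals the number of rows and the `while` loop never runs again for the remaining attributes. In the working version `i` is reset for each attribute. (The working version also uses `len(r[0])` instead of `len(att)`; whether that matters depends on the contents of `att`, which the post does not show.) The sketch below isolates the counter behaviour; `scan_columns` and `candidate_splits` are hypothetical names used only for this illustration.

```python
def scan_columns(rows, reset_counter):
    """Return the (column, row index) pairs the nested loops actually visit."""
    visited = []
    i = 0  # initialised once, as in the question's version
    for c in range(len(rows[0])):
        if reset_counter:
            i = 0  # reset per column, as in the working version
        while i < len(rows):
            visited.append((c, i))
            i += 1
    return visited


def candidate_splits(rows, label_col):
    """Counter-free alternative: yield every (column, value) pair to test."""
    for c in range(len(rows[0])):
        if c == label_col:
            continue
        for row in rows:
            yield c, row[c]


rows = [["5.1", "12", "0.3"], ["4.8", "30", "0.9"]]
print(scan_columns(rows, reset_counter=False))  # [(0, 0), (0, 1)]
print(scan_columns(rows, reset_counter=True))   # all three columns are visited
```

Iterating with `for row in rows`, as in `candidate_splits`, removes the manual counter entirely, so this kind of stale-index bug cannot occur.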
    

    【Comments】:
