【Question title】: How to properly use K-Nearest-Neighbour?
【Posted】: 2017-02-10 17:25:20
【Question】:

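I generated some data in R and applied a Bayes classifier to the points. Each point is classified as either "orange" or "blue". I can't get accurate results from the knn function, and I think it is because the classes ("blue", "orange") are not linked to knn correctly.

My training data is in a data frame (x, y). My classes are in a separate array. I did it this way for the Bayes classifier because it was easier to plot. However, now I don't know how to "plug" my classes into knn. The following code gives very inaccurate results. I have tried many different values of k, and all of them are inaccurate.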

library(class)

x <- round(runif(100, 1, 100))
y <- round(runif(100, 1, 100))
train.df <- data.frame(x, y)

x.test <- round(runif(100, 1, 100))
y.test <- round(runif(100, 1, 100))
test.df <- data.frame(x.test, y.test)

cl <- factor(c(rep("blue", 50), rep("orange", 50)))

k <- knn(train.df, test.df, cl, k=100)

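Once again, my assigned classes are in the array classes, defined further up in the code. Here is my full file; the code above sits at the very bottom.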

library(class)

n <- 100
x <- round(runif(n, 1, n))
y <- round(runif(n, 1, n))

# ============================================================
# Bayes Classifier + Decision Boundary Code
# ============================================================

classes <- "null"
colours <- "null"

for (i in 1:n)
{

    # P(C = j | X = x, Y = y) = prob
    # "The probability that the class (C) is orange (j) when X is some x, and Y is some y"
    # Two predictors that influence classification: x, y
    # If x and y are both under 50, there is a 95% chance of being orange (grouping)
    # If either x or y is 50 or more, grouping is blue (10% chance of orange)
    # Algorithm favours whichever grouping has a higher chance of success, then plots using that colour
    # When prob (from above) is 50%, the boundary is drawn

    percentChance <- 0
    if (x[i] < 50 && y[i] < 50)
    {
        # 95% chance of orange and 5% chance of blue
        # Bayes Decision Boundary therefore assigns to orange when x < 50 and y < 50
        # "colours" is the Decision Boundary grouping, not the plotted grouping
        percentChance <- 95
        colours[i] <- "orange"
    }
    else
    {
        percentChance <- 10
        colours[i] <- "blue"
    }

    if (round(runif(1, 1, 100)) > percentChance)
    {
        classes[i] <- "blue"
    }
    else
    {
        classes[i] <- "orange"
    }
}

boundary.x <- seq(0, 100, by=1)
boundary.y <- 0
for (i in 1:101)
{
    if (i > 49)
    {
        boundary.y[i] <- -10 # just for the sake of visual consistency, real value is 0
    }
    else
    {
        boundary.y[i] <- 50
    }
}
df <- data.frame(boundary.x, boundary.y)

plot(x, y, col=classes)
lines(df, type="l", lty=2, lwd=2, col="red")

# ============================================================
# K-Nearest neighbour code
# ============================================================

#library(class)

#x <- round(runif(100, 1, 100))
#y <- round(runif(100, 1, 100))
train.df <- data.frame(x, y)

x.test <- round(runif(n, 1, n))
y.test <- round(runif(n, 1, n))
test.df <- data.frame(x.test, y.test)

cl <- factor(c(rep("blue", 50), rep("orange", 50)))

k <- knn(train.df, test.df, cl, k=(round(sqrt(n))))

Thanks for your help.

【Question comments】:

    Tags: r machine-learning statistics classification nearest-neighbor


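    【Solution 1】:

    First, for reproducibility, you should set a seed before generating any set of random numbers (as runif does) or before running any stochastic simulation or ML algorithm. Note that in the code below we set the same seed everywhere x is generated, and a different seed everywhere y is generated. That way the pseudo-randomly generated x is always the same (but different from y), and likewise for y. One side effect: because x.test and y.test reuse those same seeds, they come out identical to the training x and y; use different seeds if you want a separate test set.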
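A minimal sketch of what re-seeding does, in case it helps. The variable names a, b and x.new are illustrative and not part of the original code.

```r
# Re-seeding with the same value replays the same pseudo-random stream.
set.seed(1)
a <- round(runif(5, 1, 100))
set.seed(1)
b <- round(runif(5, 1, 100))
identical(a, b)  # the two draws are the same vector

# A different seed starts a different stream, which is what a
# separate test set would need.
set.seed(3)
x.new <- round(runif(5, 1, 100))
```

Seeding once at the top of a script is usually enough for reproducibility; re-seeding before every call, as the code below does, additionally forces each call to repeat the same draws.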

    library(class)
    
    n <- 100
    set.seed(1)
    x <- round(runif(n, 1, n))
    set.seed(2)
    y <- round(runif(n, 1, n))
    
    # ============================================================
    # Bayes Classifier + Decision Boundary Code
    # ============================================================
    
    classes <- "null"
    colours <- "null"
    
    for (i in 1:n)
    {
    
        # P(C = j | X = x, Y = y) = prob
        # "The probability that the class (C) is orange (j) when X is some x, and Y is some y"
        # Two predictors that influence classification: x, y
        # If x and y are both under 50, there is a 95% chance of being orange (grouping)
        # If either x or y is 50 or more, grouping is blue (10% chance of orange)
        # Algorithm favours whichever grouping has a higher chance of success, then plots using that colour
        # When prob (from above) is 50%, the boundary is drawn
    
        percentChance <- 0
        if (x[i] < 50 && y[i] < 50)
        {
            # 95% chance of orange and 5% chance of blue
            # Bayes Decision Boundary therefore assigns to orange when x < 50 and y < 50
            # "colours" is the Decision Boundary grouping, not the plotted grouping
            percentChance <- 95
            colours[i] <- "orange"
        }
        else
        {
            percentChance <- 10
            colours[i] <- "blue"
        }
    
        if (round(runif(1, 1, 100)) > percentChance)
        {
            classes[i] <- "blue"
        }
        else
        {
            classes[i] <- "orange"
        }
    }
    
    boundary.x <- seq(0, 100, by=1)
    boundary.y <- 0
    for (i in 1:101)
    {
        if (i > 49)
        {
            boundary.y[i] <- -10 # just for the sake of visual consistency, real value is 0
        }
        else
        {
            boundary.y[i] <- 50
        }
    }
    df <- data.frame(boundary.x, boundary.y)
    
    plot(x, y, col=classes)
    lines(df, type="l", lty=2, lwd=2, col="red")
    
    # ============================================================
    # K-Nearest neighbour code
    # ============================================================
    
    #library(class)
    set.seed(1)
    x <- round(runif(n, 1, n))
    
    set.seed(2)
    y <- round(runif(n, 1, n))
    train.df <- data.frame(x, y)
    
    set.seed(1)
    x.test <- round(runif(n, 1, n))
    set.seed(2)
    y.test <- round(runif(n, 1, n))
    test.df <- data.frame(x.test, y.test)
    
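    I think the main problem is here. You presumably want to pass knn the class labels you obtained from the Bayes classifier, i.e. the vector classes. Instead you are passing cl, whose labels are assigned purely by row order (first 50 "blue", last 50 "orange") and so say nothing about the points themselves. The cl argument of knn has to hold the true class of each row of the training set.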
    #cl <- factor(c(rep("blue", 50), rep("orange", 50)))
    
    k <- knn(train.df, test.df, classes, k=25)
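    # knn returns a factor; convert it to character so the level names are used as colours
    plot(test.df$x.test, test.df$y.test, col=as.character(k))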
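To make the effect of the labels concrete, here is a small self-contained sketch that compares labels aligned with the training rows against labels assigned by row order. It assumes a simplified, vectorised version of the labelling rule from the question (95% orange in the lower-left quadrant, 10% elsewhere) and a plain first-half/second-half split; names such as p.orange, train.idx, cl.bad, pred.good and pred.bad are made up for this example.

```r
library(class)  # recommended package, shipped with standard R installations

set.seed(1)
n <- 200
x <- runif(n, 1, 100)
y <- runif(n, 1, 100)

# Assumed labelling rule: orange with probability 0.95 when both
# coordinates are under 50, and with probability 0.10 otherwise.
p.orange <- ifelse(x < 50 & y < 50, 0.95, 0.10)
classes <- factor(ifelse(runif(n) < p.orange, "orange", "blue"),
                  levels = c("blue", "orange"))

# Use the first 100 rows for training and the rest for testing,
# so the test points are not the training points.
train.idx <- 1:100
train.df <- data.frame(x = x[train.idx], y = y[train.idx])
test.df  <- data.frame(x = x[-train.idx], y = y[-train.idx])

# Labels that belong to the training rows.
pred.good <- knn(train.df, test.df, cl = classes[train.idx], k = 10)

# Labels assigned by row order only, as in the question.
cl.bad <- factor(c(rep("blue", 50), rep("orange", 50)))
pred.bad <- knn(train.df, test.df, cl = cl.bad, k = 10)

# Share of test points whose prediction matches the generated label.
truth <- as.character(classes[-train.idx])
acc.good <- mean(as.character(pred.good) == truth)
acc.bad  <- mean(as.character(pred.bad) == truth)
c(aligned = acc.good, positional = acc.bad)
```

With aligned labels you would typically expect the accuracy to be clearly higher than with the positional ones, although the exact numbers depend on the seed and on k.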
    

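    【Comments】:

    • Hey, thanks! You're an all-star. So, if I wanted to draw a knn boundary line, I would just loop through my test points and classes, find the outermost points that belong to "orange", and then draw a line around them?
    • @KingDan, not sure, but you can try. Meanwhile, if the answer addresses the original question, you can accept it ;-), cheers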
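On the boundary question raised in the comments: instead of tracing the outermost points by hand, one common approach is to classify every point of a regular grid and draw the contour where the prediction changes. The sketch below is only an illustration under assumptions: it regenerates data with a simplified version of the question's labelling rule, and gx, gy, grid, grid.pred and z are names invented here.

```r
library(class)

set.seed(1)
n <- 100
x <- round(runif(n, 1, 100))
y <- round(runif(n, 1, 100))

# Assumed labelling rule, as in the question: 95% orange in the
# lower-left quadrant, 10% orange elsewhere.
p.orange <- ifelse(x < 50 & y < 50, 0.95, 0.10)
classes <- factor(ifelse(runif(n) < p.orange, "orange", "blue"),
                  levels = c("blue", "orange"))
train.df <- data.frame(x, y)

# Classify every point of a regular grid covering the plot area.
gx <- seq(0, 100, by = 2)
gy <- seq(0, 100, by = 2)
grid <- expand.grid(x = gx, y = gy)
grid.pred <- knn(train.df, grid, cl = classes, k = 15)

# expand.grid varies x fastest, so filling a matrix column by column
# gives rows indexed by gx and columns indexed by gy.
z <- matrix(as.integer(grid.pred == "orange"),
            nrow = length(gx), ncol = length(gy))

plot(x, y, col = as.character(classes))
# The 0.5 contour separates grid cells predicted orange (1) from blue (0).
contour(gx, gy, z, levels = 0.5, add = TRUE, drawlabels = FALSE, lty = 2)
```

A finer grid (smaller by) gives a smoother line at the cost of more predictions, and a larger k generally smooths the boundary itself.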