【问题标题】:Visual Comparison of Regression & PCA回归和 PCA 的视觉比较
【发布时间】:2011-12-10 14:46:01
【问题描述】:

我正在尝试完善一种比较回归和 PCA 的方法,受博客 Cerebral Mastication 的启发,SO 上也从不同角度进行了讨论。在我忘记之前,非常感谢 JD Long 和 Josh Ulrich 的大部分核心内容。我将在下学期的课程中使用它。对不起,这很长!

更新:我发现了一种几乎可以工作的不同方法(如果可以,请修复它!)。我把它贴在了底部。比我想出的更聪明、更短的方法!

我基本上在一定程度上遵循了之前的方案:生成随机数据,找出最佳拟合线,绘制残差。这显示在下面的第二个代码块中。但我也挖掘并编写了一些函数来绘制垂直于通过随机点(本例中的数据点)的线的线。我认为这些工作正常,并且它们显示在 First Code Chunk 中以及它们的工作证明。

现在,第二个代码块使用与 @JDLong 相同的流程显示整个操作,我正在添加结果图的图像。黑色数据,红色是带有残差的回归粉红色,蓝色是第一个 PC,浅蓝色应该是法线,但显然不是。 First Code Chunk 中绘制这些法线的函数看起来不错,但演示中有些地方不对:我想我一定是误解了某些东西或传递了错误的值。我的法线是水平的,这似乎是一个有用的线索(但到目前为止,对我来说不是)。谁能看出这里出了什么问题?

谢谢,这件事困扰了我一段时间...

第一个代码块(绘制法线并证明它们有效的函数):

##### The functions below are based very loosely on the citation at the end

pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.

    Ax <- Px-10*diff(range(Px))
    Bx <- Px+10*diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
    }

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {

    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan

    PB <- data.frame(x = Px - Bx, y = Py - By)
    AB <- data.frame(x = Ax - Bx, y = Ay - By)
    PB <- as.matrix(PB)
    AB <- as.matrix(AB)
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
        if (k_raw[n] < 0)  { k[n] <- 0
            } else { if (k_raw[n] > 1) k[n] <- 1
                else k[n] <- k_raw[n] }
        }
    x = (k * Ax + (1 - k)* Bx)
    y = (k * Ay + (1 - k)* By)
    ans <- data.frame(x, y)
    ans
    }

# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors

par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles

m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )

Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)

res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")

##========================================================
##
##  Credits:
##  Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
##  Based in part on C code by Damian Coventry Tuesday, 16 July 2002
##  Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
##  With grateful thanks for answering our needs!
##  This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================

第二个代码块(绘制演示):

set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e

par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted

plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")

# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors

# Add the first PC by denormalizing back to original coords:

new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)

# Now add the normals

yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
############ 更新

Vincent Zoonekynd's Page 我几乎完全找到了我想要的东西。但是,它并不完全有效(显然曾经有效)。这是该站点的代码摘录,它绘制了通过垂直轴反射的第一台 PC 的法线:

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')

r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')

u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")

结果如下:

【问题讨论】:

    标签: r regression linear-regression pca


    【解决方案1】:

    好的,我必须回答我自己的问题!经过进一步阅读和比较人们在互联网上的方法,我已经解决了这个问题。我不确定我是否可以清楚地说明我“修复”了什么,因为我经历了很多次迭代。无论如何,这是情节和代码(MWE)。为了清楚起见,帮助函数位于末尾。

    # Comparison of Linear Regression & PCA
    # Generate sample data
    
    set.seed(39) # gives a decent-looking example
    np <- 10 # number of data points
    x <- -np:np
    e <- rnorm(length(x), 0, 10)
    y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e
    
    # Plot the main data & residuals
    
    plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
    yx.lm <- lm(y ~ x)
    lines(x, predict(yx.lm), col = "red", lwd = 2)
    segments(x, y, x, fitted(yx.lm), col = "pink")
    
    # Now the PCA using built-in functions
    # rotation = loadings = eigenvectors
    
    r <- prcomp(cbind(x,y), retx = TRUE)
    b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
    a <- r$center[2] - b * r$center[1]
    abline(a, b, col = "blue") # Plot 1st PC
    
    # Plot normals to 1st PC
    
    X <- pointOnLineNearPoint(x, y, b, a)
    segments( x, y, X[,1], X[,2], col = "lightblue1")
    
    ###### Needed Functions
    
    pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
        # Px, Py is the point to test, can be a vector.
        # slope, intercept is the line to check distance.
    
        Ax <- Px-10*diff(range(Px))
        Bx <- Px+10*diff(range(Px))
        Ay <- Ax * slope + intercept
        By <- Bx * slope + intercept
        pointOnLine(Px, Py, Ax, Ay, Bx, By)
        }
    
    pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
    
        # This approach based upon comingstorm's answer on
        # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
        # Vectorized by Bryan
    
        PB <- data.frame(x = Px - Bx, y = Py - By)
        AB <- data.frame(x = Ax - Bx, y = Ay - By)
        PB <- as.matrix(PB)
        AB <- as.matrix(AB)
        k_raw <- k <- c()
        for (n in 1:nrow(PB)) {
            k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
            if (k_raw[n] < 0)  { k[n] <- 0
                } else { if (k_raw[n] > 1) k[n] <- 1
                    else k[n] <- k_raw[n] }
            }
        x = (k * Ax + (1 - k)* Bx)
        y = (k * Ay + (1 - k)* By)
        ans <- data.frame(x, y)
        ans
        }
    

    【讨论】:

      【解决方案2】:

      尝试更改这行代码:

      res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
      

      res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])
      

      所以你调用了正确的 y 值。

      【讨论】:

      • 啊,我可能不太清楚。浅蓝色线应垂直(法线)于蓝色线,并从原始数据(黑色空心圆圈)开始。谢谢。
      【解决方案3】:

      Vincent Zoonekynd's code 中,将u &lt;- r$loadings 行更改为u &lt;- solve(r$loadings)。在solve() 的第二个实例中,沿第一个主轴的预测组件得分(即,第二个预测组件得分设置为零的预测得分矩阵)需要乘以 inverse载荷/特征向量。将数据乘以载荷得出预测分数;将预测分数除以负载给出数据。希望对您有所帮助。

      【讨论】:

      • 有趣。谢谢你。我确信文森特的代码曾经可以工作。我想知道问题是如何出现的。他一定发布了代码草案,但最终代码中的数字。
      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 2018-08-10
      • 2020-09-18
      • 2013-08-20
      • 2017-07-11
      • 2018-06-17
      • 2018-08-09
      • 1970-01-01
      相关资源
      最近更新 更多