如何使用 FactoMineR 包以编程方式确定主成分的列索引？答案

【问题标题】：How to programmatically determine the column indices of principal components using FactoMineR package?如何使用 FactoMineR 包以编程方式确定主成分的列索引？
【发布时间】：2018-12-25 01:14:20
【问题描述】：

给定一个包含混合变量（即分类变量和连续变量）的数据框，例如，

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
                 )

我使用包FactoMineR 执行无监督特征选择

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

变量df.princomp 是一个列表。

此后，为了可视化我使用的主要组件 fviz_screeplot() 和 fviz_contrib() 喜欢，

#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = 1, top = 10, sort.val = c("desc"))

它给出了以下 Fig1

和图2

图1说明：图1是碎石图。碎石图是一个简单的线段图，它显示了数据中总方差的分数，如每个主成分 (PC) 所解释或表示的。所以我们可以看到前三台 PC 共同负责总方差的43.8%。现在自然会出现一个问题，“这些变量是什么？”。我在图 2 中显示了这一点。

图2说明：该图可视化了主成分分析（PCA）结果中行/列的贡献。从这里我可以看到变量，name、studLoc 和 finalMark 是可以用于进一步分析的最重要的变量。

进一步分析 - 我遇到的问题：推导出上述变量 name、studLoc、finalMark 的贡献。我使用主成分变量df.princomp（见上文），如df.princomp$quanti.var$contrib[,4]和df.princomp$quali.var$contrib[,2:3]。

我必须手动指定列索引[,2:3] 和[,4]。

我想要什么：我想知道如何做动态列索引分配，这样我就不必手动编码列表df.princomp中的列索引[,2:3]？

我已经看过以下类似的问题1、2、3 和4，但找不到我的解决方案？解决此问题的任何帮助或建议都会有所帮助。

【问题讨论】：

所以要明确一点，这里想要的结果到底是什么？图2中的柱高？
@MikkoMarttila 感谢您的关注。这很有帮助。我已经更新了这个问题，以便其他人可以更好地理解它（不仅仅是我:)）。我确信这是一个微不足道的答案，但我似乎无法理解它。
所以你说你想要一些函数f(pc1,pc2) 这样如果 pc1 是第二个组件，而 pc2 是第三个组件，那么你会得到 df.princomp$quanti.var$contrib[,2:3] 和 df.princomp$quali.var$contrib[,2:3]？
顺便说一句，您的代码并非完全可重现，它需要一个随机种子。当我运行它时，我得到的 pcas 略有不同
@Ashish 说你相信df.princomp$quanti.var$contrib[,4] 是finalMark 对Dim-1 的贡献对吗？如果是这样，我建议这种解释是不正确的，相反，finalMark 对 Dim-1 的贡献可以在这里找到 df.princomp$quanti.var$contrib["finalmark", "Dim.1"] 同样，名称和 studLoc 对 Dim-1 的贡献可以在 df.princomp$quali.var$contrib[1:10, "Dim.1"] %>% sum() 和 @987654356 找到@ 分别。请注意，这些是上面图 2 的前三列

标签： r cluster-analysis pca feature-selection unsupervised-learning

【解决方案1】：

不确定我对您问题的解释是否正确，如果不正确，请道歉。据我所知，您使用 PCA 作为初始工具，向您展示哪些变量在解释数据集时最重要。然后，您想回到原始数据，快速选择这些变量，无需每次手动编码，并将它们用于其他分析。

如果这是正确的，那么我已经保存了贡献图中的数据，过滤掉了贡献最大的变量，并使用该结果单独创建了一个包含这些变量的新数据框。

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than, say, 20

r<-rownames(dat[dat$contrib>20,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

#finalmark name    studLoc
#1         53    b POTYQ0002N
#2         73    i LWMTW1195I
#3         95    d VTUGO1685F
#4         39    f YCGGS5755N
#5         97    c GOSWE3283C
#6         58    g APBQD6181U
#7         67    a VUJOG1460V
#8         64    h YXOGP1897F
#9         15    j NFUOB6042V
#10        81    e QYTHG0783G

根据您的评论，您说您想“在 Dim.1 和 Dim.2 中查找值大于 5 的变量并将这些变量保存到新数据框中”，我会这样做：

#top contributors to both Dim 1 and 2

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1,2), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than 5

r<-rownames(dat[dat$contrib>5,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

（这将所有原始变量保留在我们的新数据框中，因为它们对总方差的贡献超过 5%）

【讨论】：

@j-con 你的解释是绝对正确的。我使用 PCA 作为无监督的特征选择方法。我有一种直觉，认为这是一个简单的答案，但我无法理解它。非常感谢。请给我一些时间incubate :-) 关于这个解决方案，我会尽快回复你。
酷！不用担心。我已经为这样的事情拉了很多头发，我发现从绘图输出中保存数据是一个非常有价值的技巧。
@j-con，感谢您的建议saving the data from the plot output is a very valuable trick。这是缺失的部分。您的解决方案按预期工作，也非常感谢所有其他用户，他们极大地帮助完善了问题和后续答案。没有你的帮助，这是不可能的。干杯。

【解决方案2】：

有很多方法可以提取单个变量对 PC 的贡献。对于数字输入，可以使用prcomp 运行PCA 并查看$rotation（我很快就和你谈过了，忘记了你这里有因素，所以prcomp 不能直接工作）。由于您使用的是factoextra::fviz_contrib，因此检查该函数如何在后台提取此信息是有意义的。键入factoextra::fviz_contrib并读取函数：

> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var", 
    "quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue", 
    color = "steelblue", sort.val = c("desc", "asc", "none"), 
    top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(), 
    ...) 
{
    sort.val <- match.arg(sort.val)
    choice = match.arg(choice)
    title <- .build_title(choice[1], "Contribution", axes)
    dd <- facto_summarize(X, element = choice, result = "contrib", 
        axes = axes)
    contrib <- dd$contrib
    names(contrib) <- rownames(dd)
    theo_contrib <- 100/length(contrib)
    if (length(axes) > 1) {
        eig <- get_eigenvalue(X)[axes, 1]
        theo_contrib <- sum(theo_contrib * eig)/sum(eig)
    }
    df <- data.frame(name = factor(names(contrib), levels = names(contrib)), 
        contrib = contrib)
    if (choice == "quanti.var") {
        df$Groups <- .get_quanti_var_groups(X)
        if (missing(fill)) 
            fill <- "Groups"
        if (missing(color)) 
            color <- "Groups"
    }
    p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill, 
        color = color, sort.val = sort.val, top = top, main = title, 
        xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt, 
        ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib, 
        linetype = 2, color = "red")
    p
}
<environment: namespace:factoextra>

所以它实际上只是从同一个包中调用facto_summarize。以此类推，你可以做同样的事情，只需调用：

> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
               name    contrib
ID               ID  0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark  7.1874438
subj2mark subj2mark 16.6831560
name           name 26.8610132
studLoc     studLoc 26.8610132

这是与您的图 2 对应的表格。对于 PC2，请使用 axes = 2 等等。

关于“如何以编程方式确定 PC 的列索引”，我不是 100% 确定我理解你想要什么，但如果你只想说列“finalmark”，抓住它对 PC3 的贡献，你可以执行以下操作：

library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")

# get the contribution of column 'finalmark' by name
contribution_df %>%
  filter(name == "finalmark")

# get the contribution of column 'finalmark' to PC3
contribution_df %>%
  filter(name == "finalmark" & PC == 3)

# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib

顺便说一句，我认为您的示例中的 ID 被视为数字而不是因子，但由于这只是一个示例，我不介意。

【讨论】：

感谢指针查看函数结构。这很有帮助。请务必编辑您的答案（如果可能）以包括那些“提取单个变量对 PC 的贡献的多种方法”。这对我和其他感兴趣的人会很有帮助。进一步，您能否详细说明变量dd 中包含的“如何以编程方式确定PC 的列索引”并将其映射到原始数据帧？谢谢。
更新了答案。这有帮助吗？我想我仍然没有 100% 理解您所说的“如何以编程方式确定 PC 的列索引”的意思，但是鉴于所有列名对所有 PC 的贡献表，我想您可以提取任何您想要的内容?
感谢您更新答案。我所说的“以编程方式确定列索引”的意思是，在知道本质上是变量的 PC 之后。是否可以获得这些变量的列索引？”这样我就不必对它们进行硬编码？我希望它有意义吗？欢迎任何想法或建议。
我已经用完了例子来进一步解释它。我已经尽可能详细（包括问题和 cmets）。感谢您迄今为止的所有投入，
我在某处读到前两台或三台 PC 的数据差异最大（暂时无法回忆来源）。所以基于上面代码中的这个推论，df.princomp$quanti.var$contrib[,1:5]包含了变量在Dim.1、Dim.2等5个维度上的贡献。是否可以为“在 Dim.1 AND Dim.2 中查找值大于 5 的变量并将这些变量保存到新数据帧”之类的问题编写代码。一旦确定了这些变量，我就可以将它们映射到原始数据框。 PCA 类似于特征选择。