Extract data points from a dataframe based on correlation threshold in R

【Question title】: Extract data points from a dataframe based on correlation threshold in R
【Posted】: 2021-06-09 03:12:33
【Question】:

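I have a data frame with two columns, X and Y. How can I extract a given number of rows from it such that the correlation between X and Y is below some threshold? For example, from the data frame below I want to extract 10 rows such that the correlation between X and Y is less than 0.4.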

df <- structure(list(X = c(0.47, 0.4723, 0.4747, 0.4771, 0.4794, 0.4818, 0.4842, 0.4866, 0.4889, 0.4913, 0.4937, 0.4961, 0.4984, 0.5008, 0.5032, 0.5103, 0.5173, 0.5244, 0.5315, 0.5386, 0.5457, 0.5527, 0.5598, 0.5709, 0.582, 0.593, 0.6041, 0.6152, 0.6263, 0.6373), Y = c(NaN, 255.5, 440, 110.5, 197.25, 438, 100, 467.75, 483.5, 492.25, 489.25, 503, 511.25, 508.25, 505, 511, 503.33, 501, 509.25, 508.25, 165.33, 102, 461.25, 392, 530.75, 537.75, NaN, 601, 523, 120)), row.names = c(NA, 30L), class = "data.frame")

Also, if several such sets of data points satisfy the condition, I would like to extract all of those possible sets.

【Comments】:

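FYI, your 30 rows contain choose(30, 10), that is, more than 30 million, possible sets of 10 rows.
Right! I think there should be some clever way to search the data.
What about a correlation of -.9? That is below 0.4. Since your example yields a correlation of 0.099 once the two rows with NaN are removed, you just need to select 18 of the remaining rows: cor(df.mat[sample(rows, 18), ])[1, 2]. Some draws will exceed 0.4, but skip those.
@dcarlson A set with a corr of -0.9 would also satisfy corr < 0.4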
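
To put the size of the search space mentioned in the comments into numbers, here is a quick check in R. It is only a sketch: the variable names are illustrative, and the count of 28 assumes the two rows with NaN in Y are dropped first, as suggested above.

```r
# Number of distinct 10-row subsets when all 30 rows are candidates
n_sets_all <- choose(30, 10)
n_sets_all
# [1] 30045015

# Number of distinct 10-row subsets once the 2 rows with NaN in Y are removed
n_sets_clean <- choose(28, 10)
n_sets_clean
# [1] 13123110
```

Either way, enumerating every subset and computing a correlation for each is a lot of work for a data frame this small, which is why random sampling is proposed instead.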

Tags: r dataframe correlation purrr lubridate

【Solution 1】:

Starting from your df:

df <- na.omit(df)         # Remove rows with NA/NaN 
df.mat <- as.matrix(df)   # Convert to matrix for `cor`
rows <- row.names(df.mat) # Get row names for sampling
cor(df.mat)[1, 2]         # Get correlation
# [1] 0.09961473
set.seed(42)              # Set seed for this example
(sub <- sample(rows, 18))
#  [1] "18" "6"  "2"  "26" "11" "5"  "19" "30" "16" "8"  "24" "29" "15" "23" "25" "3"  "4"  "10"
cor(df.mat[sub, ])[1, 2]
# [1] 0.1082971
(sub <- sample(rows, 18))
# [1] "26" "29" "5"  "6"  "14" "30" "21" "3"  "9"  "4"  "2"  "11" "23" "15" "12" "7"  "18" "17"
cor(df.mat[sub, ])[1, 2]
# [1] 0.07151581
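
The "sample, check, skip" idea above can be wrapped in a loop that keeps only the subsets whose correlation falls below the threshold. The following is a minimal sketch, not part of the original answer: `find_low_cor_sets` and its arguments `size`, `threshold`, and `n_draws` are hypothetical names chosen for illustration. Because it relies on random draws, it finds some of the qualifying sets, not necessarily all of them.

```r
# Data copied from the question
df <- data.frame(
  X = c(0.47, 0.4723, 0.4747, 0.4771, 0.4794, 0.4818, 0.4842, 0.4866,
        0.4889, 0.4913, 0.4937, 0.4961, 0.4984, 0.5008, 0.5032, 0.5103,
        0.5173, 0.5244, 0.5315, 0.5386, 0.5457, 0.5527, 0.5598, 0.5709,
        0.582, 0.593, 0.6041, 0.6152, 0.6263, 0.6373),
  Y = c(NaN, 255.5, 440, 110.5, 197.25, 438, 100, 467.75, 483.5, 492.25,
        489.25, 503, 511.25, 508.25, 505, 511, 503.33, 501, 509.25, 508.25,
        165.33, 102, 461.25, 392, 530.75, 537.75, NaN, 601, 523, 120)
)
df <- na.omit(df)  # drop rows with NA/NaN, as in the answer

# Hypothetical helper: draw `n_draws` random subsets of `size` rows and keep
# those whose X-Y correlation is below `threshold`.
# Returns a named numeric vector: names are the comma-separated original row
# labels of each subset, values are the corresponding correlations.
find_low_cor_sets <- function(df, size, threshold, n_draws) {
  found <- numeric(0)
  for (i in seq_len(n_draws)) {
    idx <- sort(sample(nrow(df), size))      # sorted so duplicates share a key
    r <- cor(df$X[idx], df$Y[idx])
    if (!is.na(r) && r < threshold) {
      key <- paste(row.names(df)[idx], collapse = ",")
      found[key] <- r                        # a repeated subset overwrites itself
    }
  }
  found
}

set.seed(42)  # for reproducibility of this example only
sets <- find_low_cor_sets(df, size = 10, threshold = 0.4, n_draws = 1000)
length(sets)    # number of distinct qualifying subsets found
head(sets, 3)   # first few subsets with their correlations
```

As the comments point out, `r < threshold` also accepts strongly negative correlations such as -0.9. If the intent is a weak correlation in either direction, the condition could be changed to `abs(r) < threshold`.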

【Comments】: