Using R to change data frame to suitable heatmap matrix: Answers

【Question title】: Using R to change data frame to suitable heatmap matrix
【Posted】: 2017-03-10 21:27:32
【Question description】:

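I apologize if this is a poorly asked question. This is my first question on StackOverflow.

I have usage data for several apps, and I'm trying to turn it into a heatmap showing how usage overlaps between users across apps. I'm having trouble getting the data into a format suitable for visualizing a heatmap in corrplot (my preferred package for heatmap visualization).

The data is formatted so that each possible combination of app usage is one row (e.g., app1 alone, app2 alone, app1+app2, app3 alone, app1+app3, app2+app3, app1+app2+app3, and so on), together with the number of users who fit that exact usage profile (e.g., a user who has used only app1 and app3 contributes 1 to that particular row).

Example of the starting app data: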

df.start <- data.frame(appset = c("[app1]","[app2]","[app3]","[app1;app2]","[app2;app3]","[app1;app3]","[app1;app2;app3]"),
                       unique_users = c(1000, 400, 150, 300, 30, 130,10))

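I'd like to end up with the data in a form with the following properties:
1) Each row and each column represents one app (as in a correlation matrix), so for a set of 3 apps it should be a 3x3 matrix whose rows are 'app1', 'app2', 'app3' and whose columns are also 'app1', 'app2', 'app3'
(if it is easier to normalize by column, that would work too)

My goal is for it to look like this: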

df.end <- data.frame(app1 = c(1, 310/1440, 140/1440),
                     app2 = c(310/740, 1, 40/740),
                     app3 = c(140/320, 40/320, 1))
row.names(df.end) <- c('app1','app2','app3')

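(I've written the numbers as ratios such as "300/1430" to show the kind of calculation I want on each row to normalize the data, but in the end that value should appear as .20979 in that instance; that is how it shows up in R when the code is run, and that is how I want it to appear.)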
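
To make the arithmetic behind one of these cells explicit, here is a minimal sketch in base R that derives the 310/1440 entry of df.end from the example counts. The vector name `users` and the intermediate variable names are illustrative only and do not appear in the original post.

```r
# Users per exact app combination, copied from the example data.
users <- c("app1" = 1000, "app2" = 400, "app3" = 150,
           "app1;app2" = 300, "app2;app3" = 30, "app1;app3" = 130,
           "app1;app2;app3" = 10)

# Everyone who used app1 at all: every combination that contains app1.
total_app1 <- sum(users[c("app1", "app1;app2", "app1;app3", "app1;app2;app3")])

# Everyone who used both app1 and app2 (with or without app3).
both_app1_app2 <- sum(users[c("app1;app2", "app1;app2;app3")])

# Share of app1 users who also used app2.
ratio_app1_app2 <- both_app1_app2 / total_app1
```

Here total_app1 is 1000 + 300 + 130 + 10 = 1440 and both_app1_app2 is 300 + 10 = 310, which gives the 310/1440 ratio used in df.end.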

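I'm not attached to getting the data in this exact form; ultimately I just need a way to visualize cross-usage relationships across apps, and heatmaps have served me well for that purpose in the past. What I need is:
1) Automatic detection of the apps in the data by name to generate the rows and columns of the matrix (since I have more than the 3 example apps and want to rerun the code on various combinations of apps of interest for different purposes)
2) Numbers expressed as ratios between apps, so that both directions are represented somewhere in the data (e.g., the share of app1 users who also use app2, as well as the share of app2 users who also use app1).

I have already done the calculations for individual cells by hand (copying and pasting the results into Excel to build the table I need), but that is obviously a poor approach for reproducible results and for applying to new data sets.

I started by splitting the app sets into columns: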

library(dplyr) # provides mutate()

# grepl() already returns TRUE/FALSE, so no ifelse() wrapper is needed.
df.start <- mutate(df.start, 
                   app1 = grepl("app1", appset),
                   app2 = grepl("app2", appset),
                   app3 = grepl("app3", appset))

Find the total number of unique users for each app (used later to normalize the rows):

total_app1 <- sum(df.start$unique_users[df.start$app1])
total_app2 <- sum(df.start$unique_users[df.start$app2])
total_app3 <- sum(df.start$unique_users[df.start$app3])

Then generate the individual cells of the normalized data by hand, to copy and paste into Excel:

sum(df.start$unique_users[df.start$app1 & df.start$app1])/total_app1
sum(df.start$unique_users[df.start$app1 & df.start$app2])/total_app1
sum(df.start$unique_users[df.start$app1 & df.start$app3])/total_app1

sum(df.start$unique_users[df.start$app2 & df.start$app1])/total_app2
sum(df.start$unique_users[df.start$app2 & df.start$app2])/total_app2
sum(df.start$unique_users[df.start$app2 & df.start$app3])/total_app2

sum(df.start$unique_users[df.start$app3 & df.start$app1])/total_app3
sum(df.start$unique_users[df.start$app3 & df.start$app2])/total_app3
sum(df.start$unique_users[df.start$app3 & df.start$app3])/total_app3

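Obviously this isn't how it should be done if I want to automate this for data sets with additional apps, but I wanted to include what I've been doing so far in case it helps explain what I'm after.

Thanks in advance!

Edit: I left an important detail out of the example data, namely that an app set can contain more than two apps (e.g., there is a row for users who use all three apps).

【Question comments】:

Tags: r matrix heatmap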

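【Solution 1】:

OK... after a long read I think I understand what you're trying to do. This is mostly a data-cleaning problem, and the main task is getting the right matrix for your corrplot. Let's start from your df.start.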

require(stringr) #To handle the app names.
require(magrittr) #Pipe operator.

df.start$appset <- as.character(df.start$appset) %>% str_replace_all('\\[','') %>% str_replace_all('\\]','')
# Remove the annoying '[' and ']' first.

apps <- df.start$appset %>% str_split(';') %>% unlist() %>% unique()
# Get the names of all your apps.

apps.self <- paste(apps,apps,sep = ';')
df.start$appset[match(apps,df.start$appset)] <- apps.self
# Change 'app1' to 'app1;app1' format. 

appset.swap <- sapply(df.start$appset,function(x){paste(rev(unlist(str_split(x,';'))),collapse = ';')})
# Swap the app1;app2 to app2;app1. 

df.start <- rbind(df.start,data.frame(appset = appset.swap,unique_users = df.start$unique_users,row.names = NULL)) %>% unique()
# Assign values to the swapped appset, and merge with df.start. Now the dataframe looks much better.

df.start <- df.start[order(df.start$appset),]
mat <- matrix(df.start$unique_users,nrow = length(apps),ncol = length(apps))
# Arrange your appset alphabetically, and make the matrix.

mat <- sweep(mat,2,colSums(mat),'/')
diag(mat) <- 1
rownames(mat) <- apps
colnames(mat) <- apps
df.end <- as.data.frame(mat)
#Done.

I'm a bit puzzled about why the diagonal should be 1. The information about single-app users is lost.

【Comments】:

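Thanks! There's a lot of good stuff here that was very useful for the final solution. One thing I should have made clearer: app sets don't only come in pairs. For example, an app set can be ["app1;app2;app3"].
Finally, for the heatmap data the diagonal needs to be 1. We don't care how many users of app1 also used app1; we know that will always be 100%. Keeping the raw number of app1 users there would be more informative, but it would distort the heatmap's color gradient.
To get from the format I had to the matrix, it turned out I just needed nested sapply calls, like this: sapply(apps, function(m) sapply(apps, function(n) sum(df.start$unique_users[grepl(m,df.start$appset) & grepl(n,df.start$appset)])))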
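
As an alternative to the nested sapply calls above, the same overlap counts can probably be obtained with a single matrix product. The sketch below is not from the thread: it swaps grepl for a 0/1 membership matrix built with strsplit and %in%, and the names `df`, `sets`, `member`, and `overlap` are made up for illustration. It assumes the brackets have already been stripped and that there are at least two distinct apps.

```r
# Example data from the question, with the brackets already removed.
df <- data.frame(
  appset = c("app1", "app2", "app3", "app1;app2", "app2;app3",
             "app1;app3", "app1;app2;app3"),
  unique_users = c(1000, 400, 150, 300, 30, 130, 10),
  stringsAsFactors = FALSE
)

# Split each app set into its app names and collect the distinct apps.
sets <- strsplit(df$appset, ";", fixed = TRUE)
apps <- sort(unique(unlist(sets)))

# Membership matrix: one row per app set, one column per app
# (1 = the set contains the app). %in% compares whole names, so
# "app1" would not match a hypothetical "app10".
member <- t(vapply(sets, function(s) as.numeric(apps %in% s),
                   numeric(length(apps))))
colnames(member) <- apps

# overlap[i, j] = number of users whose app set contains both app i and app j.
overlap <- crossprod(member * df$unique_users, member)
```

The diagonal of `overlap` holds the total number of users per app (1440, 740, and 320 for the example data), so the raw counts mentioned in the comment stay available in this matrix even after a normalized copy is made for the heatmap.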

【Solution 2】:

Borrowing heavily from Feng, but with a few key changes, here is the code that does what my question asked for:

library(tidyverse)
library(stringr)

# Starting with the data
df.start <- data.frame(appset = c("[app1]","[app2]","[app3]","[app1;app2]","[app2;app3]","[app1;app3]","[app1;app2;app3]"),
                       unique_users = c(1000, 400, 150, 300, 30, 130,10))

# Remove [ ] and " characters first
df.start$appset <- as.character(df.start$appset) %>%
                   str_replace_all('\\[','') %>%
                   str_replace_all('\\]','') %>%
                   str_replace_all('\"','')

# Get unique names of the apps and alphabetize
apps <- df.start$appset %>%
        str_split(';') %>%
        unlist() %>%
        unique() %>%
        sort(decreasing = FALSE)

# Calculate the matrix of overlapping usage
apps.mat <- sapply(apps, function(m) sapply(apps, function(n) sum(df.start$unique_users[grepl(m,df.start$appset) & grepl(n,df.start$appset)])))
# This is the first critical change needed - this approach deals with any
# number of possible apps and combinations of those apps (not just if they 
# are initially reported in pairs).

# Normalize each row by diagonal (e.g. combined usage / total usage per app)  
apps.mat.norm <- sweep(apps.mat,1,diag(apps.mat),'/')
# Second critical change is switching the margin in sweep to 1 (rows) and 
# the stat to diag().  This way each row is normalized by the overlap
# of the apps usage to itself (i.e. total unique users in that app
# regardless of other app usage).  The diagonal should represent 100% 
# overlap between an app and itself.
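
As a sanity check, the normalized matrix can be compared with the hand-built df.end from the question. The sketch below rebuilds the matrix in base R so that it runs on its own; the variable `matches_target` is illustrative and not part of the original answer. Because each row i is divided by app i's total here, while df.end stores those values in column i, the two should agree after a transpose. The final corrplot call is only a suggestion and runs only if the corrplot package is installed; `is.corr = FALSE` tells corrplot that the input is a general matrix and not a correlation matrix.

```r
# Example data with brackets already stripped.
appset <- c("app1", "app2", "app3", "app1;app2", "app2;app3",
            "app1;app3", "app1;app2;app3")
unique_users <- c(1000, 400, 150, 300, 30, 130, 10)
apps <- c("app1", "app2", "app3")

# Overlap counts and row-wise normalization, as in the answer above.
apps.mat <- sapply(apps, function(m) sapply(apps, function(n)
  sum(unique_users[grepl(m, appset) & grepl(n, appset)])))
apps.mat.norm <- sweep(apps.mat, 1, diag(apps.mat), "/")

# Target matrix written out by hand in the question.
df.end <- data.frame(app1 = c(1, 310/1440, 140/1440),
                     app2 = c(310/740, 1, 40/740),
                     app3 = c(140/320, 40/320, 1))
row.names(df.end) <- c("app1", "app2", "app3")

# Compare values only (dimnames dropped), transposing the computed matrix.
matches_target <- isTRUE(all.equal(unname(t(apps.mat.norm)),
                                   unname(as.matrix(df.end))))

# Optional plot, assuming the corrplot package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(apps.mat.norm, is.corr = FALSE, method = "color")
}
```

If the column-oriented layout of df.end is preferred, t(apps.mat.norm) can be passed to the plot instead.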

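I think some of the changes I had to make were due to my poor explanation of the problem. I apologize for that, but I really appreciate the great help with several of the data-management issues I ran into!

【Comments】: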