In R, how can I find the number of connections in a given data frame and produce a variable representing it? (Answers)

【Question title】: In R how can I find the number of connections I have in a given dataframe and produce a variable representing it?
【Posted】: 2019-05-12 20:21:07
【Question】:

I currently have a data frame representing a social network, like this:

id age  id1    id2   id3   
01  14  02      05    03        
02  23  01      05    03        
03  52  04      01    02        
04  41  03                      
05  32  01      02

Ideally, I would like a new data frame that looks like this:

id age  id1    id2   id3   Connections
01  14  02      05    03        3
02  23  01      05    03        3
03  52  04      01    02        3
04  41  03                      1
05  32  01      02              2

where the new variable gives the number of connections each "id" has. So far I have the following code:

links <- df
links <- as.matrix(links)
links <- as.data.frame(rbind(links[,c(1,3)], links[,c(1,4)]), links[,c(1,5)])
head(links)

library(igraph)
g = graph.data.frame(links)
m = as.matrix(get.adjacency(g))
m
pmax(rowSums(m), colSums(m))

This gives me:

 1  2  3  4  5 NA 
 3  3  3  1  2  3

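How do I then merge this into the data frame to create the "Connections" variable? My other data contains up to 50 connections, so ideally I would like a simpler approach that does not require rebuilding the data frame.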

【Comments】:

Maybe? df$connections <- rowSums(!is.na(df[, c("id1", "id2", "id3")])) or, more flexibly: df$connections <- rowSums(!is.na(df[, grepl("id[0-9]+", names(df))]))

Tags: r dataframe igraph

【Solution 1】:

A quick tidyverse approach is to reshape the data to long format, count how many non-NA values each ID has, and then reshape back to wide format.

library(tidyverse)

df %>%
  gather(key = key, value = val, -id, -age) %>%
  group_by(id, age) %>%
  mutate(connections = sum(!is.na(val))) %>%
  head()
#> # A tibble: 6 x 5
#> # Groups:   id, age [5]
#>   id      age key   val   connections
#>   <chr> <dbl> <chr> <chr>       <int>
#> 1 01       14 id1   02              3
#> 2 02       23 id1   01              3
#> 3 03       52 id1   04              3
#> 4 04       41 id1   03              1
#> 5 05       32 id1   01              2
#> 6 01       14 id2   05              3

df %>%
  gather(key = key, value = val, -id, -age) %>%
  group_by(id, age) %>%
  mutate(connections = sum(!is.na(val))) %>%
  spread(key = key, value = val)
#> # A tibble: 5 x 6
#> # Groups:   id, age [5]
#>   id      age connections id1   id2   id3  
#>   <chr> <dbl>       <int> <chr> <chr> <chr>
#> 1 01       14           3 02    05    03   
#> 2 02       23           3 01    05    03   
#> 3 03       52           3 04    01    02   
#> 4 04       41           1 03    <NA>  <NA> 
#> 5 05       32           2 01    02    <NA>
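
As a side note, gather() and spread() are superseded in current tidyr. A minimal sketch of the same long/count/wide reshape with pivot_longer() and pivot_wider() might look like the following. The data frame is retyped from the question with ids stored as character, which is an assumption, and `out` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Data retyped from the question; ids as character is an assumption
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", NA, "02"),
  id3 = c("03", "03", "02", NA, NA),
  stringsAsFactors = FALSE
)

out <- df %>%
  # long format: one row per (id, connection column)
  pivot_longer(cols = -c(id, age), names_to = "key", values_to = "val") %>%
  group_by(id, age) %>%
  # count the non-NA connections within each id
  mutate(connections = sum(!is.na(val))) %>%
  ungroup() %>%
  # back to wide format: one row per id
  pivot_wider(names_from = key, values_from = val)

out
```

Calling ungroup() before widening keeps the result from carrying the grouping along, which is easy to forget when the pipeline continues.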

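That said, I wouldn't consider your first approach wrong. Since you are working with a network, it makes sense to use network-analysis tools and compute the degree of each node (which is the same as its number of connections).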
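
For what it's worth, the degree idea can also be sketched in base R without any graph package (igraph's degree() would be the more idiomatic route if you already have a graph object). This is only a sketch under some assumptions: links are treated as undirected, ids are stored as character, and every id is unique. The names `link_cols`, `edges`, `und` and `deg` are made up for the illustration.

```r
# Data retyped from the question; ids as character is an assumption
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", NA, "02"),
  id3 = c("03", "03", "02", NA, NA),
  stringsAsFactors = FALSE
)

# Connection columns: "id" followed by digits (id1, id2, ..., id50)
link_cols <- grep("^id[0-9]+$", names(df), value = TRUE)

# Long edge list: one row per (id, neighbour) pair, missing links dropped
edges <- data.frame(
  from = rep(df$id, times = length(link_cols)),
  to   = unlist(df[link_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
edges <- edges[!is.na(edges$to), ]

# Undirected: order each pair so that 01-02 and 02-01 collapse to one edge
und <- unique(data.frame(
  a = pmin(edges$from, edges$to),
  b = pmax(edges$from, edges$to),
  stringsAsFactors = FALSE
))

# Degree = number of distinct edges touching each id (0 if none)
deg <- table(factor(c(und$a, und$b), levels = df$id))
df$Connections <- as.integer(deg[df$id])

df
```

Unlike counting non-NA cells per row, this also credits an id for links that only the other party lists, so the two approaches agree only when the data are symmetric, as they are in the question's example.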


【Solution 2】:

library(dplyr)
# Toy data
df = data.frame(id = c(1,2,3,4), 
                age = c(1, 1, 1, 1), 
                id1 = c(1, 2, 3, 4), 
                id2 = c(1, 2, 3, NA), 
                id3 = c(1,2, NA, NA))

df$Connections = df %>%
  select(-id, -age) %>% # Remove unnecessary columns
  apply(1, function(row) {
    binary_row = as.numeric(!is.na(row)) # Convert each column to binary
    sum(binary_row) # Return connection count
  })


【Solution 3】:

How about something like this:

First, we use a regex to identify the columns that correspond to connections

# here connections columns must contain the pattern "id"+digit(s)
connectionsNames <- grepl("id\\d+", names(df), perl = TRUE)

Then we use rowSums to create the new column

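# count NAs only in the connection columns, so NAs elsewhere are not subtracted
df$connections <- sum(connectionsNames) - rowSums(is.na(df[, connectionsNames, drop = FALSE]))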

Here is the result

df
  id age id1 id2 id3 connections
1  1   1   1   1   1           3
2  2   1   2   2   2           3
3  3   1   3   3  NA           2
4  4   1   4  NA  NA           1
