How to convert a matrix of strings into a matrix of 0s and 1s: answers

【Question title】: How to convert a matrix of strings into matrix of 0 and 1's
【Posted】: 2013-05-16 20:25:37
【Question】:

Hi, I have a dataset that looks like this:

name1  a b c d  
name2  a c e g i  
name3  t j i m n z

dput output:

structure(c("name1", "name2", "name3", "a ", "a", "r ", "b", "c", "k ", "c", "e", "l", "d", "t", "o", "e", "j", "m", "", "k", "n"), .Dim = c(3L, 7L), .Dimnames = list(NULL, c("V1", "V2", "V3", "V4", "V5", "V6", "V7")))

I would like to convert it into a matrix like this:

         a b c d e g i j m n t z
name1    1 1 1 1 0 0 0 0 0 0 0 0
name2    1 0 1 0 1 1 1 0 0 0 0 0 
name3    0 0 0 0 0 0 1 1 1 1 1 1

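How can I do this in R?

【Comments】:

Are the names row names or a column?
The names are row names, and the column names come from the union of all values present in the dataset.
I meant the raw data. It is basically equivalent to what @Joran said. This is very doable, but the approach depends on the structure of the raw data.

Tags: r matrix reshape2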

【Solution 1】:

## Assuming this is your starting data
dat <- read.table(text="name1  a b c d  NA NA\nname2  a c e g i NA\nname3  t j i m n z")
rownames(dat) <- dat$V1 
dat$V1 <- NULL

I am assuming your data looks something like the above.

## store the rownames
NM <- rownames(dat)  # or NM <- c("name1", "name2", "name3")

## IMPORTANT. Make sure you have characters, not factors. 
dat <- sapply(dat, as.character)

cols <- sort(unique(as.character(unlist(dat))))

results <- sapply(cols, function(cl) apply(dat, 1, `%in%`, x=cl))
results[] <- as.numeric(results)

rownames(results) <- NM

results

      a b c d e g i j m n t z
name1 1 1 1 1 0 0 0 0 0 0 0 0
name2 1 0 1 0 1 1 1 0 0 0 0 0
name3 0 0 0 0 0 0 1 1 1 1 1 1
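The answer above starts from a data.frame built with read.table. If the input is instead the character matrix shown in the question's dput output, the same idea could be applied roughly as follows. This is a minimal sketch, not the answerer's code: it assumes the first column holds the row names, that trailing spaces in values such as "a " are noise, and that an empty string means "no value". Note that the dput values differ from the example table in the question, so the resulting columns differ too.

```r
## Character matrix copied from the question's dput output
raw <- structure(c("name1", "name2", "name3", "a ", "a", "r ", "b", "c", "k ",
                   "c", "e", "l", "d", "t", "o", "e", "j", "m", "", "k", "n"),
                 .Dim = c(3L, 7L),
                 .Dimnames = list(NULL, c("V1", "V2", "V3", "V4", "V5", "V6", "V7")))

## Split off the names, then clean the values
## (assumption: trailing spaces are noise and "" means missing)
NM   <- raw[, 1]
vals <- raw[, -1, drop = FALSE]
vals[] <- trimws(vals)
vals[vals == ""] <- NA

## Distinct values become the column names; sort() drops NA
cols <- sort(unique(as.vector(vals)))

## One row per name, one 0/1 column per distinct value
results <- t(apply(vals, 1, function(r) as.numeric(cols %in% r)))
dimnames(results) <- list(NM, cols)
results
```

Each row of `results` then has a 1 for every value present in the corresponding row of the input and a 0 elsewhere.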

【Comments】:

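10521402_C_T a 10521502_A_T b 10521576_G_A c 10521624_G_A d 10521769_G_T e 10521798_T_A f 10521913_C_T a 10523162_T_C b 10523562_C_T a 10527303_T_C t 10529333_C_A h 10529384_G_C f 10529384_G_C a 10527303_T_C c | The table looks like this. Hi, I am really confused about how to get this step done. Since it contains duplicate rows, I need them shown as a single row with values in two different columns. Could you take a look and let me know if you have any suggestions?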
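Hi @dissw.geek9. Code in comments is hard to read. If you have a follow-up question, it is perfectly fine to open a new question and simply state that it is a response or follow-up to this one. (There is a "share" button at the bottom of the original question that you can use to link back to it.)
Hi, thanks for the reply. I found a solution to the problem I was facing.
@RicardoSaporta, I know this is old, but you might be interested in the benchmarks in my answer. Fun stuff!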

【Solution 2】:

Here is one approach:

qw = function(s) unlist(strsplit(s,'[[:blank:]]+'))
name1 <- qw("a b c d")
name2 <- qw("a c e g i")
name3 <- qw("t j i m n z")

rows <- qw("name1 name2 name3")
cols <- sort(unique(c(name1,name2,name3)))

nr <- length(rows)
nc <- length(cols)

outmat <- matrix(0,nr,nc,dimnames=list(rows,cols))

for (i in rows){
    outmat[i,get(i)] <- 1
}

#       a b c d e g i j m n t z
# name1 1 1 1 1 0 0 0 0 0 0 0 0
# name2 1 0 1 0 1 1 1 0 0 0 0 0
# name3 0 0 0 0 0 0 1 1 1 1 1 1

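The function qw is not strictly necessary, but it makes your example easier to read.

【Comments】:

For reading text, you can simply use read.table(text=..). It automatically splits columns on whitespace.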
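As a rough illustration of the read.table(text=..) suggestion, the sketch below reads the three rows from the question directly. The `fill = TRUE`, `row.names = 1`, and `na.strings` arguments are additions assumed here so that rows of different lengths are padded and the padding is treated as missing; they are not part of the comment itself.

```r
txt <- "name1 a b c d
name2 a c e g i
name3 t j i m n z"

## fill = TRUE pads the shorter rows; row.names = 1 uses the first
## field of each line as the row name
dat <- read.table(text = txt, fill = TRUE, row.names = 1,
                  na.strings = c("NA", ""), stringsAsFactors = FALSE)

## Extra safeguard: treat any remaining empty strings as missing
vals <- as.matrix(dat)
vals[!nzchar(vals)] <- NA
dim(vals)
```

The resulting `vals` is a character matrix with one row per name, which can be passed to any of the approaches in this thread.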

【Solution 3】:

Update: a faster alternative

You will get the best speed by using matrix indexing. Here is an example (with comments, so you can see what is going on).

## Assuming this is your starting data
dat <- read.table(text="name1  a b c d  NA NA\nname2  a c e g i NA\nname3  t j i m n z")
rownames(dat) <- dat$V1 
dat$V1 <- NULL

## Convert the data.frame into a single character vector
A <- unlist(lapply(dat, as.character), use.names = FALSE)

## Identify the unique levels
levs <- sort(unique(na.omit(A)))

## Get the index position for the Row/Column combination
##   that needs to be recoded as "1"
Rows <- rep(sequence(nrow(dat)), ncol(dat))
Cols <- match(A, levs)

## Create an empty matrix
m <- matrix(0, nrow = nrow(dat), ncol = length(levs),
            dimnames = list(rownames(dat), levs))

## Use matrix indexing to replace the relevant values with 1
m[cbind(Rows, Cols)] <- 1L
m
#       a b c d e g i j m n t z
# name1 1 1 1 1 0 0 0 0 0 0 0 0
# name2 1 0 1 0 1 1 1 0 0 0 0 0
# name3 0 0 0 0 0 0 1 1 1 1 1 1
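The key step above is indexing a matrix with a two-column matrix, where each row of the index is one (row, column) pair. A tiny sketch of just that step, with made-up names (`r1`, `r2`, and a three-letter `levs`) that are not from the answer:

```r
levs <- c("a", "b", "c")
m <- matrix(0, nrow = 2, ncol = length(levs),
            dimnames = list(c("r1", "r2"), levs))

## Values observed per row, flattened: r1 has "a"; r2 has "b" and "c"
Rows <- c(1, 2, 2)
Cols <- match(c("a", "b", "c"), levs)

## Each row of the index matrix addresses exactly one cell of m
idx <- cbind(Rows, Cols)
m[idx] <- 1
m
```

Because all target cells are set in a single vectorized assignment, no loop over rows or columns is needed, which is a plausible reason for the speed advantage reported in the benchmark.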

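Benchmarks

After creating a 30,000-row version of the initial data.frame, I benchmarked Ricardo's answer, my data.table answer, and the matrix indexing answer. The results are as follows: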

dat2 <- dat ## A backup
dat <- do.call(rbind, replicate(10000, dat, simplify = FALSE))
dim(dat)
# [1] 30000     6

library(microbenchmark)
microbenchmark(AM(), AMDT(), RS(), times = 10)
# Unit: milliseconds
#    expr        min         lq     median        uq       max neval
#    AM()   44.30915   56.21873   57.95815   86.1518  265.3053    10
#  AMDT()  231.71928  245.64236  291.19601  376.8983  515.8216    10
#    RS() 4414.01127 4698.47293 4731.72877 5484.6185 5726.8092    10

Matrix indexing clearly wins, but given how concise the data.table syntax is for getting the job done, I prefer that approach! @Arun ported Hadley's "reshape2" work to data.table!!!
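The benchmark calls AM(), AMDT(), and RS(), whose definitions are not shown in the post. A plausible reading is that they wrap the matrix indexing, data.table, and sapply approaches respectively. The sketch below gives hypothetical definitions for the two base R ones, written here for illustration only; it also drops NA index pairs explicitly, which the original answer does not do.

```r
## Starting data, as in the answers above
dat <- read.table(text = "name1  a b c d  NA NA\nname2  a c e g i NA\nname3  t j i m n z",
                  stringsAsFactors = FALSE)
rownames(dat) <- dat$V1
dat$V1 <- NULL

## Hypothetical wrapper around the matrix indexing approach
AM <- function() {
  A    <- unlist(lapply(dat, as.character), use.names = FALSE)
  levs <- sort(unique(na.omit(A)))
  Rows <- rep(sequence(nrow(dat)), ncol(dat))
  Cols <- match(A, levs)
  keep <- !is.na(Cols)               # skip cells that hold NA
  m <- matrix(0, nrow = nrow(dat), ncol = length(levs),
              dimnames = list(rownames(dat), levs))
  m[cbind(Rows[keep], Cols[keep])] <- 1
  m
}

## Hypothetical wrapper around the sapply/apply approach from Solution 1
RS <- function() {
  d    <- sapply(dat, as.character)
  cols <- sort(unique(as.character(unlist(d))))
  res  <- sapply(cols, function(cl) apply(d, 1, `%in%`, x = cl))
  res[] <- as.numeric(res)
  rownames(res) <- rownames(dat)
  res
}

## Both wrappers should produce the same 0/1 matrix
all.equal(AM(), RS())
```

With wrappers like these defined, a call such as `microbenchmark(AM(), RS(), times = 10)` would time them, assuming the microbenchmark package is installed.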

Original answer

Here is a "data.table" alternative. It requires at least version 1.8.11 of "data.table".

Load the required packages

library(data.table)
library(reshape2)
packageVersion("data.table")
# [1] ‘1.8.11’

`melt` and `cast` your `data.table`

DT <- data.table(dat, keep.rownames=TRUE)
dcast.data.table(melt(DT, id.vars="rn"), rn ~ value)
# Aggregate function missing, defaulting to 'length'
#       rn NA a b c d e g i j m n t z
# 1: name1  0 1 1 1 1 0 0 0 0 0 0 0 0
# 2: name2  0 1 0 1 0 1 1 1 0 0 0 0 0
# 3: name3  0 0 0 0 0 0 0 1 1 1 1 1 1
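For readers without data.table installed, the same melt-then-cast idea can be sketched in base R: reshape to one (row name, value) pair per cell, then count pairs with table(). This is a swapped-in base R technique, not the answer's data.table code, and unlike the output above it drops the NA values before counting.

```r
dat <- read.table(text = "name1  a b c d  NA NA\nname2  a c e g i NA\nname3  t j i m n z",
                  row.names = 1, stringsAsFactors = FALSE)
mat <- as.matrix(dat)

## "Melt": one (row name, value) pair per cell, dropping missing values
long <- data.frame(rn = rownames(mat)[row(mat)],
                   value = as.vector(mat),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$value), ]

## "Cast": count how often each value occurs for each row name
counts <- table(long$rn, long$value)
counts
```

As with the default `length` aggregation in the data.table version, a value that appears twice in the same row would produce a count of 2 rather than 1.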

【Comments】:
