R: extract specific text inside a string (answers)

【Question title】: R extract specific text inside a string
【Posted】: 2020-12-30 18:53:03
【Question】:

I have a data.table with 1 million rows, and each cell looks like this:

ENST00000408384 // ENSEMBL // ncrna:miRNA chromosome:GRCh37:1:30366:30503:1 gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000469289 // ENSEMBL // havana:known chromosome:GRCh38:1:30267:31109:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000473358 // ENSEMBL // havana:known chromosome:GRCh38:1:29554:31097:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002840 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002841 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0

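I need to extract whatever comes immediately after "gene_biotype:" (in this example, "miRNA"). How can I do that?

I tried to find a solution with stringr and regex and gave up after a few hours. Thanks for your help.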

【Comments】:

There is also gene_biotype:lincRNA. Don't you want to get that one too?
Yes, of course, that one is needed as well. Thanks for the comment.

Tags: r regex string stringr

【Solution 1】:

You can try regmatches with regexpr. Note that regexpr returns only the first match in each string.

regmatches(x, regexpr("(?<=gene_biotype\\:)\\w*", x, perl=TRUE))
# [1] "miRNA"

Data:

x <- "
ENST00000408384 // ENSEMBL // ncrna:miRNA chromosome:GRCh37:1:30366:30503:1 gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000469289 // ENSEMBL // havana:known chromosome:GRCh38:1:30267:31109:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000473358 // ENSEMBL // havana:known chromosome:GRCh38:1:29554:31097:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002840 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002841 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0
"
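
Since the comments ask for the lincRNA values as well, here is a minimal base R sketch that returns every match rather than only the first. It swaps regexpr for gregexpr; the sample string is a shortened, hypothetical stand-in for the real data (IDs such as ENST1 are made up).

```r
# Shortened, hypothetical sample in the same "///"-separated layout as the data above
x <- "ENST1 // ENSEMBL // gene_biotype:miRNA transcript_biotype:miRNA // chr1 /// ENST2 // ENSEMBL // gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 /// OTT1 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1"

# gregexpr() locates every match; regmatches() then returns a list
# with one character vector per input string, so take element [[1]]
all_biotypes <- regmatches(x, gregexpr("(?<=gene_biotype:)\\w+", x, perl = TRUE))[[1]]
all_biotypes
# [1] "miRNA"   "lincRNA" "lincRNA"

# Distinct values only
unique(all_biotypes)
# [1] "miRNA"   "lincRNA"
```

The lookbehind needs perl = TRUE, because R's default regex engine does not support lookaround.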

【Comments】:

【Solution 2】:

We can use a regex lookaround (here a lookbehind) with str_extract:

library(stringr)
str_extract(df1$col1, "(?<=gene_biotype:)\\w+")
#[1] "miRNA"

If we need all the matches, use str_extract_all:

str_extract_all(df1$col1, "(?<=gene_biotype:)\\w+")
#[[1]]
#[1] "miRNA"   "lincRNA" "lincRNA" "lincRNA" "lincRNA"

Data:

df1 <- structure(list(col1 = "\nENST00000408384 // ENSEMBL // ncrna:miRNA chromosome:GRCh37:1:30366:30503:1 gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000469289 // ENSEMBL // havana:known chromosome:GRCh38:1:30267:31109:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000473358 // ENSEMBL // havana:known chromosome:GRCh38:1:29554:31097:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002840 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002841 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0\n"), class = "data.frame", row.names = c(NA, 
-1L))
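
The question involves many rows, so here is a small sketch of how the same stringr calls might behave on a multi-row column, including a row with no match. The three-row data frame and the column names first_biotype and all_biotypes are hypothetical; the ";" separator is an arbitrary choice.

```r
library(stringr)

# Hypothetical three-row column: two annotations, one annotation, no annotation
df <- data.frame(
  col1 = c(
    "ENST1 // gene_biotype:miRNA transcript_biotype:miRNA /// ENST2 // gene_biotype:lincRNA transcript_biotype:lincRNA",
    "ENST3 // gene_biotype:lincRNA transcript_biotype:lincRNA",
    "---"
  ),
  stringsAsFactors = FALSE
)

pattern <- "(?<=gene_biotype:)\\w+"

# First match per row; rows without a match become NA
df$first_biotype <- str_extract(df$col1, pattern)

# All distinct matches per row, collapsed into a single string
matches <- str_extract_all(df$col1, pattern)
df$all_biotypes <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else paste(unique(m), collapse = ";"),
  character(1)
)

df[, c("first_biotype", "all_biotypes")]
```

Both str_extract and str_extract_all are vectorized over the input, so no explicit loop over rows is needed; with a data.table the same calls should work inside a `:=` assignment.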

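【Comments】:

【Solution 3】:

Maybe not the best approach, but it works: split the string on spaces, keep the tokens that contain "gene_biotype:", and strip everything up to and including that prefix.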

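# Split the string into space-separated tokens
splited <- strsplit(text, " ")[[1]]
# Keep only tokens that contain the tag (also catches "transcript[gene_biotype:...")
splited <- splited[grepl("gene_biotype:", splited, fixed = TRUE)]
# Drop everything up to and including the tag, then de-duplicate
unique(sub(".*gene_biotype:", "", splited))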

This gives:

"miRNA"   "lincRNA"

Data:

text <- "ENST00000408384 // ENSEMBL // ncrna:miRNA chromosome:GRCh37:1:30366:30503:1 gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000469289 // ENSEMBL // havana:known chromosome:GRCh38:1:30267:31109:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000473358 // ENSEMBL // havana:known chromosome:GRCh38:1:29554:31097:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002840 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002841 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0"
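
The splitting idea can also be applied one level up: splitting on the " /// " record separator keeps each biotype paired with its transcript ID. This is only a sketch under the assumption that every record starts with its ID followed by " // "; the shortened input and its IDs are hypothetical.

```r
# Hypothetical shortened input in the same "///"-separated layout
text <- "ENST1 // ENSEMBL // gene_biotype:miRNA transcript_biotype:miRNA // chr1 /// ENST2 // ENSEMBL // gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 /// OTT1 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1"

# One element per transcript record
records <- strsplit(text, " /// ", fixed = TRUE)[[1]]

# The ID is everything before the first " // " separator
ids <- sub(" // .*$", "", records)

# Capture the word after "gene_biotype:"; records without the tag become NA
has_tag <- grepl("gene_biotype:", records, fixed = TRUE)
biotypes <- ifelse(
  has_tag,
  sub(".*?gene_biotype:(\\w+).*", "\\1", records, perl = TRUE),
  NA_character_
)

# Named character vector: transcript ID -> gene biotype
result <- setNames(biotypes, ids)
result
```

Compared with extracting all matches from the whole string, this keeps track of which transcript each biotype belongs to, at the cost of one extra split per cell.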

【Comments】: