Generating a table of unique substrings in R: answers

【Question title】: Generating a table of unique substrings in R
【Posted】: 2019-11-26 13:11:01
【Question description】:

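I have a very large dataset, and I want the unique values of a column with about 400,000 observations. Each observation looks like identifier:abzcd:def:RANDOMNUMBERSTRING or identifier:de:ghijklm:RANDOMNUMBERSTRING. I only want the unique matches for the part before the random number string; in other words, I want to filter out duplicates of the code identifier:LETTERS:LETTERS. The unique function does not work on its own, and with substr it seems I would need to know in advance exactly which substrings to filter on or how long each substring is. Any suggestions on how to do this?

Here is some data that can serve as a model: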

randz <- data.frame(id =
                      sprintf("identifier:%s%s%s:%s%s%s:%s",
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(6000:7000, 1000, replace = TRUE)))
randz

【Question discussion】:

Tags: r unique substr

【Solution 1】:

Here is a simple approach using the tidyverse.

# Fake Data
randz <- data.frame(id =
                      sprintf("identifier:%s%s%s:%s%s%s:%s",
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(letters, 1000, replace = TRUE),
                              sample(6000:7000, 1000, replace = TRUE)))

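Here I use the str_remove function to remove the last colon (:) together with the digits (\d+) that follow it; the "$" anchors the match to the end of the string. count then returns each unique value, and the "n" column shows how many times it occurs.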


# Libraries
library(tidyverse)
randz %>% 
  mutate(out = str_remove(string = id,
                           pattern = ":\\d+$")) %>% 
  count(out,sort = TRUE)

Output:

# A tibble: 1,000 x 2
   out                    n
   <chr>              <int>
 1 identifier:aar:muk     1
 2 identifier:abe:tlo     1
 3 identifier:abg:qux     1
 4 identifier:abh:bxx     1
 5 identifier:abl:vdj     1
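
For comparison, the same idea can be sketched in base R with no packages loaded. This is an alternative to the answer's method, not part of it. The names ids, prefix and counts are made up for illustration, and the small vector stands in for randz$id.

```r
# Hypothetical sample standing in for randz$id (assumed data, not from the question)
ids <- c("identifier:abc:def:6001",
         "identifier:abc:def:6542",
         "identifier:de:ghijklm:6999")

# Drop the last colon and everything after it
prefix <- sub(":[^:]*$", "", ids)

# Each prefix once, in order of first appearance
unique(prefix)

# How often each prefix occurs, most frequent first
counts <- sort(table(prefix), decreasing = TRUE)
counts
```

Here the pattern ":[^:]*$" strips whatever follows the final colon, so it does not require the suffix to be all digits, unlike ":\\d+$".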

【Discussion】:

【Solution 2】:

You can extract them with a regular expression. Here is an example using the stringr package.

library(stringr)
str_extract("identifier:de:ghijklm:RANDOMNUMBERSTRING", "(identifier\\:[a-z]+\\:[a-z]+)")
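
The call above works on a single string. To get the table of unique codes the question asks for, one possible extension is to run str_extract over the whole column and pass the result to unique. The sketch below assumes stringr is installed; ids, codes and unique_codes are illustrative names, and the vector is made-up data standing in for the real column.

```r
library(stringr)

# Hypothetical sample standing in for the real column (assumed data)
ids <- c("identifier:abzcd:def:12345",
         "identifier:abzcd:def:67890",
         "identifier:de:ghijklm:24680")

# str_extract is vectorized: it returns one match (or NA) per element
codes <- str_extract(ids, "^identifier:[a-z]+:[a-z]+")

# Keep each code once
unique_codes <- unique(codes)
unique_codes
```

Elements that do not match the pattern come back as NA, which makes it easy to spot malformed identifiers before taking unique values.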

【Discussion】: