【Question title】: Extracting unique numbers from string in R
【Posted】: 2013-06-05 06:25:11
【Question】:

I have a list of strings containing random characters, for example:

list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"

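I would like to know which numbers occur at least once in this list (unique()). The solution for my example is:

Solution: c(7,667,11,5,2)

It would also be useful if someone had a method that treats 11 not as "eleven" but as "one and one". The solution in that case would be:

Solution: c(7,6,1,5,2)

(I found this post on a related topic: Extracting numbers from vectors of strings)

【Comments】:

Tags: r regex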

【Answer 1】:

Take a look at the str_extract_numbers() function from the strex package.

pacman::p_load(strex)
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
charvec <- unlist(list)
print(charvec)
#> [1] "djud7+dg[a]hs667" "7fd*hac11(5)"     "2tu,g7gka5"
str_extract_numbers(charvec)
#> [[1]]
#> [1]   7 667
#> 
#> [[2]]
#> [1]  7 11  5
#> 
#> [[3]]
#> [1] 2 7 5
unique(unlist(str_extract_numbers(charvec)))
#> [1]   7 667  11   5   2

Created on 2018-09-03 by the reprex package (v0.2.0).

【Comments】:

【Answer 2】:

A solution using stringi:

library(stringi)

# extract the numbers (unlist() turns the list into a character vector):
nums <- stri_extract_all_regex(unlist(list), "[0-9]+")

# make a vector and get the unique numbers:
nums <- unlist(nums)
nums <- unique(nums)

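That gives your first solution.

For the second solution, I would use substr. Note that this keeps only the first digit of each extracted number, which is enough for this example but would miss later digits in general (for instance the 7 in 67):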

nums_first <- sapply(nums, function(x) unique(substr(x,1,1)))
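If every digit of every number should count, one option is to split each extracted number into single characters instead of taking only the first one. This is a minimal base R sketch, assuming nums already holds the unique numbers from the first solution as strings; the name digits is just for illustration.

```r
# Assumed input: the unique numbers from the first solution, as strings
nums <- c("7", "667", "11", "5", "2")

# Split every number into its single characters, then deduplicate
digits <- unique(unlist(strsplit(nums, "")))

# Convert the remaining digit strings to numbers
digits <- as.numeric(digits)
print(digits)
```

Unlike substr(x, 1, 1), this also picks up digits that only ever appear in second or later position.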

【Comments】:

Where does stri_extract_all_regex() come from?
The stringi package, as the heading says

【Answer 3】:

A stringr solution with str_match_all and the pipe operator. For the first solution:

library(stringr)
str_match_all(ll, "[0-9]+") %>% unlist %>% unique %>% as.numeric

For the second solution:

str_match_all(ll, "[0-9]") %>% unlist %>% unique %>% as.numeric

(Note: I also named the list ll)

【Comments】:

【Answer 4】:

Here is another answer, this one using gregexpr to find the numbers and regmatches to extract them:

l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")

temp1 <- gregexpr("[0-9]", l)   # Individual digits
temp2 <- gregexpr("[0-9]+", l)  # Numbers with any number of digits

as.numeric(unique(unlist(regmatches(l, temp1))))
# [1] 7 6 1 5 2
as.numeric(unique(unlist(regmatches(l, temp2))))
# [1]   7 667  11   5   2
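The two variants differ only in the pattern, so they can be folded into one helper. This is a sketch only; the function name extract_unique_numbers and the split_digits argument are made up for illustration and are not part of base R.

```r
# Hypothetical helper: returns the unique numbers found in a character vector.
# split_digits = TRUE treats every digit separately (11 becomes 1 and 1).
extract_unique_numbers <- function(x, split_digits = FALSE) {
  pattern <- if (split_digits) "[0-9]" else "[0-9]+"
  # gregexpr() locates all matches, regmatches() pulls them out as strings
  matches <- regmatches(x, gregexpr(pattern, x))
  as.numeric(unique(unlist(matches)))
}

l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
whole_numbers <- extract_unique_numbers(l)
single_digits <- extract_unique_numbers(l, split_digits = TRUE)
```

Calling unique() before as.numeric() deduplicates the matched strings, so "07" and "7" would be kept as two entries; swap the order if they should count as the same number.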

【Comments】:

【Answer 5】:

You could use ?strsplit (as suggested in @Arun's answer to Extracting numbers from vectors (of strings)):

l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")

## split string at non-digits
s <- strsplit(l, "[^[:digit:]]")

## convert strings to numeric ("" become NA)
solution <- as.numeric(unlist(s))

## remove NA and duplicates
solution <- unique(solution[!is.na(solution)])
# [1]   7 667  11   5   2

【Comments】:

【Answer 6】:

For the second answer, you can use gsub to remove everything from the string that is not a digit, and then split the string as follows:

unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))
# [1] 7 6 1 5 2

For the first answer, similarly using strsplit:

unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))
# [1]   7 667  11   5   2

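PS: Don't name your variable list (there is already a built-in function called list). I have named your data ll.

【Comments】:

【Answer 7】:

Use strsplit with the negation of the digit class 0-9 as the pattern, i.e. [^0-9].

For the example you provided, do this: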

tmp <- sapply(list, function (k) strsplit(k, "[^0-9]"))

Then simply take the union of all the "sets" in the list, like so:

tmp <- Reduce(union, tmp)

Then you only have to remove the empty string.
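The steps of this answer, including the final cleanup, might look like the sketch below. It assumes the data is stored in a list called ll (not list), and since strsplit is vectorized it unlists the data first rather than looping with sapply.

```r
# Assumed input: the question's data, stored under the name ll
ll <- list("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")

# Split each string at every non-digit character
tmp <- strsplit(unlist(ll), "[^0-9]")

# Union of all pieces across the strings (this also removes duplicates)
tmp <- Reduce(union, tmp)

# Drop the empty string left over from consecutive non-digits, then convert
result <- as.numeric(tmp[tmp != ""])
print(result)
```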

【Comments】:

Three identical answers within one minute! :D
strsplit is vectorized. You can/should avoid the loop by unlisting the OP's data.