Extract numeric characters from cells in a data frame in R: answers

【Question title】: Extract numeric characters from cells in a data frame in R
【Posted】: 2019-01-24 11:29:22
【Question】:

I am trying to extract the numeric values from a data frame like this:

ID Secc                     col1      col2        col3
 1 Sección 0805601006       1400      1300        85*      
 2 Sección 0805601007       1475      1365        5.0     
 3 Sección 0805601005       760       760         0.0      
 4 Sección 0805601003       1335      1335        0.0      
 5 Sección 0805601002       655       655         0.0      
 6 Sección 0805601004       900       815         85*

I want a "clean" data frame that contains only the numeric characters, like this:

    ID Secc             col1      col2       col3
     1 0805601006       1400      1300       85      
     2 0805601007       1475      1365       5.0     
     3 0805601005       760       760        0.0      
     4 0805601003       1335      1335       0.0      
     5 0805601002       655       655        0.0      
     6 0805601004       900       815        85

I have tried a number of functions, such as extract_numeric, str_replace, and gsub, but I cannot get the result I want.

Does anyone know how to clean my data?

【Comments】:

as.numeric(substr(df$Secc, 8, nchar(df$Secc)))?
Showing the result you expect would be helpful...
Possible duplicate of "Extracting numbers from vectors of strings"

Tags: r extract data-science data-cleaning

【Solution 1】:

Let's think about a more general approach. Numbers can also be negative (-), so the minus sign has to be kept.
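To illustrate why the sign needs special handling, here is a minimal sketch. The vector `x` is made-up sample input (an assumption for illustration, not data from the question). Deleting every non-digit loses both the minus sign and the decimal point:

```r
# Hypothetical sample values, not taken from the question's data
x <- c("85*", "-655", "5.0")

# Naive approach: delete every character that is not a digit
naive <- gsub("\\D", "", x)
naive              # "85" "655" "50"

# The sign of -655 and the decimal point of 5.0 are gone
as.numeric(naive)  # 85 655 50
```

This is the reason a pattern that treats `-` and `.` separately from other non-digit characters is needed.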

I changed the data slightly to include some messier cases.

df1 <- read.table(text="ID Secc                     col1      col2        col3
1 'Sección 0805601006'       1400      1300        85*
2 'Sección 0805601007'       -14rofl75      1365        5.0
3 'Sección 0805601005'       760       760         0.0
4 'Sección 0805601003'       1-3-3-5      1335        0.0
5 'Sección 0805601002'       -655       HEHE-655         0.0
6 'Sección 0805601004'       900       815         85*", header = TRUE, stringsAsFactors = FALSE)

Code:

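fun1 <- function(x) {
    x <- as.character(x)
    # Match each numeric chunk: an optional minus (at string start or after a non-digit), then digits and dots
    ge <- gregexpr("(^-?|(?<=\\D)-)?(\\d\\.?\\d*?)+", x, perl = TRUE)
    # Paste the chunks of each element together, then convert to numeric
    as.numeric(sapply(regmatches(x, ge), paste0, collapse = ""))
}
df1[] <- lapply(df1, fun1)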

Result (note that as.numeric() drops the leading zero in Secc):

#  ID      Secc  col1 col2 col3
#1  1 805601006  1400 1300   85
#2  2 805601007 -1475 1365    5
#3  3 805601005   760  760    0
#4  4 805601003  1335 1335    0
#5  5 805601002  -655 -655    0
#6  6 805601004   900  815   85
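
The output the question asks for keeps the leading zero in Secc, which a numeric column cannot store. A minimal sketch, assuming Secc is an identifier that can stay as text; the names `secc` and `secc_id` are made up for illustration:

```r
# Two sample values in the same shape as the question's Secc column
secc <- c("Sección 0805601006", "Sección 0805601007")

# Strip everything that is not a digit, but keep the result as character
secc_id <- gsub("[^0-9]", "", secc)
secc_id              # "0805601006" "0805601007"

# Converting to numeric is what removes the leading zero
as.numeric(secc_id)  # 805601006 805601007
```

Treating only the measurement columns as numeric and leaving identifier columns as character is one way to get the layout shown in the question.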


【Solution 2】:

You can use readr::parse_number:

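library(readr)
# as.character() guards against columns that are not already character
df1[] <- lapply(df1, function(x) parse_number(as.character(x)))
df1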
#   ID     Secc col1 col2 col3
# 1  1 8.06e+08 1400 1300   85
# 2  2 8.06e+08 1475 1365    5
# 3  3 8.06e+08  760  760    0
# 4  4 8.06e+08 1335 1335    0
# 5  5 8.06e+08  655  655    0
# 6  6 8.06e+08  900  815   85

sapply(df1,class)
#        ID      Secc      col1      col2      col3 
# "numeric" "numeric" "numeric" "numeric" "numeric"

In tidyverse terms, use df1 %>% mutate_all(parse_number)
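
In dplyr 1.0.0 and later, `mutate_all()` is superseded by `across()`. Below is a minimal sketch of the equivalent call, assuming dplyr >= 1.0.0 and readr are installed. The small data frame `df` is made up for illustration, and the `as.character()` wrapper is a defensive assumption for columns that are not already character:

```r
library(dplyr)  # assumed: version 1.0.0 or later, for across()
library(readr)

# Hypothetical two-row sample in the shape of the question's data
df <- data.frame(col1 = c("1400", "760"),
                 col3 = c("85*", "0.0"),
                 stringsAsFactors = FALSE)

# Apply parse_number() to every column
clean <- df %>%
  mutate(across(everything(), ~ parse_number(as.character(.x))))

clean$col3  # 85 0
```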

Here is a way to do it in base R (same output):

df1[] <- lapply(df1, function(x) as.numeric(gsub("(?![\\.-])\\D", "", x, perl = TRUE)))
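
The pattern above removes every non-digit except `.` and `-`. A minimal sketch of what that means on a few sample strings; the vector `x` is made up for illustration, borrowing the messier values from Solution 1. It suggests one caveat: hyphens inside a value are kept, so such values do not convert:

```r
# Hypothetical sample values, including two messy ones
x <- c("85*", "5.0", "HEHE-655", "1-3-3-5")

# Drop every non-digit unless it is a dot or a hyphen
cleaned <- gsub("(?![\\.-])\\D", "", x, perl = TRUE)
cleaned  # "85" "5.0" "-655" "1-3-3-5"

# "1-3-3-5" is not a valid number, so it becomes NA (with a warning)
nums <- suppressWarnings(as.numeric(cleaned))
nums     # 85 5 -655 NA
```

For the clean data in this answer no such values occur, so the simpler pattern is enough there.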

Note: tidyr::extract_numeric would also work, but it is deprecated in favor of readr::parse_number.

Data

df1 <- read.table(text="ID Secc                     col1      col2        col3
1 'Sección 0805601006'       1400      1300        85*      
2 'Sección 0805601007'       1475      1365        5.0     
3 'Sección 0805601005'       760       760         0.0      
4 'Sección 0805601003'       1335      1335        0.0      
5 'Sección 0805601002'       655       655         0.0      
6 'Sección 0805601004'       900       815         85*", header = TRUE, stringsAsFactors = FALSE)
