【Title】: Extracting numbers from a string in a dataframe
【Posted】: 2016-06-12 19:19:23
【Question】:

I am hoping someone can show me a way to extract data from a character vector.

The data frame is as follows:

structure(list(Sensitivity = structure(c(1L, 5L, 4L, 4L, 4L, 
4L, 3L, 5L, 2L), .Label = c("    1.01 [ 0.21, 2.91]", "   89.60 [ 85.56, 92.82]", 
"   92.95 [ 89.43, 95.59]", "   99.66 [ 98.14, 99.99]", "  100.00 [ 98.77, 100.00]"
), class = "factor"), Specificity = structure(c(8L, 1L, 3L, 4L, 
2L, 5L, 6L, 1L, 7L), .Label = c("   27.17 [ 25.15, 29.26]", "   44.96 [ 42.67,   47.26]", 
"   53.31 [ 51.00, 55.61]", "   69.90 [ 67.75, 71.99]", "   70.23 [ 68.08, 72.31]", 
"   90.18 [ 88.73, 91.50]", "   91.70 [ 90.35, 92.92]", "  100.00 [ 99.80, 100.00]"
), class = "factor")), .Names = c("Sensitivity", "Specificity"
), class = "data.frame", row.names = c(NA, -9L))

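Taking the first element of the first column as an example, I would ideally end up with three columns of data containing 1.01, 0.21 and 2.91.

The first and second values are separated by "[", and the second and third by ",". I am not very familiar with grep; I tried using it, but went wrong somewhere!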

【Comments】:

    Tags: regex r dataframe


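    【Solution 1】:

    Here is a regex solution you can try, using str_extract_all from the stringr package. The pattern \\d+\\.\\d+ matches a decimal number: one or more digits, followed by a literal . and then one or more further digits.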

    library(stringr)
    lapply(df, function(col) do.call(rbind, str_extract_all(col, "\\d+\\.\\d+")))
    
    $Sensitivity
          [,1]     [,2]    [,3]    
     [1,] "1.01"   "0.21"  "2.91"  
     [2,] "100.00" "98.77" "100.00"
     [3,] "99.66"  "98.14" "99.99" 
     [4,] "99.66"  "98.14" "99.99" 
     [5,] "99.66"  "98.14" "99.99" 
     [6,] "99.66"  "98.14" "99.99" 
     [7,] "92.95"  "89.43" "95.59" 
     [8,] "100.00" "98.77" "100.00"
     [9,] "89.60"  "85.56" "92.82" 
    
    $Specificity
          [,1]     [,2]    [,3]    
     [1,] "100.00" "99.80" "100.00"
     [2,] "27.17"  "25.15" "29.26" 
     [3,] "53.31"  "51.00" "55.61" 
     [4,] "69.90"  "67.75" "71.99" 
     [5,] "44.96"  "42.67" "47.26" 
     [6,] "70.23"  "68.08" "72.31" 
     [7,] "90.18"  "88.73" "91.50" 
     [8,] "27.17"  "25.15" "29.26" 
     [9,] "91.70"  "90.35" "92.92" 
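Note that the matrices above hold character strings, not numbers. The following is a minimal sketch of one way to get numeric values, assuming base R's regmatches and gregexpr as a stand-in for str_extract_all so that it runs without extra packages. The two-element vector x and the column names are illustrative assumptions, not part of the original answer.

```r
# Hypothetical sample input in the same format as the question
x <- c("    1.01 [ 0.21, 2.91]", "  100.00 [ 98.77, 100.00]")

# Base R stand-in for str_extract_all: all decimal numbers in each string
matches <- regmatches(x, gregexpr("\\d+\\.\\d+", x))

# Stack the per-string matches into a character matrix, one row per string
m <- do.call(rbind, matches)

# Convert from character to double while keeping the matrix dimensions
storage.mode(m) <- "double"

# Illustrative column names (an assumption, not from the original answer)
colnames(m) <- c("estimate", "lower", "upper")
m
```

Changing the storage mode in place keeps the three-column shape, whereas calling as.numeric on the matrix would flatten it to a plain vector.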
    

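    【Comments】:

    • This is really neat. Can I ask what "\\d+\\.\\d+" means? I guess the "." is the decimal point, but what does \\d+ do?
    • \\d+ matches one or more digits: \\d stands for a single digit (equivalent to [0-9]) and + means one or more occurrences.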
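To make the pieces of the pattern concrete, here is a small sketch in base R that compares \\d+ on its own with the full pattern. The sample strings s and s2, and the optional-decimal pattern at the end, are illustrative assumptions rather than part of the thread.

```r
s <- "   89.60 [ 85.56, 92.82]"

# \\d+ alone matches every run of digits, so each decimal is split in two
digits_only <- regmatches(s, gregexpr("\\d+", s))[[1]]
digits_only
# "89" "60" "85" "56" "92" "82"

# \\d+\\.\\d+ requires digits, a literal dot, then digits: whole decimals
decimals <- regmatches(s, gregexpr("\\d+\\.\\d+", s))[[1]]
decimals
# "89.60" "85.56" "92.82"

# Assumption: if some values could lack a decimal part, the fractional
# part can be made optional with a group followed by ?
s2 <- "100 [ 98, 100.5]"
mixed <- regmatches(s2, gregexpr("\\d+(\\.\\d+)?", s2))[[1]]
mixed
# "100" "98" "100.5"
```

The backslashes are doubled because R string literals need \\ to produce a single backslash in the regular expression.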
    【Solution 2】:

    Try this:

    cbind(
     matrix(as.numeric(unlist(strsplit(unlist(strsplit(gsub("]","",
              dat$Sensitivity), ",")),"\\["))),ncol=3,byrow = TRUE)
     ,
     matrix(as.numeric(unlist(strsplit(unlist(strsplit(gsub("]","",
              dat$Specificity), ",")),"\\["))),ncol=3,byrow = TRUE)
    )
    
            [,1]  [,2]   [,3]   [,4]  [,5]   [,6]
     [1,]   1.01  0.21   2.91 100.00 99.80 100.00
     [2,] 100.00 98.77 100.00  27.17 25.15  29.26
     [3,]  99.66 98.14  99.99  53.31 51.00  55.61
     [4,]  99.66 98.14  99.99  69.90 67.75  71.99
     [5,]  99.66 98.14  99.99  44.96 42.67  47.26
     [6,]  99.66 98.14  99.99  70.23 68.08  72.31
     [7,]  92.95 89.43  95.59  90.18 88.73  91.50
     [8,] 100.00 98.77 100.00  27.17 25.15  29.26
     [9,]  89.60 85.56  92.82  91.70 90.35  92.92
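The nested calls in this answer can be easier to follow when unpacked one step at a time. Below is a minimal sketch on a single string; the intermediate names step1 to step3 are hypothetical and only serve the illustration.

```r
x <- "    1.01 [ 0.21, 2.91]"

# 1. Drop the closing bracket
step1 <- gsub("]", "", x)

# 2. Split on the comma: separates the upper bound from the rest
step2 <- unlist(strsplit(step1, ","))

# 3. Split each piece on the opening bracket: separates estimate and lower bound
step3 <- unlist(strsplit(step2, "\\["))

# 4. as.numeric ignores the surrounding whitespace in each piece
values <- as.numeric(step3)
values
# 1.01 0.21 2.91

# With several strings, the flat vector is folded back into rows of three
m <- matrix(values, ncol = 3, byrow = TRUE)
```

Filling the matrix by row matters here because the values for each original string arrive consecutively in the flattened vector.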
    

      【Solution 3】:

      Here is an option using base R that extracts the numeric parts as type numeric:

      lst <- lapply(d1, function(x) read.csv(text=gsub("[][]", ", ", x), header=FALSE)[-4])
      lst
      #$Sensitivity
      #      V1    V2     V3
      #1   1.01  0.21   2.91
      #2 100.00 98.77 100.00
      #3  99.66 98.14  99.99
      #4  99.66 98.14  99.99
      #5  99.66 98.14  99.99
      #6  99.66 98.14  99.99
      #7  92.95 89.43  95.59
      #8 100.00 98.77 100.00
      #9  89.60 85.56  92.82
      
      #$Specificity
      #      V1    V2     V3
      #1 100.00 99.80 100.00
      #2  27.17 25.15  29.26
      #3  53.31 51.00  55.61
      #4  69.90 67.75  71.99
      #5  44.96 42.67  47.26
      #6  70.23 68.08  72.31
      #7  90.18 88.73  91.50
      #8  27.17 25.15  29.26
      #9  91.70 90.35  92.92
      

      If needed, the list of data.frames can be combined into a single data.frame with cbind:

      do.call(cbind, lst)
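As a sketch of how the combined result might be made easier to read, the columns of each parsed data.frame can be renamed before binding. The two-row d1 and the names estimate, lower and upper are assumptions introduced for this illustration, not part of the original answer.

```r
# Hypothetical two-row version of the question's data, stored as factors
d1 <- data.frame(
  Sensitivity = c("    1.01 [ 0.21, 2.91]", "   89.60 [ 85.56, 92.82]"),
  Specificity = c("  100.00 [ 99.80, 100.00]", "   91.70 [ 90.35, 92.92]"),
  stringsAsFactors = TRUE
)

lst <- lapply(d1, function(x) {
  # Turn both brackets into commas so each string parses as a CSV line;
  # the trailing comma creates an empty fourth column, dropped with [-4]
  parsed <- read.csv(text = gsub("[][]", ", ", x), header = FALSE)[-4]
  # Illustrative names (an assumption) in place of V1, V2, V3
  names(parsed) <- c("estimate", "lower", "upper")
  parsed
})

# cbind prefixes each column with the name of its list element
wide <- do.call(cbind, lst)
names(wide)
```

Because lst is a named list, the combined columns carry the original column name as a prefix, which keeps the sensitivity and specificity values distinguishable.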
      
