【Question title】: Summarize from string matches
【Posted】: 2019-02-20 13:38:27
【Question】:

I have this data frame column:

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld", "asdthreeasp", "asdfetwoasd", "fouroqwke","okasdtwo", "acmofour", "porefour", "okstwo"))
> df
          Strings
1  ñlas onepojasd
2       onenañdsl
3     ñelrtwofkld
4     asdthreeasp
5     asdfetwoasd
6       fouroqwke
7        okasdtwo
8        acmofour
9        porefour
10         okstwo

I know that every value in df$Strings will match one of the words one, two, three or four, and that it will match only one of them. So, to match them:

str_detect(df$Strings,"one")
str_detect(df$Strings,"two")
str_detect(df$Strings,"three")
str_detect(df$Strings,"four")
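The premise that each string contains exactly one of the four words can be checked before summarising. The following is a minimal sketch in base R; it swaps in `grepl` for `str_detect` so that no package is needed, and the names `words` and `hits` are only illustrative.

```r
df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

words <- c("one", "two", "three", "four")

# One logical column per word: TRUE where the word occurs in the string
hits <- sapply(words, function(w) grepl(w, df$Strings, fixed = TRUE))

# TRUE only if every row matches exactly one of the words
all(rowSums(hits) == 1)

# Number of strings matching each word
colSums(hits)
```

If the first check returns FALSE, some string matches none or several of the words, and the counts would no longer add up to the number of rows.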

However, I am stuck here, because I am trying to build this table:

Homes  Quantity Percent
  One         2     0.2
  Two         4     0.4
Three         1     0.1
 Four         3     0.3
Total        10       1

【Comments on the question】:

    Tags: r string summary


    【Solution 1】:

    Using tidyverse and janitor, you can do:

    library(tidyverse)
    library(janitor)

    df %>%
     mutate(Homes = str_extract(Strings, "one|two|three|four"),  # matched word
            n = n()) %>%                                          # total number of rows
     group_by(Homes) %>%
     summarise(Quantity = length(Homes),
               Percent = first(length(Homes)/n)) %>%
     adorn_totals("row")                                          # append the "Total" row
    
     Homes Quantity Percent
      four        3     0.3
       one        2     0.2
     three        1     0.1
       two        4     0.4
     Total       10     1.0
    

    Or using only tidyverse:

    library(tidyverse)

    df %>%
     mutate(Homes = str_extract(Strings, "one|two|three|four"),
            n = n()) %>%
     group_by(Homes) %>%
     summarise(Quantity = length(Homes),
               Percent = first(length(Homes)/n)) %>%
     # append the "Total" row by hand
     rbind(., data.frame(Homes = "Total", Quantity = sum(.$Quantity), 
                         Percent = sum(.$Percent)))
    

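    In both cases, the code first extracts the matched word and counts the total number of rows. Second, it groups by the matched word. Third, it computes the number of rows for each word and the proportion of that word among all rows. Finally, it adds a "Total" row.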
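The same four steps can also be sketched with `count()`, which does the grouping and counting in one call. This is only an illustrative variant, assuming the dplyr and stringr packages are installed; the names `counts`, `total` and `result` are not from the original answer.

```r
library(dplyr)
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

counts <- df %>%
  # Step 1: extract the matched word
  mutate(Homes = str_extract(Strings, "one|two|three|four")) %>%
  # Steps 2 and 3: group by the word and count the rows per word
  count(Homes, name = "Quantity") %>%
  # Proportion of each word among all rows
  mutate(Percent = Quantity / sum(Quantity))

# Step 4: build the "Total" row and append it
total <- data.frame(Homes = "Total",
                    Quantity = sum(counts$Quantity),
                    Percent = sum(counts$Percent))
result <- bind_rows(counts, total)
result
```

Computing `Percent` from `sum(Quantity)` avoids carrying the helper column `n` through the pipeline.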

    【Comments】:

      【Solution 2】:

      You can use str_extract and then apply table and prop.table, i.e.

      library(stringr)
      
      str_extract(df$Strings, 'one|two|three|four')
      #[1] "one"   "one"   "two"   "three" "two"   "four"  "two"   "four"  "four"  "two"  
      
      table(str_extract(df$Strings, 'one|two|three|four'))
      # four   one three   two 
      #    3     2     1     4 
      
      prop.table(table(str_extract(df$Strings, 'one|two|three|four')))
      # four   one three   two 
      #  0.3   0.2   0.1   0.4 
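`table` sorts the words alphabetically, whereas the question lists them as one, two, three, four. One way to control the order, sketched here under the assumption that stringr is installed, is to convert the extracted words to a factor with explicit levels before tabulating. The names `words`, `homes`, `counts` and `props` are illustrative.

```r
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

words <- c("one", "two", "three", "four")

# Build the pattern "one|two|three|four" and fix the level order explicitly
homes <- factor(str_extract(df$Strings, paste(words, collapse = "|")),
                levels = words)

counts <- table(homes)        # counts in the order given by `words`
props  <- prop.table(counts)  # proportions in the same order

counts
props
```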
      

      【Comments】:

        【Solution 3】:

        A base R option would be regmatches/regexpr with table

        table(regmatches(df$Strings, regexpr('one|two|three|four', df$Strings)))
        #  four   one three   two 
        #    3     2     1     4 
        

        Add addmargins to get the Sum, then divide by it

        out <- addmargins(table(regmatches(df$Strings, 
             regexpr('one|two|three|four', df$Strings))))
        out/out[length(out)]
        
        # four   one three   two   Sum 
        #  0.3   0.2   0.1   0.4   1.0 
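To get both the counts and the proportions in one table with a Total row, as the question asks, the base R pieces can be assembled into a data frame. This is a rough sketch rather than part of the original answer; the names `m`, `counts` and `result` are illustrative, and it relies on every string containing a match, since `regmatches` silently drops non-matching elements.

```r
df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

# Extract the first matching word from each string
m <- regmatches(df$Strings, regexpr("one|two|three|four", df$Strings))
counts <- table(m)

# One row per word: count and share of all matches
result <- data.frame(Homes    = names(counts),
                     Quantity = as.vector(counts),
                     Percent  = as.vector(counts) / sum(counts))

# Append the "Total" row
result <- rbind(result,
                data.frame(Homes    = "Total",
                           Quantity = sum(result$Quantity),
                           Percent  = sum(result$Percent)))
result
```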
        

        【Comments】:
