【Question title】: Count the number of times (frequency) a string occurs
【Posted】: 2016-08-21 01:11:42
【Question description】:

I have a column in my data frame that looks like this:

   Col1
   ----------------------------------------------------------------------------
   Center for Animal Control, Division of Hypertension, Department of Medicine
   Department of Surgery, Division of Primary Care, Center for Animal Control
   Department of Internal Medicine, Division of Hypertension, Center for Animal Control

How can I count how often each comma-separated string occurs? In other words, this is what I want to end up with:

    Affiliation                         Freq
    ------------------------------------------
    Center for Animal Control           3
    Division of Hypertension            2
    Department of Medicine              1
    Department of Surgery               1
    Division of Primary Care            1
    Department of Internal Medicine     1  

Can someone help me with this?

【Question comments】:

  • Could you post what you have tried so far, and what did not work?

Tags: r count word-frequency


【Solution 1】:

Assumption: Center for Animal Control, Division of Hypertension, Department of Medicine is the value in row 1, Department of Surgery, Division of Primary Care, Center for Animal Control is the value in row 2, and so on.

df is the data frame.

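# Split every row on commas, flatten to one vector, and trim surrounding whitespace.
# Col1 is the column name shown in the question (R names are case-sensitive);
# as.character() guards against the column having been read in as a factor.
aff_val <- trimws(unlist(strsplit(as.character(df$Col1), ",")))

# Tabulate the values and convert the table to a data frame (columns: aff_val, Freq).
ans <- data.frame(table(aff_val))

# Rename the first column to match the desired output.
colnames(ans)[1] <- 'Affiliation'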
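
For anyone who wants to run this end to end, here is a minimal sketch under the same assumption (one record per row). The data frame below is a hypothetical reconstruction of the question's data, and naming the `table()` argument is just a shortcut that replaces the separate `colnames()` step; it is not part of the original answer.

```r
# Hypothetical reconstruction of the question's data: one record per row.
df <- data.frame(
  Col1 = c(
    "Center for Animal Control, Division of Hypertension, Department of Medicine",
    "Department of Surgery, Division of Primary Care, Center for Animal Control",
    "Department of Internal Medicine, Division of Hypertension, Center for Animal Control"
  ),
  stringsAsFactors = FALSE
)

# Split on literal commas, flatten, and trim the leftover spaces.
aff_val <- trimws(unlist(strsplit(df$Col1, ",", fixed = TRUE)))

# Naming the argument makes the first column come out as "Affiliation".
ans <- as.data.frame(table(Affiliation = aff_val), stringsAsFactors = FALSE)
print(ans)
```

With the three rows above, `ans` should have six rows whose `Freq` values add up to the nine entries in the input.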

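【Comments】:

  • This is the same as the answer below.
  • @Gopala Please look at the code carefully. If my assumption is correct, there is absolutely no need for the gsub step you do in your line 2. That is what makes it different.
  • In the input data, a newline (not a comma) separates some of the departments, e.g. Department of Medicine and Department of Surgery. Those will not be picked up by strsplit (on commas).
  • @Gopala I don't think you understood the assumption. I assume the records are stored in separate rows of the data frame. You assume everything is in a single row.
  • Got it. Nice solution!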
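
The disagreement in the comments above comes down to how the data is stored. If everything really did sit in one string with newlines between records (an assumption, not something the question confirms), a rough sketch would be to split on commas and newlines in one pass. The variable names here are made up for illustration.

```r
# Assumed layout: the whole column pasted into a single string,
# with a newline (not a comma) between records.
raw <- "Center for Animal Control, Division of Hypertension, Department of Medicine
Department of Surgery, Division of Primary Care, Center for Animal Control
Department of Internal Medicine, Division of Hypertension, Center for Animal Control"

# Split on either a comma or a newline, then trim surrounding whitespace.
parts <- trimws(unlist(strsplit(raw, "[,\n]")))

# Drop any empty pieces (e.g. from a trailing newline).
parts <- parts[nzchar(parts)]

freq <- table(parts)
print(freq)
```

When each record is already its own row, the character class is harmless: there are simply no newlines to split on.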
【Solution 2】:

Here is one approach. It also replaces '\n' with a comma, since your text contains some newlines.

df <- data.frame(col1 = rep("Center for Animal Control, Division of Hypertension, Department of Medicine, Department of Surgery, Division of Primary Care, Center for Animal Control, Department of Internal Medicine, Division of Hypertension, Center for Animal Control", 1), stringsAsFactors = FALSE)
df$col1 <- gsub('\\n', ', ', df$col1)
as.data.frame(table(unlist(strsplit(df$col1, ', '))))

The output is as follows (based on the original data):

                             Var1 Freq
1       Center for Animal Control    3
2 Department of Internal Medicine    1
3          Department of Medicine    1
4           Department of Surgery    1
5        Division of Hypertension    2
6        Division of Primary Care    1
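
Note that `table()` orders its result alphabetically, whereas the table in the question lists the most frequent affiliation first. A small sketch of one way to reorder it follows; the `aff` vector is a hypothetical stand-in for the already split and trimmed values, and the order among ties will not necessarily match the question's table.

```r
# Hypothetical stand-in for the split and trimmed affiliation values.
aff <- c(
  "Center for Animal Control", "Division of Hypertension", "Department of Medicine",
  "Department of Surgery", "Division of Primary Care", "Center for Animal Control",
  "Department of Internal Medicine", "Division of Hypertension", "Center for Animal Control"
)

freq <- as.data.frame(table(Affiliation = aff), stringsAsFactors = FALSE)

# Highest count first; ties are broken alphabetically by affiliation.
freq <- freq[order(-freq$Freq, freq$Affiliation), ]
rownames(freq) <- NULL
print(freq)
```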

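【Comments】:

  • What is gsub('\\n', ',', df$col1) doing? \n as a newline character does not need a second escape like \\n
  • Actually, this does not give the correct result: "Center for Animal Control" shows up twice. You need to account for the whitespace on either side of the comma, e.g.: data.frame(table(unlist(strsplit(as.character(df$col1), "\\s*,\\s*"))))
  • Bro, if col1 has the value "Division of Hypertension, Department of Medicine, Center for Animal Control", your code will fail: as in your answer, it will create a separate affiliation entry for `Center for Animal Control`
  • Edited to fix the extra-space problem.
  • \\n is still unnecessary, and using \\s* for "zero or more whitespace characters" instead of a lone " " would be safer.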
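
To make the point about whitespace concrete, here is a small sketch comparing a fixed ", " separator with the `\\s*,\\s*` pattern suggested in the comments. The input string is invented for illustration, with deliberately inconsistent spacing around the commas.

```r
# Invented input: no space after the first comma, spaces on both sides of the second.
x <- "Division of Hypertension,Department of Medicine , Center for Animal Control"

# A fixed ", " separator misses the first comma and leaves a trailing space behind.
fixed_split <- unlist(strsplit(x, ", ", fixed = TRUE))

# The regex absorbs any whitespace on either side of each comma.
regex_split <- unlist(strsplit(x, "\\s*,\\s*"))

print(fixed_split)
print(regex_split)
```

The regex only cleans up whitespace next to a comma, so leading or trailing whitespace of the whole string still needs `trimws()`.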
【Solution 3】:

I use scan and trimws for text-processing tasks like this.

inp <- "    Center for Animal Control, Division of Hypertension, Department of Medicine
    Department of Surgery, Division of Primary Care, Center for Animal Control
    Department of Internal Medicine, Division of Hypertension, Center for Animal Control"

> table( trimws(scan(text=inp, what="", sep=",")))
Read 9 items

      Center for Animal Control Department of Internal Medicine 
                              3                               1 
         Department of Medicine           Department of Surgery 
                              1                               1 
       Division of Hypertension        Division of Primary Care 
                              2                               1 

The result can also be wrapped in as.data.frame:

> as.data.frame(table(  trimws(scan(text=inp, what="", sep=","))))
Read 9 items
                             Var1 Freq
1       Center for Animal Control    3
2 Department of Internal Medicine    1
3          Department of Medicine    1
4           Department of Surgery    1
5        Division of Hypertension    2
6        Division of Primary Care    1

【Comments】:

【Solution 4】:

library(stringr)
a <- "Center for Animal Control, Division of Hypertension, Department of Medicine
Department of Surgery, Division of Primary Care, Center for Animal Control
Department of Internal Medicine, Division of Hypertension, Center for Animal Control"
# Read the text as a comma-separated table: one row per line, one column per entry.
con <- textConnection(a)
tbl <- read.table(con, sep = ",")
# Flatten all cells into one vector and trim the leading spaces left after each comma.
vec <- str_trim(unlist(tbl))
as.data.frame(table(vec))

The answer is

1       Center for Animal Control    3
2 Department of Internal Medicine    1
3          Department of Medicine    1
4           Department of Surgery    1
5        Division of Hypertension    2
6        Division of Primary Care    1

【Comments】:

【Solution 5】:

text <- "Center for Animal Control, Division of Hypertension, Department of Medicine
Department of Surgery, Division of Primary Care, Center for Animal Control
Department of Internal Medicine, Division of Hypertension, Center for Animal Control"

library(stringi)
library(dplyr)
library(tidyr)

# tibble() stands in for data_frame(), which dplyr has deprecated.
tibble(text = text) %>%
  # Split the text into lines: one row per line after unnest().
  mutate(line = text %>% stri_split_fixed("\n")) %>%
  unnest(line) %>%
  # Split each line into phrases: one row per phrase after unnest().
  mutate(phrase = line %>% stri_split_fixed(", ")) %>%
  unnest(phrase) %>%
  # Tally how often each phrase occurs.
  count(phrase)
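
If the data is already one record per row, a shorter tidyverse route may be `tidyr::separate_rows()`. This is a different technique from the pipeline above, sketched under the assumption that dplyr and tidyr are installed; the data frame is a hypothetical reconstruction of the question's column.

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the question's data: one record per row.
df <- tibble(
  Col1 = c(
    "Center for Animal Control, Division of Hypertension, Department of Medicine",
    "Department of Surgery, Division of Primary Care, Center for Animal Control",
    "Department of Internal Medicine, Division of Hypertension, Center for Animal Control"
  )
)

counts <- df %>%
  # One row per affiliation; the regex swallows spaces around each comma.
  separate_rows(Col1, sep = "\\s*,\\s*") %>%
  rename(Affiliation = Col1) %>%
  # Count each affiliation, most frequent first.
  count(Affiliation, name = "Freq", sort = TRUE)

print(counts)
```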
      

【Comments】:
