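【Question title】: R cut function for strings
【Posted】: 2015-10-23 12:19:17
【Question】:

Is there a function in R that is similar to cut but works on strings?

The code I am working on assigns data about US states to a categorical variable called Region. There are 4 regions: Northeast, Midwest, South and West. The data frame stores each state's name in a variable called "state", in abbreviated form: for example, "NY" for New York and "MS" for Mississippi. I need to add the region variable to the data frame, and I am currently doing the following (this is for homework, so I want to show that I already have a solution and am looking for a possibly better one):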

create.region <- function(state) {
  northeast <- c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA")
  midwest <- c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD")
  south <- c("DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV", "AL", "KY", "TN", "MS", "AR", "LA", "OK", "TX")
  west <- c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA")
  region <- ifelse(state %in% northeast, "Northeast",
            ifelse(state %in% midwest, "Midwest",
            ifelse(state %in% south, "South",
            ifelse(state %in% west, "West", NA))))
  return(region)
}
birth_data <- within(birth_data, region <- create.region(state))

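I don't know much R yet, and I care about the efficiency of my code. In the past I found the cut function to be a more concise and efficient way to categorize numeric data like this, but it obviously does not work on character vectors. Is there a function similar to cut that allows character assignment rules rather than only numeric ones?

【Comments】:

    Tags: r performance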


    【Answer 1】:

    The simplest approach is to map the names through a named vector.

    First, we prepare the map:

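    # Names of the region vectors (northeast, midwest, south, west) from the
    # question; they must be visible in the calling environment for get() to work.
    all_states = c('northeast', 'midwest', 'south', 'west')
    
    # Build a named vector for one region: names are state codes, values are
    # the (lowercase) region name.
    states_for_region = function (region) {
        states = get(region)
        setNames(rep(region, length(states)), states)
    }
    
    # Concatenate the per-region vectors into a single lookup vector.
    states_map = unlist(lapply(all_states, states_for_region))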
    

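    We could also build states_map by hand for each region and then concatenate the results, but the version above is less repetitive.

    Then we do the actual mapping, which now takes a single line: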

    region = states_map[state]
    

    For efficiency, it is best to prepare the map outside the function. Otherwise it is rebuilt every time the function is called.
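
A minimal sketch of that idea, assuming the region vectors from the question. The names region_states and create_region are illustrative and do not come from the original answer; the map is built once at the top level, and the function only indexes into it:

```r
# Region membership, copied from the question (hypothetical list name).
region_states <- list(
  Northeast = c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"),
  Midwest   = c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD"),
  South     = c("DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV", "AL", "KY", "TN",
                "MS", "AR", "LA", "OK", "TX"),
  West      = c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA")
)

# Build the lookup once, outside any function: names are state codes,
# values are region names.
states_map <- setNames(
  rep(names(region_states), lengths(region_states)),
  unlist(region_states, use.names = FALSE)
)

# The function only indexes into the prebuilt map; codes that are not
# in the map come back as NA.
create_region <- function(state) {
  unname(states_map[as.character(state)])
}

create_region(c("NY", "MS", "CA", "XX"))
```

Indexing by name is vectorised, so the same call works on a whole data frame column, for example birth_data$region <- create_region(birth_data$state).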

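    【Comments】:

      【Answer 2】:

      Out of the box, R includes the variables state.abb and state.region. The former is a character vector of all state abbreviations, and the latter is a 4-level factor of the same length giving the corresponding regions. So, to get the region for MS: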

      state.region[state.abb == "MS"]
      ## [1] South
      ## Levels: Northeast South North Central West
      

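      If you want a different classification, it is easy to define your own replacement for state.region and then use the code above.

      Incidentally, state.name also exists; it has the same length as the two variables above and gives the full state names.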
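
For a whole vector of abbreviations, one way to apply this lookup is match(), which returns the position of each code in state.abb. The sketch below is an assumption about how this might be used and is not part of the original answer; the input vector state and the relabelling of "North Central" as "Midwest" are illustrative:

```r
# Hypothetical input; "DC" is not among the 50 entries of state.abb.
state <- c("NY", "MS", "IL", "CA", "DC")

# match() gives the index of each code in state.abb (NA when absent),
# and that index selects the corresponding region.
region <- state.region[match(state, state.abb)]

# Optionally rename a level to match the labels used in the question.
levels(region)[levels(region) == "North Central"] <- "Midwest"

as.character(region)
```

Codes that are missing from state.abb, such as "DC", come back as NA and would need to be handled separately.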

      【Comments】:

        【Answer 3】:

        You can also use the levels<- function together with a mapping list.

        Here is an example:

        ## Create your mapping....
        ## Overkill in this example as @Grothendieck has pointed out,
        ##   but still applicable in a general scenario
        
        myLevs <- list(
          Northeast = c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"), 
          Midwest = c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD"), 
          South = c("DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV", "AL", "KY", "TN", "MS", "AR", "LA", "OK", "TX"), 
          West = c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA"))
        

        Now, create a sample vector:

        set.seed(1)
        x <- sample(state.abb, 10)
        

        Convert it to a factor vector and change its levels. This can be done in two steps (y <- factor(x); levels(y) <- myLevs) or in one cryptic-looking step:

        y <- `levels<-`(factor(x), myLevs)
        

        Here is the output:

        x
        #  [1] "IN" "ME" "NV" "TX" "GA" "SD" "TN" "NH" "NE" "AZ"
        y
        #  [1] Midwest   Northeast West      South     South     Midwest   South    
        #  [8] Northeast Midwest   West     
        # Levels: Northeast Midwest South West
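
On the one-step form: levels(y) <- myLevs is syntactic sugar for calling the replacement function named levels<-, and the backticks are needed only because that name is not a syntactically valid identifier. A small sketch, using a shortened, hypothetical mapping (lev_map) so that it stays self-contained:

```r
# Shortened mapping for illustration only (not the full list above).
lev_map <- list(Northeast = c("NY", "ME"), South = c("TX", "GA"))
x <- c("NY", "TX", "GA", "ME")

# Two-step form: ordinary replacement syntax.
y1 <- factor(x)
levels(y1) <- lev_map

# One-step form: call the replacement function directly by its quoted name.
y2 <- `levels<-`(factor(x), lev_map)

# Both forms produce the same factor.
identical(y1, y2)
```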
        

        【Comments】:

        • Could you explain the one-step approach in more detail? I don't understand the backticks in the `levels<-` part.