【Question title】: R: Replacing Strings with their Most Common Variant
【Posted】: 2019-11-01 05:33:00
【Question】:

I'm looking to standardise a set of manually entered strings, so that:

index   fruit
1   Apple Pie
2   Apple Pie.
3   Apple. Pie
4   Apple Pie
5   Pear

should look like:

index   fruit
1   Apple Pie
2   Apple Pie
3   Apple Pie
4   Apple Pie
5   Pear

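For my use case, grouping the strings by phonetic sound works fine, but I'm missing the step that replaces the less common strings in each group with the most common one.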

library(tidyverse)  
library(stringdist)

index <- seq(1,5,1)
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

df <- data.frame(index, fruit) %>%
  mutate(grouping = phonetic(fruit)) %>%
  add_count(fruit) %>%
  # Missing Code
  select(index, fruit)

【Comments】:

    Tags: r tidyverse recode


    【Solution 1】:

    Another way could be:

    fruit %>%
     enframe() %>%
     mutate(grouping = phonetic(value)) %>%  # enframe() stores the strings in `value`
     add_count(value, grouping) %>%
     group_by(grouping) %>%
     mutate(value = value[match(max(n), n)]) %>%
     select(-n) %>%
     ungroup()
    
       name value     grouping
      <int> <chr>     <chr>   
    1     1 Apple Pie A141    
    2     2 Apple Pie A141    
    3     3 Apple Pie A141    
    4     4 Apple Pie A141    
    5     5 Pear      P600 
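
To see why `value[match(max(n), n)]` picks the most common spelling, here is a minimal base R sketch for a single phonetic group. The vectors `value` and `n` are typed in by hand to mirror what `add_count()` would give for the A141 group; they are illustrative, not output captured from the pipeline.

```r
# One phonetic group (A141) with the per-spelling counts from add_count()
value <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie")
n     <- c(2, 1, 1, 2)

# match() returns the position of the FIRST element equal to the maximum count
first_max <- match(max(n), n)

# Indexing `value` with that position yields the most frequent spelling,
# which mutate() then recycles across the whole group
most_common <- value[first_max]
most_common
```

If two spellings are equally common, the one that appears first in the group wins.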
    

    【Comments】:

      【Solution 2】:

      It sounds like you need to `group_by` the grouping and then pick the most frequent item (the mode) within each group.

      df %>%
        mutate(grouping = phonetic(fruit)) %>%
        group_by(grouping) %>%
        mutate(fruit = names(which.max(table(fruit))))
      
      # A tibble: 5 x 3
      # Groups:   grouping [2]
        index     fruit grouping
        <dbl>    <fctr>    <chr>
      1     1 Apple Pie     A141
      2     2 Apple Pie     A141
      3     3 Apple Pie     A141
      4     4 Apple Pie     A141
      5     5      Pear     P600
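
One thing to keep in mind with `which.max(table(...))` is how ties are broken. The sketch below is base R only and uses a made-up vector `x` (not from the question) in which two values are equally frequent. It compares the `table()` approach with a `unique()`/`tabulate()` approach that keeps first-seen order.

```r
# Hypothetical input: "a" and "b" both occur twice
x <- c("b", "a", "a", "b")

# table() orders its names alphabetically, and which.max() takes the first
# maximum, so the tie goes to the alphabetically first value
by_table <- names(which.max(table(x)))

# unique() keeps values in order of first appearance, so the tie goes to
# whichever value was seen first
ux <- unique(x)
by_first_seen <- ux[which.max(tabulate(match(x, ux)))]

by_table
by_first_seen
```

Neither rule is more correct; which one suits depends on whether input order carries meaning in your data.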
      

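      【Comments】:

        【Solution 3】:

        We can use `str_remove` to strip the `.` from each string. Note that `str_remove` drops only the first match per string, which is enough here because no entry contains more than one period.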

        library(dplyr)
        library(stringr)
        data.frame(index, fruit) %>% 
            mutate(fruit = str_remove(fruit, "\\."))
        # index     fruit
        #1     1 Apple Pie
        #2     2 Apple Pie
        #3     3 Apple Pie
        #4     4 Apple Pie
        #5     5      Pear
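
If an entry could contain several periods, every match has to be removed, not just the first. A rough base R sketch of the difference, using `sub()` and `gsub()` as stand-ins for `str_remove()` and `str_remove_all()`; the string `"Apple. Pie."` is a made-up example and does not appear in the question's data.

```r
fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

# Hypothetical entry with two periods
messy <- "Apple. Pie."

# sub() removes only the first literal period (fixed = TRUE disables regex)
first_only <- sub(".", "", messy, fixed = TRUE)

# gsub() removes every literal period
all_removed <- gsub(".", "", messy, fixed = TRUE)

# Applied to the question's vector, gsub() gives the desired result
cleaned <- gsub(".", "", fruit, fixed = TRUE)
cleaned
```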
        

        If we need to use `phonetic` and find the most frequent value:

        Mode <- function(x) {
          ux <- unique(x)
          ux[which.max(tabulate(match(x, ux)))]
        }
        
        
        data.frame(index, fruit) %>%
           mutate(grouping = phonetic(fruit)) %>%
           group_by(grouping) %>% 
           mutate(fruit = Mode(fruit))
        # A tibble: 5 x 3
        # Groups:   grouping [2]
        #  index fruit     grouping
        #  <dbl> <fct>     <chr>   
        #1     1 Apple Pie A141    
        #2     2 Apple Pie A141    
        #3     3 Apple Pie A141    
        #4     4 Apple Pie A141    
        #5     5 Pear      P600    
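
Because `Mode()` is a plain function, it can also be used outside a dplyr pipeline. Below is a minimal base R sketch with `ave()`. To keep it free of package dependencies, the grouping key is an assumption: a lower-cased, punctuation-stripped copy of the string stands in for `phonetic()`, so this is not the same grouping as in the answer above.

```r
# Mode() as defined above: the first most frequent value of x
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

fruit <- c("Apple Pie", "Apple Pie.", "Apple. Pie", "Apple Pie", "Pear")

# Assumption: this key replaces phonetic() purely for illustration
key <- tolower(gsub("[[:punct:]]", "", fruit))

# ave() applies Mode() within each key group and returns a vector
# of the same length as the input
standardised <- ave(fruit, key, FUN = Mode)
standardised
```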
        

        【Comments】:

        • Is there any flexibility advantage to creating the `Mode` function?
        • @rsylatian It can be reused as a building block inside other functions.