【Question title】: Issue while splitting a column of a data frame into different columns
【Posted】: 2022-01-09 23:49:58
【Question description】:

Here is a sample of the data frame I am working with.

structure(list(Company.Name = c("Ample Softech System", "Ziff Davis LLC", 
"IIM Kozhikkode", "Perennial", "Irupar Sociedad Cooperativa", 
"md", ""), Job.Title = c("Data Analyst", "Data Analyst", "Data Analyst", 
"Data Analyst", "Data Analyst", "Data Analyst", "Data Analyst"
), Salaries.Reported = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), Location = c("Pune", 
"Pune", "Pune", "Pune", "Pune", "Pune", "Pune"), Salary = c("₹35,563/mo", 
"₹5,21,474/yr", "₹7,64,702/yr", "₹16,123/mo", "₹6,04,401/yr", 
"AFN 1,56,179/yr", "₹23,500/mo")), row.names = 2274:2280, class = "data.frame")

The Salary column contains values in the pattern (Currency_symbol + Figure + periodicity), for example ₹35,563/mo.

I have been trying to split this pattern into separate columns, using the following code.

library(tidyr)
smpl = separate(sample, col = Salary, into = c("Currency_symbol", "Salary_copy"), sep = 1, remove = TRUE, convert = TRUE) # splits off the first character as the currency symbol
smpl
smpl2 = separate(smpl, col = Salary_copy, into = c("Salary_copy", "Periodicity"), sep = -3, remove = TRUE, convert = TRUE) # splits off the last three characters as the periodicity
smpl2

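The problem I am facing is that one row has a three-character currency code (AFN), whereas all the others have a single-character symbol.

As a result, the lines of code above fail to split the pattern into the right columns for that particular row.

If I change the index in the sep argument, all the other rows are affected. How can I solve this?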
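To make the failure concrete, here is a minimal base R sketch of what a fixed split position does to the two kinds of rows. It is not the code from the question: `substr` stands in for `separate(..., sep = 1)`, and the two-element `salary` vector is a hypothetical excerpt of the Salary column ("\u20b9" is the rupee sign written as an escape).

```r
# Two values from the Salary column; "\u20b9" is the rupee sign
salary <- c("\u20b935,563/mo", "AFN 1,56,179/yr")

# Cut after the first character, as a fixed position of 1 would
symbol <- substr(salary, 1, 1)
rest   <- substr(salary, 2, nchar(salary))

# The rupee row splits cleanly; the AFN row keeps only "A" as its symbol
data.frame(symbol, rest)
```

Any single fixed position is wrong for one of the two rows, which is why the split has to depend on the content (digits versus non-digits) rather than on a character index.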

【Comments】:

    Tags: r split data-cleaning


    【Solution 1】:

    A possible solution:

    library(tidyverse)
    
    # df is the data frame from the question (called sample there)
    df %>% 
      separate(Salary, sep="((?<=^\\D)(?=\\d))|((?<=\\D)\\s)", into=str_c("col", 1:2)) %>% 
      separate(col2, sep = "/", into = str_c("col",2:3))
    
    #>                     Company.Name    Job.Title Salaries.Reported Location col1
    #> 2274        Ample Softech System Data Analyst                 1     Pune    ₹
    #> 2275              Ziff Davis LLC Data Analyst                 1     Pune    ₹
    #> 2276              IIM Kozhikkode Data Analyst                 1     Pune    ₹
    #> 2277                   Perennial Data Analyst                 1     Pune    ₹
    #> 2278 Irupar Sociedad Cooperativa Data Analyst                 1     Pune    ₹
    #> 2279                          md Data Analyst                 1     Pune  AFN
    #> 2280                             Data Analyst                 1     Pune    ₹
    #>          col2 col3
    #> 2274   35,563   mo
    #> 2275 5,21,474   yr
    #> 2276 7,64,702   yr
    #> 2277   16,123   mo
    #> 2278 6,04,401   yr
    #> 2279 1,56,179   yr
    #> 2280   23,500   mo
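The separator pattern above has two alternatives: a zero-width position right after a single leading non-digit and right before a digit, or a whitespace character preceded by a non-digit. As a rough illustration (a base R sketch, not part of the answer), `regexpr` with `perl = TRUE` can show where the first match lands in each kind of string; `salary`, `split_at` and `split_len` are names made up for this example.

```r
# Two values from the Salary column; "\u20b9" is the rupee sign
salary  <- c("\u20b935,563/mo", "AFN 1,56,179/yr")
pattern <- "((?<=^\\D)(?=\\d))|((?<=\\D)\\s)"

# Locate the first match of the separator in each string
m <- regexpr(pattern, salary, perl = TRUE)
split_at  <- as.integer(m)            # character position of the match
split_len <- attr(m, "match.length")  # 0 means a zero-width match

data.frame(salary, split_at, split_len)
```

For the rupee row the match is zero-width (nothing is consumed between the symbol and the first digit), while for the AFN row the match is the space itself, so the space is dropped from the result.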
    

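    【Comments】:

      【Solution 2】:

      Another solution, using extract and a simpler regular expression. The extra mutate step strips spaces from the currency and removes the commas from the salary amount so that it can be converted to a number.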

      df2 <- df %>% 
        extract(Salary, c('currency', 'amount', 'period'), '^(\\D+)([0-9,]+)/(.*)') %>% 
        mutate(
          currency = gsub(' ', '', currency),
          amount = as.numeric(gsub(',', '', amount))
        )
      
                          Company.Name    Job.Title Salaries.Reported Location currency amount period
      2274        Ample Softech System Data Analyst                 1     Pune        ₹  35563     mo
      2275              Ziff Davis LLC Data Analyst                 1     Pune        ₹ 521474     yr
      2276              IIM Kozhikkode Data Analyst                 1     Pune        ₹ 764702     yr
      2277                   Perennial Data Analyst                 1     Pune        ₹  16123     mo
      2278 Irupar Sociedad Cooperativa Data Analyst                 1     Pune        ₹ 604401     yr
      2279                          md Data Analyst                 1     Pune      AFN 156179     yr
      2280                             Data Analyst                 1     Pune        ₹  23500     mo
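For readers without tidyr at hand, the same capture-group idea can be sketched in base R with `regexec` and `regmatches`. This is an illustrative equivalent under the assumption that every string matches the pattern, not the answerer's code; `salary`, `parts` and `parsed` are hypothetical names, and only three of the sample values are used.

```r
# Three values from the Salary column; "\u20b9" is the rupee sign
salary <- c("\u20b935,563/mo", "\u20b95,21,474/yr", "AFN 1,56,179/yr")

# Same pattern as in extract(): non-digits, digits and commas, "/", the rest
pattern <- "^(\\D+)([0-9,]+)/(.*)"

# Each match yields the full match followed by the three capture groups;
# rbind() assumes every string matched, otherwise the rows would not line up
parts <- do.call(rbind, regmatches(salary, regexec(pattern, salary)))

parsed <- data.frame(
  currency = trimws(parts[, 2]),                                  # drop the space after "AFN"
  amount   = as.numeric(gsub(",", "", parts[, 3], fixed = TRUE)), # "5,21,474" -> 521474
  period   = parts[, 4],
  stringsAsFactors = FALSE
)
parsed
```

Because `\\D+` matches any run of non-digits, the currency part can be one character or several, which is what makes this approach independent of the symbol's length.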
      

      【Comments】:
