【Question title】: Split a part of a dataframe in R [duplicate]
【Posted】: 2016-10-10 20:11:37
【Question description】:

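I am trying to split a data frame column into multiple columns based on a delimiter. My data frame has a single column that looks like this: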

A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG

I want a data frame with 6 columns, namely Site, File, Variable, Timestamp, Value and Comment, like this:

Site File Variable Timestamp Value Comment
A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG

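I tried to do this with the separate function from the tidyr package, since the fields in each observation are delimited by spaces. The problem is that the comments themselves contain spaces, and I don't want the comments to be split. Is there a way to do this? Any help would be much appreciated. Thanks!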

【Comments on the question】:

Tags: r


【Solution 1】:

This looks like a ragged fixed-width file, so:

library(readr)
library(dplyr) # glimpse() is exported by dplyr, not readr

# Positions counted from the sample rows as shown (fields separated by single spaces);
# end = NA lets the last column run to the end of each line.
pos <- fwf_positions(start = c(1, 10, 13, 19, 36, 41), end = c(8, 11, 17, 34, 39, NA))
df <- read_fwf(file = "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG", col_positions = pos)
glimpse(df)
# Observations: 3
# Variables: 6
# $ X1 <chr> "A0017493", "A0017493", "A0017496"
# $ X2 <chr> ".A", ".A", ".A"
# $ X3 <dbl> 11.86, 11.86, 11.82
# $ X4 <chr> "23:59_10/10/2016", "23:59_10/11/2016", "23:59_11/12/2016"
# $ X5 <dbl> 1.00, 1.15, 2.06
# $ X6 <chr> "SURVEYED", "DATALOGGER CHANGED", "READING IS WRONG"
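
Hand-counted column positions are easy to get wrong by one, so it can help to derive them from a sample line and check the slices before calling read_fwf. The following is a minimal base R sketch of that check. It assumes the fields are separated by single spaces exactly as in the sample data; the variable names (line, sp, starts, ends, fields) are illustrative only.

```r
# One sample line copied from the question's data.
line <- "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED"

# Positions of the first five spaces, i.e. the separators before the comment.
sp <- gregexpr(" ", line, fixed = TRUE)[[1]][1:5]

# Each field starts one character after a separator and ends one before the next;
# the sixth field (the comment) runs to the end of the line.
starts <- c(1, sp + 1)
ends   <- c(sp - 1, nchar(line))

# substring() is vectorised over the start and end positions.
fields <- substring(line, starts, ends)
print(fields)
```

If every element of fields is a whole value, the same starts and ends (with NA as the last end) are reasonable candidates for fwf_positions.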

【Discussion】:

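    【Solution 2】:

    Another tidyverse answer, this time using tidyr::separate.

    Note that every field is separated by a space except the last one, the comment, which may itself contain spaces. In that case we can split on whitespace only up to the number of columns we know we have.

    tidyr::separate takes an extra argument that handles this use case: extra = "merge".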

    library(tidyverse)
    
    data.raw = "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
    A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
    A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG"
    
    data = read_csv(data.raw, col_names = "Col1")
    
    data %>%
        separate(Col1, into = c("Site", "File", "Variable", "Timestamp", "Value", "Comment"), sep = "\\s", extra = "merge") %>%
        type_convert() %>%
        head()
    
    #> # A tibble: 3 x 6
    #>       Site  File Variable        Timestamp Value            Comment
    #>      <chr> <chr>    <dbl>            <chr> <dbl>              <chr>
    #> 1 A0017493    .A    11.86 23:59_10/10/2016  1.00           SURVEYED
    #> 2 A0017493    .A    11.86 23:59_10/11/2016  1.15 DATALOGGER CHANGED
    #> 3 A0017496    .A    11.82 23:59_11/12/2016  2.06   READING IS WRONG
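
The same "split only the first five separators" idea can be sketched in base R with a regular expression that captures five whitespace-free fields and then everything that remains. This is an alternative illustration, not part of the original answer; the names lines, pattern, proto and parsed are made up for the example, and it assumes R 3.4.0 or later, where utils::strcapture is available.

```r
lines <- c(
  "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED",
  "A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED",
  "A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG"
)

# Five groups without whitespace, then one group that takes the rest of the line.
pattern <- "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"

# Zero-row prototype: gives the column names and the type each capture is coerced to.
proto <- data.frame(
  Site = character(), File = character(), Variable = numeric(),
  Timestamp = character(), Value = numeric(), Comment = character(),
  stringsAsFactors = FALSE
)

parsed <- utils::strcapture(pattern, lines, proto, perl = TRUE)
print(parsed)
```

On R versions before 4.0 the character columns may come back as factors, so wrapping them in as.character() is a safe precaution.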
    

    【Discussion】:

    • Hi Michael - I used the statement newdata
    【Solution 3】:

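    We can use the tidyverse packages to do what you want. The key is to split each row on the ' ' character and then glue the comment pieces back together. This assumes your original data is in a data frame called df that has a single column called V1. Note that rbind.fill comes from the plyr package, which library(tidyverse) does not attach.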

    library(tidyverse)
    
    df.new <- strsplit(as.character(df$V1), split = ' ') %>% # split each row into a character vector contained in a list
        lapply(function(x) data.frame(rbind(x), stringsAsFactors = FALSE)) %>% # turn each vector into a one-row data frame (columns X1, X2, ...)
        plyr::rbind.fill() %>% # glue together the ragged rows, padding short rows with NA (rbind.fill is from plyr)
        unite('Comment', -(X1:X5), sep = ' ') %>% # recombine every column that is NOT one of the first 5 (i.e., combine comment columns)
        mutate(Comment = gsub('( NA)+$', '', Comment)) %>% # get rid of the trailing 'NA' strings left by the padding
        rename(Site = X1, File = X2, Variable = X3, Timestamp = X4, Value = X5) %>% # relabel columns
        mutate_all(as.character) %>% type_convert() # convert columns to appropriate formats
    
          Site File Variable        Timestamp Value            Comment
    1 A0017493   .A    11.86 23:59_10/10/2016  1.00           SURVEYED
    2 A0017493   .A    11.86 23:59_10/11/2016  1.15 DATALOGGER CHANGED
    3 A0017496   .A    11.82 23:59_11/12/2016  2.06   READING IS WRONG
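
The split-then-recombine step can also be sketched in base R without padding the rows with NA first: keep the first five pieces of each row and paste whatever is left back into one comment string. This is an illustrative variant under the same single-space assumption, not the answer's own code; lines, parts, rows and out are hypothetical names.

```r
lines <- c(
  "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED",
  "A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED",
  "A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG"
)

# Split every row on single spaces; rows end up with 6, 7 or 8 pieces.
parts <- strsplit(lines, " ", fixed = TRUE)

# Keep pieces 1-5 as they are and rejoin the remaining pieces into the comment.
rows <- lapply(parts, function(p) {
  c(p[1:5], paste(p[-(1:5)], collapse = " "))
})

# Every row now has exactly six elements, so they bind into a rectangular table.
out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- c("Site", "File", "Variable", "Timestamp", "Value", "Comment")

# Let R guess column types; as.is = TRUE keeps text as character, not factor.
out[] <- lapply(out, type.convert, as.is = TRUE)
str(out)
```

Because no NA padding is introduced, there is nothing to strip from the comments afterwards.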
    

    【Discussion】:
