【Question title】: Reshaping a dataframe with NA values in R
【Posted】: 2020-06-13 08:51:16
【Question description】:

I have a data frame that contains NA values:

 df <- data.frame("About" = c("Ram","Std 8",NA,NA,NA,"John", "Std 9", NA, NA,NA,NA),
                 "Questions" = c(NA,NA,"Q1","Q2","Q3",NA,NA,"Q1","Q2","Q3","Q4"),
                 "Ratings" = c(NA,NA,7,7,7,NA,NA,7,7,7,7), stringsAsFactors = FALSE)

The expected output is as follows:

 expectedOutput <- data.frame("About" = c("Ram","John"),
                             "Standard" = c("Std 8", "Std 9"),
                             "Q1" = c(7,7),
                             "Q2" = c(7,7),
                             "Q3" = c(7,7),
                             "Q4" = c(0,7))

I tried to achieve this with the reshape function:

DataTransform <- reshape(df, idvar = "About", v.names = "Ratings", timevar = "Questions", direction = "wide")

Can anyone help me reshape the given data frame into the expected output?

Thanks in advance!

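【Question comments】:

  • Your data format looks rather complicated. You have, for example, the name "Ram", then a sort of heading "Std 8", and then the values for each question. How did you generate the data, and do you have other export options?
  • This is not a reshape. It is a data manipulation problem that has to be solved by handling each required column in turn.
  • This is the raw data as stored in the backend. I am trying to bring it into the required format, but I cannot get the expected result.

Tags: r dataframe reshape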


【Solution 1】:

A base R approach:

df2 <- df  # work on a copy so the original df stays unchanged

Create a new column, Standard, in which every NA is filled with the last non-NA value that precedes it:

df2$Standard <- na.omit(df[,1])[cumsum(!is.na(df[,1]))] 
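
To see why this index trick carries the last non-NA value forward, here is a minimal sketch on a toy vector (the vector x and the names idx and filled are made up for illustration). cumsum(!is.na(x)) counts how many non-NA values have been seen so far, and that count is then used to index into the non-NA values. The sketch assumes the first element is not NA; a leading NA would produce a zero index, which R drops silently, so the result would come out shorter than the input.

```r
# Toy vector (made up for illustration); the first element must not be NA
x <- c("a", NA, NA, "b", NA)

# Running count of non-NA values seen so far: 1 1 1 2 2
idx <- cumsum(!is.na(x))

# Use that count to look up the matching non-NA value
filled <- na.omit(x)[idx]
print(filled)
#> [1] "a" "a" "a" "b" "b"
```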

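Similarly, set the About entries that contain Std to NA, fill the NA values in About with the last non-NA value, and keep only the rows that have a rating. The result is finaldf.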

df2[grepl("Std",df2[,1]),1] <- NA
df2[,1] <- na.omit(df2[,1])[cumsum(!is.na(df2[,1]))] 
finaldf <- df2[!is.na(df2[,"Ratings"]),]

   About Questions Ratings Standard
3    Ram        Q1       7    Std 8
4    Ram        Q2       7    Std 8
5    Ram        Q3       7    Std 8
8   John        Q1       7    Std 9
9   John        Q2       7    Std 9
10  John        Q3       7    Std 9
11  John        Q4       7    Std 9

From here it is largely the same reshape() call you already tried:

out <- reshape(finaldf, idvar = "About", v.names = "Ratings", timevar = "Questions", direction = "wide")
out[is.na(out)] <- 0
colnames(out) <- c("About","Standard","Q1","Q2","Q3","Q4")

which gives:

  About Standard Q1 Q2 Q3 Q4
3   Ram    Std 8  7  7  7  0
8  John    Std 9  7  7  7  7
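
Hard-coding the column names works here but depends on the column order. As a sketch of one alternative, the "Ratings." prefix that reshape() adds can be stripped with sub(), and the zero fill can be limited to the rating columns. The finaldf below is rebuilt by hand from the printed output above so that the block runs on its own, and rating_cols is a name made up for illustration.

```r
# finaldf rebuilt by hand from the printed output above
finaldf <- data.frame(
  About     = c(rep("Ram", 3), rep("John", 4)),
  Questions = c("Q1", "Q2", "Q3", "Q1", "Q2", "Q3", "Q4"),
  Ratings   = rep(7, 7),
  Standard  = c(rep("Std 8", 3), rep("Std 9", 4)),
  stringsAsFactors = FALSE
)

out <- reshape(finaldf, idvar = "About", v.names = "Ratings",
               timevar = "Questions", direction = "wide")

# Strip the "Ratings." prefix instead of listing every name by hand
names(out) <- sub("^Ratings\\.", "", names(out))

# Fill only the rating columns, leaving the character columns untouched
rating_cols <- setdiff(names(out), c("About", "Standard"))
out[rating_cols][is.na(out[rating_cols])] <- 0
print(out)
```

This keeps working if more questions (Q5, Q6, and so on) show up in the data, since no column name is typed out.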

【Comments】:

    【Solution 2】:

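    Here is a concise tidyverse approach. It relies on two assumptions:

    1. The row right after a student's name always holds a string containing "Std". (If other patterns occur, you can extend this approach by adding them to the str_detect call.)

    2. All other rows of About are NA.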
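
Before relying on these two assumptions, it can help to check them against the data. A rough base R sketch follows. The helper names about, is_std, is_name and next_is_std are made up here, and the check treats any non-NA About value that does not contain "Std" as a student name.

```r
df <- data.frame("About" = c("Ram","Std 8",NA,NA,NA,"John","Std 9",NA,NA,NA,NA),
                 "Questions" = c(NA,NA,"Q1","Q2","Q3",NA,NA,"Q1","Q2","Q3","Q4"),
                 "Ratings" = c(NA,NA,7,7,7,NA,NA,7,7,7,7), stringsAsFactors = FALSE)

about   <- df$About
is_std  <- !is.na(about) & grepl("Std", about)   # rows like "Std 8"
is_name <- !is.na(about) & !is_std               # rows like "Ram"

# Assumption 1: the row right after each name is a "Std" row
next_is_std <- c(is_std[-1], FALSE)
stopifnot(all(next_is_std[is_name]))

# Assumption 2: rows that carry a question have NA in About
stopifnot(all(is.na(about[!is.na(df$Questions)])))
```

If either stopifnot() call fails, the pipeline below would silently mislabel rows, so it is worth running this check first on real data.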

    Also, judging by your expected output, you seem to want the rating of a missing question to be treated as 0. If you prefer NA, drop the values_fill argument from pivot_wider.

    library(tidyverse)
    
    df <- data.frame("About" = c("Ram","Std 8",NA,NA,NA,"John", "Std 9", NA, NA,NA,NA),
                    "Questions" = c(NA,NA,"Q1","Q2","Q3",NA,NA,"Q1","Q2","Q3","Q4"),
                    "Ratings" = c(NA,NA,7,7,7,NA,NA,7,7,7,7), stringsAsFactors = FALSE)
    
    df %>%
      mutate(About = ifelse(str_detect(lead(About), "Std") & !is.na(About),
                           paste(About, lead(About)),
                           NA)) %>%
      fill(About) %>% 
      drop_na(Questions) %>% 
      pivot_wider(names_from = Questions,
                  values_from = Ratings,
                  values_fill = 0
      )
    
    #> # A tibble: 2 x 5
    #>   About         Q1    Q2    Q3    Q4
    #>   <chr>      <dbl> <dbl> <dbl> <dbl>
    #> 1 Ram Std 8      7     7     7     0
    #> 2 John Std 9     7     7     7     7
    

    Created on 2020-06-13 by the reprex package (v0.3.0)

    【Comments】:

      【Solution 3】:

      Before using reshape or pivot_wider, we need to bring the data into a shape that suits that transformation.

      library(tidyverse) #for all the awesome packages
      library(janitor) #to clean names
      
      
      df <- data.frame("About" = c("Ram","Std 8",NA,NA,NA,"John", "Std 9", NA, NA,NA,NA),
                       "Questions" = c(NA,NA,"Q1","Q2","Q3",NA,NA,"Q1","Q2","Q3","Q4"),
                       "Ratings" = c(NA,NA,7,7,7,NA,NA,7,7,7,7), stringsAsFactors = FALSE)
      
      df %>%
        as_tibble() -> df # I like to work with tibble
      
      df
      #> # A tibble: 11 x 3
      #>    About Questions Ratings
      #>    <chr> <chr>       <dbl>
      #>  1 Ram   <NA>           NA
      #>  2 Std 8 <NA>           NA
      #>  3 <NA>  Q1              7
      #>  4 <NA>  Q2              7
      #>  5 <NA>  Q3              7
      #>  6 John  <NA>           NA
      #>  7 Std 9 <NA>           NA
      #>  8 <NA>  Q1              7
      #>  9 <NA>  Q2              7
      #> 10 <NA>  Q3              7
      #> 11 <NA>  Q4              7
      
      
      # Shift the non-NA values in each row to the left, so that one column becomes empty and can be dropped
      
      t(apply(df, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))) -> df[] 
      
      df
      #> # A tibble: 11 x 3
      #>    About Questions Ratings
      #>    <chr> <chr>     <chr>  
      #>  1 Ram    <NA>     <NA>   
      #>  2 Std 8  <NA>     <NA>   
      #>  3 Q1    " 7"      <NA>   
      #>  4 Q2    " 7"      <NA>   
      #>  5 Q3    " 7"      <NA>   
      #>  6 John   <NA>     <NA>   
      #>  7 Std 9  <NA>     <NA>   
      #>  8 Q1    " 7"      <NA>   
      #>  9 Q2    " 7"      <NA>   
      #> 10 Q3    " 7"      <NA>   
      #> 11 Q4    " 7"      <NA>
      
      
      df %>% 
        clean_names() %>%  # no capitals
        dplyr::select(-ratings) %>% # removing the extra columns
        mutate(questions = questions %>% parse_number()) -> df1 # make the second column numeric
      
      
      df1
      #> # A tibble: 11 x 2
      #>    about questions
      #>    <chr>     <dbl>
      #>  1 Ram          NA
      #>  2 Std 8        NA
      #>  3 Q1            7
      #>  4 Q2            7
      #>  5 Q3            7
      #>  6 John         NA
      #>  7 Std 9        NA
      #>  8 Q1            7
      #>  9 Q2            7
      #> 10 Q3            7
      #> 11 Q4            7
      
      # This for loop builds a vector for the name column, which is bound to the data frame further down
      
      name <- as.character()
      for(i in 1:nrow(df1)){
      
        if(is.na(df1[i,2])){
          if(is.na(df1[i+1,2])){
            name <- c(name , as.character(df1[i,1]))
          } else {
            name <- c(name, NA)
          }
        } else {
          name <- c(name, NA)
        }
      
      }
      
      name 
      #>  [1] "Ram"  NA     NA     NA     NA     "John" NA     NA     NA     NA    
      #> [11] NA
      
      name %>% 
        enframe(name = NULL, value = "name") -> name_df #converting vector to tibble
      
      name_df 
      #> # A tibble: 11 x 1
      #>    name 
      #>    <chr>
      #>  1 Ram  
      #>  2 <NA> 
      #>  3 <NA> 
      #>  4 <NA> 
      #>  5 <NA> 
      #>  6 John 
      #>  7 <NA> 
      #>  8 <NA> 
      #>  9 <NA> 
      #> 10 <NA> 
      #> 11 <NA>
      
      df1 %>% 
        bind_cols(name_df)%>% #binding the new column to the original df
        mutate(std = ifelse(is.na(questions) & is.na(name), about, NA)) %>% # mutating a new column for standard
        fill(name) %>% # this will fill the NA with non NA previous value
        fill(std) %>% 
        drop_na(questions) %>% # dropping unnecessary rows
        pivot_wider(names_from = "about", values_from = "questions") -> final_df # now I can use pivot_wider to get the expected result
      
      final_df
      #> # A tibble: 2 x 6
      #>   name  std      Q1    Q2    Q3    Q4
      #>   <chr> <chr> <dbl> <dbl> <dbl> <dbl>
      #> 1 Ram   Std 8     7     7     7    NA
      #> 2 John  Std 9     7     7     7     7
      

      Created on 2020-06-13 by the reprex package (v0.3.0)
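
The for loop in this answer marks a row as a name row when its own value and the next row's value are both NA. The same rule can be sketched without a loop in base R. The vectors about and questions below are copied from the df1 output shown above, and the names is_na and next_is_na are made up for illustration.

```r
# Columns of df1, copied from the printed output above
about     <- c("Ram", "Std 8", "Q1", "Q2", "Q3", "John", "Std 9", "Q1", "Q2", "Q3", "Q4")
questions <- c(NA, NA, 7, 7, 7, NA, NA, 7, 7, 7, 7)

is_na      <- is.na(questions)
# Shift by one row; the last row has no successor, which the loop reads as NA
next_is_na <- c(is_na[-1], TRUE)

# A name row is an NA row followed by another NA row
name <- ifelse(is_na & next_is_na, about, NA_character_)
print(name)
#>  [1] "Ram"  NA     NA     NA     NA     "John" NA     NA     NA     NA
#> [11] NA
```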

      【Comments】:
