【Question title】: Mutate returns data in wrong order in dplyr: is this a bug?
【Posted】: 2020-01-02 17:23:29
【Question】:

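I have run into a problem where mutate() in dplyr returns results in the wrong order. My call to mutate uses data from an existing column as input, but the returned values are arranged as if the data had been sorted before the mutate.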

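My specific problem uses the dataRetrieval package to fetch USGS/NWIS data from the web. In this example, I retrieve station names based on station IDs. In the dataRetrieval package, station IDs are numeric codes stored as character.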

library(dataRetrieval)
library(dplyr)

Gauges <- tibble( Name = c("Twisp", "Chewuch", "Andrews" ,"Met@Winthrop", "Met@Twisp", "Met@Pateros", "Met@Goat"),
                  ID = c("12448998" , "12448000","12447390", "12448500" ,"12449500","12449950" , "12447383")
)

## This works correctly with each of the station numbers
readNWISsite(Gauges$ID[1])$station_nm
# [1] "TWISP RIVER NEAR TWISP, WA"

## This does not work correctly
## Order is not right! Station does not correspond with ID  !!
Gauges%>%
      mutate(Station = readNWISsite(ID)$station_nm)

# # A tibble: 7 x 3
# Name         ID       Station                                      
# <chr>        <chr>    <chr>                                        
# 1 Twisp        12448998 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
# 2 Chewuch      12448000 ANDREWS CREEK NEAR MAZAMA, WA                
# 3 Andrews      12447390 CHEWUCH RIVER AT WINTHROP, WA                
# 4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
# 5 Met@Twisp    12449500 TWISP RIVER NEAR TWISP, WA                   
# 6 Met@Pateros  12449950 METHOW RIVER AT TWISP, WA                    
# 7 Met@Goat     12447383 METHOW RIVER NEAR PATEROS, WA    

## This works, returning the correct site associated with the gauge number
Gauges%>%
      arrange(ID) %>%
      mutate(Station = readNWISsite(ID)$station_nm)
# # A tibble: 7 x 3
# Name         ID       Station                                      
# <chr>        <chr>    <chr>                                        
# 1 Met@Goat     12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
# 2 Andrews      12447390 ANDREWS CREEK NEAR MAZAMA, WA                
# 3 Chewuch      12448000 CHEWUCH RIVER AT WINTHROP, WA                
# 4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
# 5 Twisp        12448998 TWISP RIVER NEAR TWISP, WA                   
# 6 Met@Twisp    12449500 METHOW RIVER AT TWISP, WA                    
# 7 Met@Pateros  12449950 METHOW RIVER NEAR PATEROS, WA  

Why does mutate rearrange the data in the middle of the process? Or, what is actually happening here?

【Question comments】:

    Tags: r dplyr


    【Answer 1】:

    To understand what is happening, do not extract only 'station_nm'; also get 'site_no'.

    library(dplyr)
    library(dataRetrieval)
    readNWISsite(Gauges$ID)[c('site_no', 'station_nm')]
    #site_no                                    station_nm
    #1 12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
    #2 12447390                 ANDREWS CREEK NEAR MAZAMA, WA
    #3 12448000                 CHEWUCH RIVER AT WINTHROP, WA
    #4 12448500                  METHOW RIVER AT WINTHROP, WA
    #5 12448998                    TWISP RIVER NEAR TWISP, WA
    #6 12449500                     METHOW RIVER AT TWISP, WA
    #7 12449950                 METHOW RIVER NEAR PATEROS, WA
    

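    Here, 'site_no' is ordered by the integer value of 'ID', not by the order in which the IDs were passed in. mutate then assigns the returned 'station_nm' vector by position, so the names no longer line up with the rows. To correct this, we can use rowwise to apply the function to each 'ID' one at a time.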

    Gauges %>% 
        rowwise() %>% 
        mutate(Station = readNWISsite(ID)$station_nm)
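
To see both the misalignment and the rowwise fix without calling the web service, here is a minimal sketch. `lookup_sorted()` and `Gauges_demo` are hypothetical stand-ins, not part of dataRetrieval; the function is only assumed to mimic what readNWISsite does here, returning one row per site sorted by 'site_no'.

```r
library(dplyr)

# Hypothetical stand-in for readNWISsite(): returns one row per site,
# sorted by site_no regardless of the order of the input IDs.
lookup_sorted <- function(ids) {
  sites <- data.frame(
    site_no    = c("12448998", "12448000", "12447390"),
    station_nm = c("TWISP RIVER NEAR TWISP, WA",
                   "CHEWUCH RIVER AT WINTHROP, WA",
                   "ANDREWS CREEK NEAR MAZAMA, WA"),
    stringsAsFactors = FALSE
  )
  hits <- sites[sites$site_no %in% ids, ]
  hits[order(hits$site_no), ]
}

Gauges_demo <- tibble(
  Name = c("Twisp", "Chewuch", "Andrews"),
  ID   = c("12448998", "12448000", "12447390")
)

# Vectorised call: the sorted result is assigned by position,
# so station names and IDs no longer line up.
wrong <- Gauges_demo %>%
  mutate(Station = lookup_sorted(ID)$station_nm)

# Row-by-row call: each lookup sees a single ID, so the order cannot change.
right <- Gauges_demo %>%
  rowwise() %>%
  mutate(Station = lookup_sorted(ID)$station_nm) %>%
  ungroup()  # drop the rowwise grouping so later verbs behave as usual
```

The trailing `ungroup()` is an addition to the answer's code: a rowwise tibble stays rowwise, which can surprise later `mutate()` or `summarise()` calls.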
    

    Or use map from purrr:

    library(purrr)
    Gauges %>%
        mutate(Station = map_chr(ID, ~ readNWISsite(.x)$station_nm))
    # A tibble: 7 x 3
    #  Name         ID       Station                                      
    #  <chr>        <chr>    <chr>                                        
    #1 Twisp        12448998 TWISP RIVER NEAR TWISP, WA                   
    #2 Chewuch      12448000 CHEWUCH RIVER AT WINTHROP, WA                
    #3 Andrews      12447390 ANDREWS CREEK NEAR MAZAMA, WA                
    #4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
    #5 Met@Twisp    12449500 METHOW RIVER AT TWISP, WA                    
    #6 Met@Pateros  12449950 METHOW RIVER NEAR PATEROS, WA                
    #7 Met@Goat     12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
    

    Or we extract both columns and use match on 'ID' and 'site_no' to put the station names back in the original row order:

    Gauges %>% 
              mutate(Station = {
               tmp <- readNWISsite(ID)[c('site_no', 'station_nm')]
                  tmp$station_nm[match(ID, tmp$site_no)]})
    # A tibble: 7 x 3
    #  Name         ID       Station                                      
    #  <chr>        <chr>    <chr>                                        
    #1 Twisp        12448998 TWISP RIVER NEAR TWISP, WA                   
    #2 Chewuch      12448000 CHEWUCH RIVER AT WINTHROP, WA                
    #3 Andrews      12447390 ANDREWS CREEK NEAR MAZAMA, WA                
    #4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
    #5 Met@Twisp    12449500 METHOW RIVER AT TWISP, WA                    
    #6 Met@Pateros  12449950 METHOW RIVER NEAR PATEROS, WA                
    #7 Met@Goat     12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
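
A closely related variant of the match approach, not part of the original answer, is to join on the key. The sketch below assumes a hypothetical `lookup_sorted()` that stands in for readNWISsite and returns rows sorted by 'site_no'; the station names it produces are placeholders, not real NWIS data.

```r
library(dplyr)

# Hypothetical stand-in for readNWISsite(): rows come back sorted by site_no.
lookup_sorted <- function(ids) {
  data.frame(site_no    = sort(ids),
             station_nm = paste("station", sort(ids)),  # placeholder names
             stringsAsFactors = FALSE)
}

Gauges_demo <- tibble(
  Name = c("Twisp", "Chewuch", "Andrews"),
  ID   = c("12448998", "12448000", "12447390")
)

# One vectorised lookup, renamed so the key column matches Gauges_demo.
site_info <- lookup_sorted(Gauges_demo$ID) %>%
  select(ID = site_no, Station = station_nm)

# Join on the key instead of relying on row position;
# left_join keeps the row order of the left-hand table.
joined <- Gauges_demo %>%
  left_join(site_info, by = "ID")
```

Like match, this needs only a single lookup call for all IDs, whereas rowwise and map call the function once per ID.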
    

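    【Comments】:

    • Thanks @akrun. Does mutate() sort its output by default, or is this a special case when the input column holds numbers stored as character, or something else?
    • @BrianFisher No, mutate returns the rows in the same order, but here the output of readNWISsite is a data.frame with many columns, and the sorting happens at that level.
    • Thanks @akrun, I think that is the part I was missing.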