【问题标题】:how to tidy Standard Hydrologic Exchange Format (SHEF) data如何整理标准水文交换格式 (SHEF) 数据
【发布时间】:2019-08-20 00:17:33
【问题描述】:

美国国家海洋和大气管理局 (NOAA) 拥有大量采用标准水文交换格式 (SHEF) 的数据(例如,如下链接所示)。链接数据有四个主要信息:位置名称、位置 ID、报告值(数字或“NE” - 未估计)和高程区域。我希望将 SHEF 数据转换为四列data.frames。 SHEF 格式虽然名称中有“交换”,但使用起来似乎并不简单,但我可能会遗漏一些东西。

下面的两个链接数据页面都有 1137 行雪数据文本,它们的位置和时间相同,但雪参数不同。

有两个代码块,每个网页一个。除了指向各自参数的 url 之外,它们是相同的。

下面的代码为其中一个参数swe 输出几乎预期的data.frame,但对于另一个参数sub,结果data.frame 显然相对于原始数据部分完整,并且有错误的值(见底部的小标题)。我在想,因为 SHEF 格式至少是一致的,并且因为可能有专门用于这类事情的函数/库,所以转换可能需要一个完全不同的角度/显着减少的步骤?

雪参数1(“swe”)(雪水当量): https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12(灰色框内的数据)

雪参数2(“sub”)(升华): https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12(灰色框内的数据)

我希望有两个data.framesswesub,每个有 4 列。下面是工作示例。

library(tidyverse)
library(rvest)
library(lubridate)


# webpage to scrape data from, March27's parameter "swe"      
march27_param_swe <- 
 "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12"

####### snow water equivalent (swe) [inches] ##########

# scrape
scrapedtext <- read_html(march27_param_swe) %>% html_node(".notes") %>%
               html_text() 



swe <- tibble(txt = read_lines(scrapedtext)) %>%

 mutate(
         row = row_number(),
         with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
         wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% 
         str_extract("[:digit:]+\\.?[:digit:]"),
         basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% 
         str_sub(start = 2)
       ) 

swe <- swe %>% separate(with_code, c("code", "val"), sep = "\\s+") %>%  
       mutate(value = case_when(
                                !is.na(val) ~ val,
                                !is.na(wo_code) ~ wo_code,
                                TRUE ~ NA_character_) %>%
                                as.numeric) %>% filter(!is.na(value)) 

swe <- swe %>% mutate(code = zoo::na.locf(code), basin_desc = zoo::na.locf(basin_desc) ,
                      elevz = gsub(".*(inches))","",txt))  %>%
                      select(code, value, basin_desc, elevz) %>%
                      mutate(elevz = trimws(elevz))

 dim(swe) 
 #[1] 643   4

 head(swe)  
 # # A tibble: 6 x 4
 # code    value basin_desc               elevz             
 # <chr>   <dbl> <chr>                    <chr>             
 # 1 ACSC1   0   San Antonio Ck - Sunol   "Entire Basin"   
 # 2 ADLC1   0   Arroyo De La Laguna      "Entire Basin"   
 # 3 ADOC1   0   Santa Ana R - Prado Dam  "Entire Basin"   
 # 4 AHOC1   0   Arroyo Honda nr San Jose "Entire Basin"   
 # 5 AKYC1  41.8 SF American nr Kyburz    "Entire Basin"   
 # 6 AKYC1   3.9 SF American nr Kyburz    "Base  to 5000'"

 #which is what I'm hoping for, except that I'd like the `value` to be 
 #<chr> to be able to accommodate the numbers and "NE" values reported, like this:

 # # A tibble: 6 x 4
 # code  value basin_desc               elevz             
 # <chr> <chr> <chr>                    <chr>          

 #######  surface sublimation (sub) ##########

# same locations and day, different parameter, "sb", blowing snow 
# sublimation [inches]

march27_param_temp <- "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12"

scrapedtext <- read_html(march27_param_temp) %>%
               html_node(".notes") %>% html_text() 

sub <- tibble(txt = read_lines(scrapedtext)) %>%
  mutate(
         row = row_number(),
         with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
         wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% 
         str_extract("[:digit:]+\\.?[:digit:]"),
         basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>%
         str_sub(start = 2)
) 

sub <- sub %>% separate(with_code, c("code", "val"), sep = "\\s+") %>%  
        mutate(value = case_when(
                                 !is.na(val) ~ val,
                                 !is.na(wo_code) ~ wo_code,
                                 TRUE ~ NA_character_) %>%
                                 as.numeric) %>% filter(!is.na(value)) 

sub <- sub %>% mutate(code = zoo::na.locf(code), basin_desc = zoo::na.locf(basin_desc) ,
                      elevz = gsub(".*(inches))","",txt))  %>%
                      select(code, value, basin_desc, elevz)  %>%
                      mutate(elevz = trimws(elevz))

dim(sub)
#[1] 263   4    #dim[swe] was 643x4

head(sub)

 # A tibble: 6 x 4
 #code     value   basin_desc                elevz             
 #<chr>    <dbl>   <chr>                     <chr>             
 #1 ADOC1     0    Santa Ana R - Prado Dam   "Entire Basin"   
 #2 ADOC1     0    Santa Ana R - Prado Dam   "Base  to 5000'"
 #3 ARCC1     0    Mad River - Arcata        "Entire Basin"   
 #4 ARCC1     0    Mad River - Arcata        "Base  to 5000'"
 #5 BCAC1     0    Little Truckee - Boca Dam "Entire Basin"   

#So `sub` should be the same size `data.frame` as swe, and 
#sub$value's are supposed to be (as per the source page above: 
# https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12 ):

 #head(desired_sub)
 # A tibble: 6 x 4
 #code        value  basin_desc                elevz             
 #<chr>       <chr>  <chr>                     <chr>             
 #1 ADOC1     NE     Santa Ana R - Prado Dam   "Entire Basin"   
 #2 ADOC1     NE     Santa Ana R - Prado Dam   "Base  to 5000'"
 #3 ARCC1     0.000  Mad River - Arcata        "Entire Basin"   
 #4 ARCC1     NE     Mad River - Arcata        "Base  to 5000'"
 #5 BCAC1    -0.016  Little Truckee - Boca Dam "Entire Basin"   

【问题讨论】:

  • 在我看来这是一个固定格式的文本文件。你可以尝试根据固定宽度解析它们吗?
  • 好点@Tung。是的,我会尝试。
  • 请发布解决方案作为答案,以帮助未来的读者,如果你得到它的工作
  • 我还没有机会学习/尝试固定宽度函数,但如果我得到一个解决方案,我绝对会发布一个解决方案。

标签: r regex tidyverse tidyr opendata


【解决方案1】:

我认为您的问题可能是由于数据输出不一致:带有代码的行可以以冒号开头,也可以不带冒号。

我制作了一个新代码,通过搜索以代码(或:+代码)开头的行来识别数据块,然后将每个块读入数据帧。

试试这个:

library(rvest)
library(stringr)

# Read an individual block
readBlock = function(text){
  basin = str_replace(string = text[1], pattern = "^:", replacement = "")
  block = text[-1]
  code = str_match(block[1], "[A-Z0-9]{5}")[1]
  block = str_replace(block, "^(:?[^ ]+|:)", "")
  block = str_replace(block, "%", "(%)")
  block = str_replace_all(block, "[;():]", "|")
  block = trimws(block)
  block = str_split(block,"\\|")
  block = as.data.frame(do.call(rbind, block))
  colnames(block) = c("Value","Calc", "Units", "Location")
  block$Code = code
  block$Basin = basin
  return(block)
}

# Find blocks starting index
findBlocks = function(text){
  index = which(str_detect(text,"^:?[A-Z0-9]{5}"))
  index = index[index > 10]
  index = index - 1
  index = c(index, 1 + which(str_detect(text,"\\.END")))
  return(index)
}

# return a data frame with all blocks
readAllBlocks = function(index, text){
  blocks = lapply(1:(length(index)-1), function(x){
    blockText = text[index[x]:(index[x+1]-2)]
    readBlock(blockText)
  })
  blocks = do.call(rbind, blocks)
  return(blocks)
}


####### snow water equivalent (swe) [inches] ##########
march27_param_swe = "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12"
# scrape
scrapedtext = html_text(html_node(read_html(march27_param_swe),".notes"))
scrapedtext = unlist(str_split(scrapedtext,"\n"))
block_index = findBlocks(scrapedtext)
swe = readAllBlocks(block_index, scrapedtext)



#######  surface sublimation (sub) ##########
march27_param_temp = "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12"
scrapedtext = html_text(html_node(read_html(march27_param_temp),".notes"))
scrapedtext = unlist(str_split(scrapedtext,"\n"))
block_index = findBlocks(scrapedtext)
sub = readAllBlocks(block_index, scrapedtext)

编辑: 如果单元% 没有括号,则在替换之前将其括起来。这条线应该可以解决问题:

block = str_replace(block, "%", "(%)")

我编辑了上面的代码以在需要的地方包含它。

【讨论】:

  • 我会从中学到很多东西。出色的工作!谢谢你。它适用于 6 个参数中的 5 个参数 - 参数 SCA ( url nohrsc.noaa.gov/shef_archive/… ) 引发错误,我认为这是因为单元 % 没有像其他参数那样的括号,例如 (inches)。更新函数以适应% 会很困难吗?
猜你喜欢
  • 2012-05-28
  • 2016-12-12
  • 2020-05-28
  • 2015-03-18
  • 2013-05-05
  • 2021-12-18
  • 1970-01-01
  • 1970-01-01
  • 2012-04-26
相关资源
最近更新 更多