【发布时间】:2019-08-20 00:17:33
【问题描述】:
美国国家海洋和大气管理局 (NOAA) 拥有大量采用标准水文交换格式 (SHEF) 的数据(例如,如下链接所示)。链接数据有四个主要信息:位置名称、位置 ID、报告值(数字或“NE” - 未估计)和高程区域。我希望将 SHEF 数据转换为四列data.frames。 SHEF 格式虽然名称中有“交换”,但使用起来似乎并不简单,但我可能会遗漏一些东西。
下面的两个链接数据页面都有 1137 行雪数据文本,它们的位置和时间相同,但雪参数不同。
有两个代码块,每个网页一个。除了指向各自参数的 url 之外,它们是相同的。
下面的代码为其中一个参数swe 输出几乎预期的data.frame,但对于另一个参数sub,结果data.frame 显然相对于原始数据部分完整,并且有错误的值(见底部的小标题)。我在想,因为 SHEF 格式至少是一致的,并且因为可能有专门用于这类事情的函数/库,所以转换可能需要一个完全不同的角度/显着减少的步骤?
雪参数1(“swe”)(雪水当量): https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12(灰色框内的数据)
雪参数2(“sub”)(升华): https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12(灰色框内的数据)
我希望有两个data.frames、swe 和sub,每个有 4 列。下面是工作示例。
library(tidyverse)
library(rvest)
library(lubridate)
# webpage to scrape data from, March27's parameter "swe"
march27_param_swe <-
"https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12"
####### snow water equivalent (swe) [inches] ##########
# scrape
scrapedtext <- read_html(march27_param_swe) %>% html_node(".notes") %>%
html_text()
swe <- tibble(txt = read_lines(scrapedtext)) %>%
mutate(
row = row_number(),
with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>%
str_extract("[:digit:]+\\.?[:digit:]"),
basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>%
str_sub(start = 2)
)
swe <- swe %>% separate(with_code, c("code", "val"), sep = "\\s+") %>%
mutate(value = case_when(
!is.na(val) ~ val,
!is.na(wo_code) ~ wo_code,
TRUE ~ NA_character_) %>%
as.numeric) %>% filter(!is.na(value))
swe <- swe %>% mutate(code = zoo::na.locf(code), basin_desc = zoo::na.locf(basin_desc) ,
elevz = gsub(".*(inches))","",txt)) %>%
select(code, value, basin_desc, elevz) %>%
mutate(elevz = trimws(elevz))
dim(swe)
#[1] 643 4
head(swe)
# # A tibble: 6 x 4
# code value basin_desc elevz
# <chr> <dbl> <chr> <chr>
# 1 ACSC1 0 San Antonio Ck - Sunol "Entire Basin"
# 2 ADLC1 0 Arroyo De La Laguna "Entire Basin"
# 3 ADOC1 0 Santa Ana R - Prado Dam "Entire Basin"
# 4 AHOC1 0 Arroyo Honda nr San Jose "Entire Basin"
# 5 AKYC1 41.8 SF American nr Kyburz "Entire Basin"
# 6 AKYC1 3.9 SF American nr Kyburz "Base to 5000'"
#which is what I'm hoping for, except that I'd like the `value` to be
#<chr> to be able to accommodate the numbers and "NE" values reported, like this:
# # A tibble: 6 x 4
# code value basin_desc elevz
# <chr> <chr> <chr> <chr>
####### surface sublimation (sub) ##########
# same locations and day, different parameter, "sb", blowing snow
# sublimation [inches]
march27_param_temp <- "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12"
scrapedtext <- read_html(march27_param_temp) %>%
html_node(".notes") %>% html_text()
sub <- tibble(txt = read_lines(scrapedtext)) %>%
mutate(
row = row_number(),
with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>%
str_extract("[:digit:]+\\.?[:digit:]"),
basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>%
str_sub(start = 2)
)
sub <- sub %>% separate(with_code, c("code", "val"), sep = "\\s+") %>%
mutate(value = case_when(
!is.na(val) ~ val,
!is.na(wo_code) ~ wo_code,
TRUE ~ NA_character_) %>%
as.numeric) %>% filter(!is.na(value))
sub <- sub %>% mutate(code = zoo::na.locf(code), basin_desc = zoo::na.locf(basin_desc) ,
elevz = gsub(".*(inches))","",txt)) %>%
select(code, value, basin_desc, elevz) %>%
mutate(elevz = trimws(elevz))
dim(sub)
#[1] 263 4 #dim[swe] was 643x4
head(sub)
# A tibble: 6 x 4
#code value basin_desc elevz
#<chr> <dbl> <chr> <chr>
#1 ADOC1 0 Santa Ana R - Prado Dam "Entire Basin"
#2 ADOC1 0 Santa Ana R - Prado Dam "Base to 5000'"
#3 ARCC1 0 Mad River - Arcata "Entire Basin"
#4 ARCC1 0 Mad River - Arcata "Base to 5000'"
#5 BCAC1 0 Little Truckee - Boca Dam "Entire Basin"
#So `sub` should be the same size `data.frame` as swe, and
#sub$value's are supposed to be (as per the source page above:
# https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12 ):
#head(desired_sub)
# A tibble: 6 x 4
#code value basin_desc elevz
#<chr> <chr> <chr> <chr>
#1 ADOC1 NE Santa Ana R - Prado Dam "Entire Basin"
#2 ADOC1 NE Santa Ana R - Prado Dam "Base to 5000'"
#3 ARCC1 0.000 Mad River - Arcata "Entire Basin"
#4 ARCC1 NE Mad River - Arcata "Base to 5000'"
#5 BCAC1 -0.016 Little Truckee - Boca Dam "Entire Basin"
【问题讨论】:
-
在我看来这是一个固定格式的文本文件。你可以尝试根据固定宽度解析它们吗?
-
好点@Tung。是的,我会尝试。
-
请发布解决方案作为答案,以帮助未来的读者,如果你得到它的工作
-
我还没有机会学习/尝试固定宽度函数,但如果我得到一个解决方案,我绝对会发布一个解决方案。
标签: r regex tidyverse tidyr opendata