在当前版本的 dplyr (1.0.5) 中,使用dplyr::select() 时属性“variable.labels”完全丢失。似乎存储列标签的更好方法是将属性与向量本身相关联,而不是与整个数据帧相关联,这就是 Haven 包中的函数所做的。使用haven::read_sav() 作为替代方案将在转换数据时保留标签。
这是一个可重复的比较:
library(foreign)
# import example SAV from foreign
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)
# labels are saved as a variable.labels attribute for the whole dataframe
str(dat_for)
#> 'data.frame': 240 obs. of 13 variables:
#> $ CASEID : num 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#> $ AGE : num 40 49 43 50 43 50 45 47 53 49 ...
#> $ DBP58 : num 70 87 89 105 110 88 70 79 102 99 ...
#> $ EDUYR : num 16 11 12 8 NA 8 NA 9 12 14 ...
#> $ CHOL58 : num 321 246 262 275 301 261 212 372 216 251 ...
#> $ CGT58 : num 0 60 0 15 25 30 0 30 0 10 ...
#> $ HT58 : num 68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#> $ WT58 : num 190 204 162 152 148 142 196 193 172 162 ...
#> $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#> $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#> $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#> $ CHD : num 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#> ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
# dplyr::select() loses the variable.labels attribute:
library(dplyr, warn.conflicts = FALSE)
dat_for %>% select(1:2) %>% str()
#> 'data.frame': 240 obs. of 2 variables:
#> $ CASEID : num 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
# but to be fair, base slicing also loses the attribute:
dat_for[,1:2] %>% str()
#> 'data.frame': 240 obs. of 2 variables:
#> $ CASEID : num 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
# import same file with haven
library(haven)
dat_hav <- read_sav(file = sav)
# labels are attributes of columns
str(dat_hav$CASEID)
#> num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#> - attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#> - attr(*, "format.spss")= chr "F4.0"
#> - attr(*, "display_width")= int 0
# dplyr::select() keeps them
dat_hav %>% select(1:2) %>% str()
#> tibble[,2] [240 × 2] (S3: tbl_df/tbl/data.frame)
#> $ CASEID : num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#> ..- attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#> ..- attr(*, "format.spss")= chr "F4.0"
#> ..- attr(*, "display_width")= int 0
#> $ FIRSTCHD: dbl+lbl [1:240] 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 6, 2, 3, 5, 3, 3, 3, 3, ...
#> ..@ label : chr "FIRST CHD EVENT"
#> ..@ format.spss : chr "F1.0"
#> ..@ display_width: int 0
#> ..@ labels : Named num [1:5] 1 2 3 5 6
#> .. ..- attr(*, "names")= chr [1:5] "NO CHD" "SUDDEN DEATH" "NONFATALMI" "FATAL MI" ...
#> - attr(*, "label")= chr " SPSS/PC+"
由reprex package (v1.0.0) 于 2021-03-30 创建