R for Data Science总结之——modelr(3)
这一章中我们将对大型数据集进行分组建模,深入挖掘数据集特征:
library(modelr)
library(tidyverse)
library(gapminder)
gapminder
#> # A tibble: 1,704 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> # ... with 1,698 more rows
这里主要挖掘人均寿命lifeExp与年份year以及国家country之间的关系,首先作图:
gapminder %>%
ggplot(aes(year, lifeExp, group = country)) +
geom_line(alpha = 1/3)
如果我们挑出其中某一个国家,其结果与前几章相似:
nz <- filter(gapminder, country == "New Zealand")
nz %>%
ggplot(aes(year, lifeExp)) +
geom_line() +
ggtitle("Full data = ")
nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>%
add_predictions(nz_mod) %>%
ggplot(aes(year, pred)) +
geom_line() +
ggtitle("Linear trend + ")
nz %>%
add_residuals(nz_mod) %>%
ggplot(aes(year, resid)) +
geom_hline(yintercept = 0, colour = "white", size = 3) +
geom_line() +
ggtitle("Remaining pattern")
而对一个数据集的深入挖掘需要对每个国家分别建模,这里用到purrr包的一个map函数,而对整个数据集进行操作,我们需要用到一个嵌套的数据集,这里可以先用group_by()分组,再nest()嵌套:
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country
#> # A tibble: 142 x 3
#> country continent data
#> <fct> <fct> <list>
#> 1 Afghanistan Asia <tibble [12 × 4]>
#> 2 Albania Europe <tibble [12 × 4]>
#> 3 Algeria Africa <tibble [12 × 4]>
#> 4 Angola Africa <tibble [12 × 4]>
#> 5 Argentina Americas <tibble [12 × 4]>
#> 6 Australia Oceania <tibble [12 × 4]>
#> # ... with 136 more rows
我们从中抽出一个数据集:
by_country$data[[1]]
#> # A tibble: 12 x 4
#> year lifeExp pop gdpPercap
#> <int> <dbl> <int> <dbl>
#> 1 1952 28.8 8425333 779.
#> 2 1957 30.3 9240934 821.
#> 3 1962 32.0 10267083 853.
#> 4 1967 34.0 11537966 836.
#> 5 1972 36.1 13079460 740.
#> 6 1977 38.4 14880372 786.
#> # ... with 6 more rows
这里可以看出,一个分组的数据集,每一行是一个观测值,而一个嵌套数据集,每一行是一个组。
这里再创建一个函数用于对每一个组进行建模:
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
然后可以将其加到每一个组:
models <- map(by_country$data, country_model)
我们再在嵌套数据集中加上模型这个变量:
by_country <- by_country %>%
mutate(model = map(data, country_model))
by_country
#> # A tibble: 142 x 4
#> country continent data model
#> <fct> <fct> <list> <list>
#> 1 Afghanistan Asia <tibble [12 × 4]> <S3: lm>
#> 2 Albania Europe <tibble [12 × 4]> <S3: lm>
#> 3 Algeria Africa <tibble [12 × 4]> <S3: lm>
#> 4 Angola Africa <tibble [12 × 4]> <S3: lm>
#> 5 Argentina Americas <tibble [12 × 4]> <S3: lm>
#> 6 Australia Oceania <tibble [12 × 4]> <S3: lm>
#> # ... with 136 more rows
by_country %>%
filter(continent == "Europe")
#> # A tibble: 30 x 4
#> country continent data model
#> <fct> <fct> <list> <list>
#> 1 Albania Europe <tibble [12 × 4]> <S3: lm>
#> 2 Austria Europe <tibble [12 × 4]> <S3: lm>
#> 3 Belgium Europe <tibble [12 × 4]> <S3: lm>
#> 4 Bosnia and Herzegovina Europe <tibble [12 × 4]> <S3: lm>
#> 5 Bulgaria Europe <tibble [12 × 4]> <S3: lm>
#> 6 Croatia Europe <tibble [12 × 4]> <S3: lm>
#> # ... with 24 more rows
by_country %>%
arrange(continent, country)
#> # A tibble: 142 x 4
#> country continent data model
#> <fct> <fct> <list> <list>
#> 1 Algeria Africa <tibble [12 × 4]> <S3: lm>
#> 2 Angola Africa <tibble [12 × 4]> <S3: lm>
#> 3 Benin Africa <tibble [12 × 4]> <S3: lm>
#> 4 Botswana Africa <tibble [12 × 4]> <S3: lm>
#> 5 Burkina Faso Africa <tibble [12 × 4]> <S3: lm>
#> 6 Burundi Africa <tibble [12 × 4]> <S3: lm>
#> # ... with 136 more rows
再向其中添加残差:
by_country <- by_country %>%
mutate(
resids = map2(data, model, add_residuals)
)
by_country
#> # A tibble: 142 x 5
#> country continent data model resids
#> <fct> <fct> <list> <list> <list>
#> 1 Afghanistan Asia <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 2 Albania Europe <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 3 Algeria Africa <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 4 Angola Africa <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 5 Argentina Americas <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 6 Australia Oceania <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> # ... with 136 more rows
再把数据集拆开unnest():
resids <- unnest(by_country, resids)
resids
#> # A tibble: 1,704 x 7
#> country continent year lifeExp pop gdpPercap resid
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. -1.11
#> 2 Afghanistan Asia 1957 30.3 9240934 821. -0.952
#> 3 Afghanistan Asia 1962 32.0 10267083 853. -0.664
#> 4 Afghanistan Asia 1967 34.0 11537966 836. -0.0172
#> 5 Afghanistan Asia 1972 36.1 13079460 740. 0.674
#> 6 Afghanistan Asia 1977 38.4 14880372 786. 1.65
#> # ... with 1,698 more rows
现在再画出残差图:
resids %>%
ggplot(aes(year, resid)) +
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
或者按洲分组画出:
resids %>%
ggplot(aes(year, resid, group = country)) +
geom_line(alpha = 1 / 3) +
facet_wrap(~continent)
为了评价模型,可以采用broom::glance()提出一些模型质量参数:
broom::glance(nz_mod)
#> # A tibble: 1 x 11
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.954 0.949 0.804 205. 5.41e-8 2 -13.3 32.6 34.1
#> # ... with 2 more variables: deviance <dbl>, df.residual <int>
再通过map()函数对每一个模型添加glance():
by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance)
#> # A tibble: 142 x 16
#> country continent data model resids r.squared adj.r.squared sigma
#> <fct> <fct> <lis> <lis> <list> <dbl> <dbl> <dbl>
#> 1 Afghan… Asia <tib… <S3:… <tibb… 0.948 0.942 1.22
#> 2 Albania Europe <tib… <S3:… <tibb… 0.911 0.902 1.98
#> 3 Algeria Africa <tib… <S3:… <tibb… 0.985 0.984 1.32
#> 4 Angola Africa <tib… <S3:… <tibb… 0.888 0.877 1.41
#> 5 Argent… Americas <tib… <S3:… <tibb… 0.996 0.995 0.292
#> 6 Austra… Oceania <tib… <S3:… <tibb… 0.980 0.978 0.621
#> # ... with 136 more rows, and 8 more variables: statistic <dbl>,
#> # p.value <dbl>, df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> # deviance <dbl>, df.residual <int>
而若想丢掉data, model, resids这些嵌套数据,可设置.drop = TRUE:
glance <- by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE)
glance
#> # A tibble: 142 x 13
#> country continent r.squared adj.r.squared sigma statistic p.value df
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 Afghan… Asia 0.948 0.942 1.22 181. 9.84e- 8 2
#> 2 Albania Europe 0.911 0.902 1.98 102. 1.46e- 6 2
#> 3 Algeria Africa 0.985 0.984 1.32 662. 1.81e-10 2
#> 4 Angola Africa 0.888 0.877 1.41 79.1 4.59e- 6 2
#> 5 Argent… Americas 0.996 0.995 0.292 2246. 4.22e-13 2
#> 6 Austra… Oceania 0.980 0.978 0.621 481. 8.67e-10 2
#> # ... with 136 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> # BIC <dbl>, deviance <dbl>, df.residual <int>
再根据r.squared对模型好坏进行排序:
glance %>%
arrange(r.squared)
#> # A tibble: 142 x 13
#> country continent r.squared adj.r.squared sigma statistic p.value df
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 Rwanda Africa 0.0172 -0.0811 6.56 0.175 0.685 2
#> 2 Botswa… Africa 0.0340 -0.0626 6.11 0.352 0.566 2
#> 3 Zimbab… Africa 0.0562 -0.0381 7.21 0.596 0.458 2
#> 4 Zambia Africa 0.0598 -0.0342 4.53 0.636 0.444 2
#> 5 Swazil… Africa 0.0682 -0.0250 6.64 0.732 0.412 2
#> 6 Lesotho Africa 0.0849 -0.00666 5.93 0.927 0.358 2
#> # ... with 136 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> # BIC <dbl>, deviance <dbl>, df.residual <int>
glance %>%
ggplot(aes(continent, r.squared)) +
geom_jitter(width = 0.5)
从中挑出很糟的模型:
bad_fit <- filter(glance, r.squared < 0.25)
gapminder %>%
semi_join(bad_fit, by = "country") %>%
ggplot(aes(year, lifeExp, colour = country)) +
geom_line()
list-columns数据结构
上述所有变换全部基于list-columns数据结构:
data.frame(x = list(1:3, 3:5))
#> x.1.3 x.3.5
#> 1 1 3
#> 2 2 4
#> 3 3 5
若要取消这种变换特性可以使用I():
data.frame(
x = I(list(1:3, 3:5)),
y = c("1, 2", "3, 4, 5")
)
#> x y
#> 1 1, 2, 3 1, 2
#> 2 3, 4, 5 3, 4, 5
而用tibble()则完全不会有这种转换策略:
tibble(
x = list(1:3, 3:5),
y = c("1, 2", "3, 4, 5")
)
#> # A tibble: 2 x 2
#> x y
#> <list> <chr>
#> 1 <int [3]> 1, 2
#> 2 <int [3]> 3, 4, 5
也可以用tribble():
tribble(
~x, ~y,
1:3, "1, 2",
3:5, "3, 4, 5"
)
#> # A tibble: 2 x 2
#> x y
#> <list> <chr>
#> 1 <int [3]> 1, 2
#> 2 <int [3]> 3, 4, 5
通常创建list-column用pipes有以下几步:
- nest(), summarise() + list()或者mutate() + map()
- 直接map(), map2(), pmap()
- 将list-column简化成数据集
创建list-columns
- tidyr::nest()
- mutate()
- summarise()
- tibble:enframe()
nest()嵌套
apminder %>%
group_by(country, continent) %>%
nest()
#> # A tibble: 142 x 3
#> country continent data
#> <fct> <fct> <list>
#> 1 Afghanistan Asia <tibble [12 × 4]>
#> 2 Albania Europe <tibble [12 × 4]>
#> 3 Algeria Africa <tibble [12 × 4]>
#> 4 Angola Africa <tibble [12 × 4]>
#> 5 Argentina Americas <tibble [12 × 4]>
#> 6 Australia Oceania <tibble [12 × 4]>
#> # ... with 136 more rows
或者直接不分组嵌套:
gapminder %>%
nest(year:gdpPercap)
#> # A tibble: 142 x 3
#> country continent data
#> <fct> <fct> <list>
#> 1 Afghanistan Asia <tibble [12 × 4]>
#> 2 Albania Europe <tibble [12 × 4]>
#> 3 Algeria Africa <tibble [12 × 4]>
#> 4 Angola Africa <tibble [12 × 4]>
#> 5 Argentina Americas <tibble [12 × 4]>
#> 6 Australia Oceania <tibble [12 × 4]>
#> # ... with 136 more rows
mutate()嵌套:
df <- tribble(
~x1,
"a,b,c",
"d,e,f,g"
)
df %>%
mutate(x2 = stringr::str_split(x1, ","))
#> # A tibble: 2 x 2
#> x1 x2
#> <chr> <list>
#> 1 a,b,c <chr [3]>
#> 2 d,e,f,g <chr [4]>
这种情况下即使没有对数据集进行nest(),也可以用unnest()处理:
df %>%
mutate(x2 = stringr::str_split(x1, ",")) %>%
unnest()
#> # A tibble: 7 x 2
#> x1 x2
#> <chr> <chr>
#> 1 a,b,c a
#> 2 a,b,c b
#> 3 a,b,c c
#> 4 d,e,f,g d
#> 5 d,e,f,g e
#> 6 d,e,f,g f
#> # ... with 1 more row
另外一个map()的例子:
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sims = invoke_map(f, params, n = 10))
#> # A tibble: 3 x 3
#> f params sims
#> <chr> <list> <list>
#> 1 runif <list [2]> <dbl [10]>
#> 2 rnorm <list [1]> <dbl [10]>
#> 3 rpois <list [1]> <int [10]>
与summaries()联用,可用list()使本身只能产生一个结果的函数产生多个结果并放在list中一起返回:
mtcars %>%
group_by(cyl) %>%
summarise(q = quantile(mpg))
#> Error in summarise_impl(.data, dots): Column `q` must be length 1 (a summary value), not 5
mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg)))
#> # A tibble: 3 x 2
#> cyl q
#> <dbl> <list>
#> 1 4 <dbl [5]>
#> 2 6 <dbl [5]>
#> 3 8 <dbl [5]>
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>%
group_by(cyl) %>%
summarise(p = list(probs), q = list(quantile(mpg, probs))) %>%
unnest()
#> # A tibble: 15 x 3
#> cyl p q
#> <dbl> <dbl> <dbl>
#> 1 4 0.01 21.4
#> 2 4 0.25 22.8
#> 3 4 0.5 26
#> 4 4 0.75 30.4
#> 5 4 0.99 33.8
#> 6 6 0.01 17.8
#> # ... with 9 more rows
对一个list直接进行tibble::enframe()生成数据集用法:
x <- list(
a = 1:5,
b = 3:4,
c = 5:6
)
df <- enframe(x)
df
#> # A tibble: 3 x 2
#> name value
#> <chr> <list>
#> 1 a <int [5]>
#> 2 b <int [2]>
#> 3 c <int [2]>
这样就做到一行中一个为标签值,一个为所有数值,如果想对其进行操作,推荐map函数:
df %>%
mutate(
smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
)
#> # A tibble: 3 x 3
#> name value smry
#> <chr> <list> <chr>
#> 1 a <int [5]> a: 1
#> 2 b <int [2]> b: 3
#> 3 c <int [2]> c: 5
简化list-columns
为了对数据集进行操纵和可视化,需要将list-column变回元数据:
- 如果想得到单一值,联用mutate()和map_lgl(), map_int(), map_dbl(), map_chr()
- 如果想得到多个值用unnest()
df <- tribble(
~x,
letters[1:5],
1:3,
runif(5)
)
df %>% mutate(
type = map_chr(x, typeof),
length = map_int(x, length)
)
#> # A tibble: 3 x 3
#> x type length
#> <list> <chr> <int>
#> 1 <chr [5]> character 5
#> 2 <int [3]> integer 3
#> 3 <dbl [5]> double 5
df <- tribble(
~x,
list(a = 1, b = 2),
list(a = 2, c = 4)
)
df %>% mutate(
a = map_dbl(x, "a"),
b = map_dbl(x, "b", .null = NA_real_)
)
#> # A tibble: 2 x 3
#> x a b
#> <list> <dbl> <dbl>
#> 1 <list [2]> 1 2
#> 2 <list [2]> 2 NA
unnest()的操作就简单许多:
tibble(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
#> # A tibble: 5 x 2
#> x y
#> <int> <dbl>
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> 4 1 4
#> 5 2 1
这里注意,unnest()只能一列一列解嵌套,一个x值对应list中的一个组分:
# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
~x, ~y, ~z,
1, c("a", "b"), 1:2,
2, "c", 3
)
df1
#> # A tibble: 2 x 3
#> x y z
#> <dbl> <list> <list>
#> 1 1 <chr [2]> <int [2]>
#> 2 2 <chr [1]> <dbl [1]>
df1 %>% unnest(y, z)
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <chr> <dbl>
#> 1 1 a 1
#> 2 1 b 2
#> 3 2 c 3
# Doesn't work because y and z have different number of elements
df2 <- tribble(
~x, ~y, ~z,
1, "a", 1:2,
2, c("b", "c"), 3
)
df2
#> # A tibble: 2 x 3
#> x y z
#> <dbl> <list> <list>
#> 1 1 <chr [1]> <int [2]>
#> 2 2 <chr [2]> <dbl [1]>
df2 %>% unnest(y, z)
#> Error: All nested columns must have the same number of elements.
broom包制作tidy data
broom包可以有效地将模型参数做成tidy格式地数据集:
- broom::glance(model): 对每个模型生成一行总结模型信息地数据行
- broom::tidy(model): 返回一行总结模型参数以及预测值
- broom::augment(model, data): 返回行对数据集中的每一行添加残差,统计量影响等参数
以上代码已上传Github Click Here