R for Data Science总结之——modelr(3)

这一章中我们将对大型数据集进行分组建模,深入挖掘数据集特征:

library(modelr)
library(tidyverse)
library(gapminder)
gapminder
#> # A tibble: 1,704 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> 5 Afghanistan Asia       1972    36.1 13079460      740.
#> 6 Afghanistan Asia       1977    38.4 14880372      786.
#> # ... with 1,698 more rows

这里主要挖掘人均寿命lifeExp与年份year以及国家country之间的关系,首先作图:

gapminder %>% 
  ggplot(aes(year, lifeExp, group = country)) +
    geom_line(alpha = 1/3)

R for Data Science总结之——modelr(3)
如果我们挑出其中某一个国家,其结果与前几章相似:

nz <- filter(gapminder, country == "New Zealand")
nz %>% 
  ggplot(aes(year, lifeExp)) + 
  geom_line() + 
  ggtitle("Full data = ")

nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>% 
  add_predictions(nz_mod) %>%
  ggplot(aes(year, pred)) + 
  geom_line() + 
  ggtitle("Linear trend + ")

nz %>% 
  add_residuals(nz_mod) %>% 
  ggplot(aes(year, resid)) + 
  geom_hline(yintercept = 0, colour = "white", size = 3) + 
  geom_line() + 
  ggtitle("Remaining pattern")

R for Data Science总结之——modelr(3)
而对一个数据集的深入挖掘需要对每个国家分别建模,这里用到purrr包的一个map函数,而对整个数据集进行操作,我们需要用到一个嵌套的数据集,这里可以先用group_by()分组,再nest()嵌套:

by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest()

by_country
#> # A tibble: 142 x 3
#>   country     continent data             
#>   <fct>       <fct>     <list>           
#> 1 Afghanistan Asia      <tibble [12 × 4]>
#> 2 Albania     Europe    <tibble [12 × 4]>
#> 3 Algeria     Africa    <tibble [12 × 4]>
#> 4 Angola      Africa    <tibble [12 × 4]>
#> 5 Argentina   Americas  <tibble [12 × 4]>
#> 6 Australia   Oceania   <tibble [12 × 4]>
#> # ... with 136 more rows

我们从中抽出一个数据集:

by_country$data[[1]]
#> # A tibble: 12 x 4
#>    year lifeExp      pop gdpPercap
#>   <int>   <dbl>    <int>     <dbl>
#> 1  1952    28.8  8425333      779.
#> 2  1957    30.3  9240934      821.
#> 3  1962    32.0 10267083      853.
#> 4  1967    34.0 11537966      836.
#> 5  1972    36.1 13079460      740.
#> 6  1977    38.4 14880372      786.
#> # ... with 6 more rows

这里可以看出,一个分组的数据集,每一行是一个观测值,而一个嵌套数据集,每一行是一个组。
这里再创建一个函数用于对每一个组进行建模:

country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

然后可以将其加到每一个组:

models <- map(by_country$data, country_model)

我们再在嵌套数据集中加上模型这个变量:

by_country <- by_country %>% 
  mutate(model = map(data, country_model))
by_country
#> # A tibble: 142 x 4
#>   country     continent data              model   
#>   <fct>       <fct>     <list>            <list>  
#> 1 Afghanistan Asia      <tibble [12 × 4]> <S3: lm>
#> 2 Albania     Europe    <tibble [12 × 4]> <S3: lm>
#> 3 Algeria     Africa    <tibble [12 × 4]> <S3: lm>
#> 4 Angola      Africa    <tibble [12 × 4]> <S3: lm>
#> 5 Argentina   Americas  <tibble [12 × 4]> <S3: lm>
#> 6 Australia   Oceania   <tibble [12 × 4]> <S3: lm>
#> # ... with 136 more rows
by_country %>% 
  filter(continent == "Europe")
#> # A tibble: 30 x 4
#>   country                continent data              model   
#>   <fct>                  <fct>     <list>            <list>  
#> 1 Albania                Europe    <tibble [12 × 4]> <S3: lm>
#> 2 Austria                Europe    <tibble [12 × 4]> <S3: lm>
#> 3 Belgium                Europe    <tibble [12 × 4]> <S3: lm>
#> 4 Bosnia and Herzegovina Europe    <tibble [12 × 4]> <S3: lm>
#> 5 Bulgaria               Europe    <tibble [12 × 4]> <S3: lm>
#> 6 Croatia                Europe    <tibble [12 × 4]> <S3: lm>
#> # ... with 24 more rows
by_country %>% 
  arrange(continent, country)
#> # A tibble: 142 x 4
#>   country      continent data              model   
#>   <fct>        <fct>     <list>            <list>  
#> 1 Algeria      Africa    <tibble [12 × 4]> <S3: lm>
#> 2 Angola       Africa    <tibble [12 × 4]> <S3: lm>
#> 3 Benin        Africa    <tibble [12 × 4]> <S3: lm>
#> 4 Botswana     Africa    <tibble [12 × 4]> <S3: lm>
#> 5 Burkina Faso Africa    <tibble [12 × 4]> <S3: lm>
#> 6 Burundi      Africa    <tibble [12 × 4]> <S3: lm>
#> # ... with 136 more rows

再向其中添加残差:

by_country <- by_country %>% 
  mutate(
    resids = map2(data, model, add_residuals)
  )
by_country
#> # A tibble: 142 x 5
#>   country     continent data              model    resids           
#>   <fct>       <fct>     <list>            <list>   <list>           
#> 1 Afghanistan Asia      <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 2 Albania     Europe    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 3 Algeria     Africa    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 4 Angola      Africa    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 5 Argentina   Americas  <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 6 Australia   Oceania   <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> # ... with 136 more rows

再把数据集拆开unnest():

resids <- unnest(by_country, resids)
resids
#> # A tibble: 1,704 x 7
#>   country     continent  year lifeExp      pop gdpPercap   resid
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779. -1.11  
#> 2 Afghanistan Asia       1957    30.3  9240934      821. -0.952 
#> 3 Afghanistan Asia       1962    32.0 10267083      853. -0.664 
#> 4 Afghanistan Asia       1967    34.0 11537966      836. -0.0172
#> 5 Afghanistan Asia       1972    36.1 13079460      740.  0.674 
#> 6 Afghanistan Asia       1977    38.4 14880372      786.  1.65  
#> # ... with 1,698 more rows

现在再画出残差图:

resids %>% 
  ggplot(aes(year, resid)) +
    geom_line(aes(group = country), alpha = 1 / 3) + 
    geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

R for Data Science总结之——modelr(3)
或者按洲分组画出:

resids %>% 
  ggplot(aes(year, resid, group = country)) +
    geom_line(alpha = 1 / 3) + 
    facet_wrap(~continent)

R for Data Science总结之——modelr(3)
为了评价模型,可以采用broom::glance()提出一些模型质量参数:

broom::glance(nz_mod)
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#> *     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.954         0.949 0.804      205. 5.41e-8     2  -13.3  32.6  34.1
#> # ... with 2 more variables: deviance <dbl>, df.residual <int>

再通过map()函数对每一个模型添加glance():

by_country %>% 
  mutate(glance = map(model, broom::glance)) %>% 
  unnest(glance)
#> # A tibble: 142 x 16
#>   country continent data  model resids r.squared adj.r.squared sigma
#>   <fct>   <fct>     <lis> <lis> <list>     <dbl>         <dbl> <dbl>
#> 1 Afghan… Asia      <tib… <S3:… <tibb…     0.948         0.942 1.22 
#> 2 Albania Europe    <tib… <S3:… <tibb…     0.911         0.902 1.98 
#> 3 Algeria Africa    <tib… <S3:… <tibb…     0.985         0.984 1.32 
#> 4 Angola  Africa    <tib… <S3:… <tibb…     0.888         0.877 1.41 
#> 5 Argent… Americas  <tib… <S3:… <tibb…     0.996         0.995 0.292
#> 6 Austra… Oceania   <tib… <S3:… <tibb…     0.980         0.978 0.621
#> # ... with 136 more rows, and 8 more variables: statistic <dbl>,
#> #   p.value <dbl>, df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

而若想丢掉data, model, resids这些嵌套数据,可设置.drop = TRUE:

glance <- by_country %>% 
  mutate(glance = map(model, broom::glance)) %>% 
  unnest(glance, .drop = TRUE)
glance
#> # A tibble: 142 x 13
#>   country continent r.squared adj.r.squared sigma statistic  p.value    df
#>   <fct>   <fct>         <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>
#> 1 Afghan… Asia          0.948         0.942 1.22      181.  9.84e- 8     2
#> 2 Albania Europe        0.911         0.902 1.98      102.  1.46e- 6     2
#> 3 Algeria Africa        0.985         0.984 1.32      662.  1.81e-10     2
#> 4 Angola  Africa        0.888         0.877 1.41       79.1 4.59e- 6     2
#> 5 Argent… Americas      0.996         0.995 0.292    2246.  4.22e-13     2
#> 6 Austra… Oceania       0.980         0.978 0.621     481.  8.67e-10     2
#> # ... with 136 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>

再根据r.squared对模型好坏进行排序:

glance %>% 
  arrange(r.squared)
#> # A tibble: 142 x 13
#>   country continent r.squared adj.r.squared sigma statistic p.value    df
#>   <fct>   <fct>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 Rwanda  Africa       0.0172      -0.0811   6.56     0.175   0.685     2
#> 2 Botswa… Africa       0.0340      -0.0626   6.11     0.352   0.566     2
#> 3 Zimbab… Africa       0.0562      -0.0381   7.21     0.596   0.458     2
#> 4 Zambia  Africa       0.0598      -0.0342   4.53     0.636   0.444     2
#> 5 Swazil… Africa       0.0682      -0.0250   6.64     0.732   0.412     2
#> 6 Lesotho Africa       0.0849      -0.00666  5.93     0.927   0.358     2
#> # ... with 136 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>
glance %>% 
  ggplot(aes(continent, r.squared)) + 
    geom_jitter(width = 0.5)

R for Data Science总结之——modelr(3)
从中挑出很糟的模型:

bad_fit <- filter(glance, r.squared < 0.25)

gapminder %>% 
  semi_join(bad_fit, by = "country") %>% 
  ggplot(aes(year, lifeExp, colour = country)) +
    geom_line()

R for Data Science总结之——modelr(3)

list-columns数据结构

上述所有变换全部基于list-columns数据结构:

data.frame(x = list(1:3, 3:5))
#>   x.1.3 x.3.5
#> 1     1     3
#> 2     2     4
#> 3     3     5

若要取消这种变换特性可以使用I():

data.frame(
  x = I(list(1:3, 3:5)), 
  y = c("1, 2", "3, 4, 5")
)
#>         x       y
#> 1 1, 2, 3    1, 2
#> 2 3, 4, 5 3, 4, 5

而用tibble()则完全不会有这种转换策略:

tibble(
  x = list(1:3, 3:5), 
  y = c("1, 2", "3, 4, 5")
)
#> # A tibble: 2 x 2
#>   x         y      
#>   <list>    <chr>  
#> 1 <int [3]> 1, 2   
#> 2 <int [3]> 3, 4, 5

也可以用tribble():

tribble(
   ~x, ~y,
  1:3, "1, 2",
  3:5, "3, 4, 5"
)
#> # A tibble: 2 x 2
#>   x         y      
#>   <list>    <chr>  
#> 1 <int [3]> 1, 2   
#> 2 <int [3]> 3, 4, 5

通常创建list-column用pipes有以下几步:

  • nest(), summarise() + list()或者mutate() + map()
  • 直接map(), map2(), pmap()
  • 将list-column简化成数据集

创建list-columns

  • tidyr::nest()
  • mutate()
  • summarise()
  • tibble:enframe()

nest()嵌套

apminder %>% 
  group_by(country, continent) %>% 
  nest()
#> # A tibble: 142 x 3
#>   country     continent data             
#>   <fct>       <fct>     <list>           
#> 1 Afghanistan Asia      <tibble [12 × 4]>
#> 2 Albania     Europe    <tibble [12 × 4]>
#> 3 Algeria     Africa    <tibble [12 × 4]>
#> 4 Angola      Africa    <tibble [12 × 4]>
#> 5 Argentina   Americas  <tibble [12 × 4]>
#> 6 Australia   Oceania   <tibble [12 × 4]>
#> # ... with 136 more rows

或者直接不分组嵌套:

gapminder %>% 
  nest(year:gdpPercap)
#> # A tibble: 142 x 3
#>   country     continent data             
#>   <fct>       <fct>     <list>           
#> 1 Afghanistan Asia      <tibble [12 × 4]>
#> 2 Albania     Europe    <tibble [12 × 4]>
#> 3 Algeria     Africa    <tibble [12 × 4]>
#> 4 Angola      Africa    <tibble [12 × 4]>
#> 5 Argentina   Americas  <tibble [12 × 4]>
#> 6 Australia   Oceania   <tibble [12 × 4]>
#> # ... with 136 more rows

mutate()嵌套:

df <- tribble(
  ~x1,
  "a,b,c", 
  "d,e,f,g"
) 

df %>% 
  mutate(x2 = stringr::str_split(x1, ","))
#> # A tibble: 2 x 2
#>   x1      x2       
#>   <chr>   <list>   
#> 1 a,b,c   <chr [3]>
#> 2 d,e,f,g <chr [4]>

这种情况下即使没有对数据集进行nest(),也可以用unnest()处理:

df %>% 
  mutate(x2 = stringr::str_split(x1, ",")) %>% 
  unnest()
#> # A tibble: 7 x 2
#>   x1      x2   
#>   <chr>   <chr>
#> 1 a,b,c   a    
#> 2 a,b,c   b    
#> 3 a,b,c   c    
#> 4 d,e,f,g d    
#> 5 d,e,f,g e    
#> 6 d,e,f,g f    
#> # ... with 1 more row

另外一个map()的例子:

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)

sim %>%
  mutate(sims = invoke_map(f, params, n = 10))
#> # A tibble: 3 x 3
#>   f     params     sims      
#>   <chr> <list>     <list>    
#> 1 runif <list [2]> <dbl [10]>
#> 2 rnorm <list [1]> <dbl [10]>
#> 3 rpois <list [1]> <int [10]>

与summaries()联用,可用list()使本身只能产生一个结果的函数产生多个结果并放在list中一起返回:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = quantile(mpg))
#> Error in summarise_impl(.data, dots): Column `q` must be length 1 (a summary value), not 5

mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = list(quantile(mpg)))
#> # A tibble: 3 x 2
#>     cyl q        
#>   <dbl> <list>   
#> 1     4 <dbl [5]>
#> 2     6 <dbl [5]>
#> 3     8 <dbl [5]>
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(p = list(probs), q = list(quantile(mpg, probs))) %>% 
  unnest()
#> # A tibble: 15 x 3
#>     cyl     p     q
#>   <dbl> <dbl> <dbl>
#> 1     4  0.01  21.4
#> 2     4  0.25  22.8
#> 3     4  0.5   26  
#> 4     4  0.75  30.4
#> 5     4  0.99  33.8
#> 6     6  0.01  17.8
#> # ... with 9 more rows

对一个list直接进行tibble::enframe()生成数据集用法:

x <- list(
  a = 1:5,
  b = 3:4, 
  c = 5:6
) 

df <- enframe(x)
df
#> # A tibble: 3 x 2
#>   name  value    
#>   <chr> <list>   
#> 1 a     <int [5]>
#> 2 b     <int [2]>
#> 3 c     <int [2]>

这样就做到一行中一个为标签值,一个为所有数值,如果想对其进行操作,推荐map函数:

df %>% 
  mutate(
    smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
  )
#> # A tibble: 3 x 3
#>   name  value     smry 
#>   <chr> <list>    <chr>
#> 1 a     <int [5]> a: 1 
#> 2 b     <int [2]> b: 3 
#> 3 c     <int [2]> c: 5

简化list-columns

为了对数据集进行操纵和可视化,需要将list-column变回元数据:

  • 如果想得到单一值,联用mutate()和map_lgl(), map_int(), map_dbl(), map_chr()
  • 如果想得到多个值用unnest()
df <- tribble(
  ~x,
  letters[1:5],
  1:3,
  runif(5)
)
  
df %>% mutate(
  type = map_chr(x, typeof),
  length = map_int(x, length)
)
#> # A tibble: 3 x 3
#>   x         type      length
#>   <list>    <chr>      <int>
#> 1 <chr [5]> character      5
#> 2 <int [3]> integer        3
#> 3 <dbl [5]> double         5
df <- tribble(
  ~x,
  list(a = 1, b = 2),
  list(a = 2, c = 4)
)
df %>% mutate(
  a = map_dbl(x, "a"),
  b = map_dbl(x, "b", .null = NA_real_)
)
#> # A tibble: 2 x 3
#>   x              a     b
#>   <list>     <dbl> <dbl>
#> 1 <list [2]>     1     2
#> 2 <list [2]>     2    NA

unnest()的操作就简单许多:

tibble(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
#> # A tibble: 5 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1     1
#> 2     1     2
#> 3     1     3
#> 4     1     4
#> 5     2     1

这里注意,unnest()只能一列一列解嵌套,一个x值对应list中的一个组分:

# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
  ~x, ~y,           ~z,
   1, c("a", "b"), 1:2,
   2, "c",           3
)
df1
#> # A tibble: 2 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <chr [2]> <int [2]>
#> 2     2 <chr [1]> <dbl [1]>
df1 %>% unnest(y, z)
#> # A tibble: 3 x 3
#>       x y         z
#>   <dbl> <chr> <dbl>
#> 1     1 a         1
#> 2     1 b         2
#> 3     2 c         3

# Doesn't work because y and z have different number of elements
df2 <- tribble(
  ~x, ~y,           ~z,
   1, "a",         1:2,  
   2, c("b", "c"),   3
)
df2
#> # A tibble: 2 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <chr [1]> <int [2]>
#> 2     2 <chr [2]> <dbl [1]>
df2 %>% unnest(y, z)
#> Error: All nested columns must have the same number of elements.

broom包制作tidy data

broom包可以有效地将模型参数做成tidy格式地数据集:

  • broom::glance(model): 对每个模型生成一行总结模型信息地数据行
  • broom::tidy(model): 返回一行总结模型参数以及预测值
  • broom::augment(model, data): 返回行对数据集中的每一行添加残差,统计量影响等参数

以上代码已上传Github Click Here

相关文章: