SAS Module 6 (Advanced Techniques): Bagging, Forest, Boosting

Types of Error in Predictive Models:

  • Bias: generally caused by under-fit models. The model is too simple to capture the complex real-life system. The more complex the model, the less bias it generally has.
  • Variance: generally caused by over-fit models. It is the variability of the predictions observed when the same model is fit to different training data sets.
  • We always need to make a tradeoff between bias and variance.
    Therefore, we use ensemble models (combinations of the predictions from several component models), such as bagging, random forests, and boosting, to improve prediction accuracy.
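Why averaging component models helps: it leaves the bias unchanged but shrinks the variance. A minimal Python sketch (not from the module; the "models" here are just unbiased noisy guesses standing in for trees trained on different samples):

```python
import random

random.seed(0)
TRUE_VALUE = 10.0

def noisy_prediction():
    # One "model" fit on a random training sample: unbiased, but high variance.
    return TRUE_VALUE + random.gauss(0, 2.0)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Compare single-model predictions with ensembles that average B = 25 models.
singles = [noisy_prediction() for _ in range(1000)]
ensembles = [sum(noisy_prediction() for _ in range(25)) / 25 for _ in range(1000)]
print(variance(singles) > variance(ensembles))  # averaging shrinks variance
```

With B independent models the ensemble variance drops by roughly a factor of B, which is the motivation behind bagging and forests below.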

Bagging:
For a regression model:

  • Take repeated samples with replacement from the training set: generate B different bootstrapped training data sets and grow B different decision trees
  • Take the mean of the B responses as the final response
  • Use mtry = p (the number of predictors): all p predictors are considered as split candidates at every split (this is what later distinguishes bagging from a random forest)
For a classification model:

  • Record the class that each tree assigns to each observation and take a majority vote
  • If the component models provide probability estimates (e.g., logistic regression), we can also average the probabilities and choose the class with the highest average
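The regression recipe above can be sketched in plain Python. This is an illustrative sketch only: decision stumps (one-split trees) stand in for the full trees SAS would grow, and all names are made up:

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one split threshold, two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:          # candidate split points
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                        # degenerate bootstrap sample: predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_regressor(xs, ys, B=50, seed=0):
    rng, n = random.Random(seed), len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / B    # mean of the B responses

xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 0.9, 1.0, 3.0, 3.2, 2.9]
model = bagged_regressor(xs, ys)
print(model(1.5) < model(5.5))   # low-x region predicts lower than high-x region
```

For classification, the last line of `bagged_regressor` would instead take a majority vote over the B class labels (or average the B probability estimates).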

Random Forest:

  • Same principle as bagging, but usually more accurate
  • Bagging uses all predictors as split candidates, so a few strong predictors tend to be chosen at the top of every tree and the resulting trees are similar. Random forests extend the bagging technique by limiting the split candidates at each split to a random subset of mtry < p predictors (typically mtry ≈ sqrt(p)), producing more variation among the trees in the ensemble. Random forests "de-correlate" the bagged trees, leading to a greater reduction in variance
  • Forests tend to give better prediction than any specific tree, and often outperform other classes of models
  • Forests are hard to interpret, but they can be considered an “ideal” model for other models to be compared against
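The de-correlation idea can be shown by restricting each tree's split search to a random subset of mtry ≈ sqrt(p) predictors. An illustrative Python sketch with one-split trees (in a real forest the subset is re-drawn at every split; with stumps, which have only one split, per-tree sampling coincides with per-split sampling):

```python
import math, random

def fit_stump(rows, ys, feats):
    """Best single split, searching only the candidate features in `feats`."""
    best = None
    for f in feats:
        for t in sorted({r[f] for r in rows})[:-1]:
            left  = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:                         # no valid split: predict the mean
        m = sum(ys) / len(ys)
        return lambda r: m
    _, f, t, lm, rm = best
    return lambda r: lm if r[f] <= t else rm

def random_forest(rows, ys, B=30, seed=1):
    rng, n, p = random.Random(seed), len(rows), len(rows[0])
    mtry = max(1, int(math.sqrt(p)))         # forest: mtry < p; plain bagging would use mtry = p
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap the observations
        feats = rng.sample(range(p), mtry)           # random predictor subset -> de-correlated trees
        trees.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], feats))
    return lambda r: sum(t(r) for t in trees) / len(trees)

# Tiny made-up data set: 6 observations, p = 4 predictors, response driven by feature 0.
rows = [(0, 5, 1, 9), (1, 3, 2, 8), (2, 8, 3, 7), (8, 1, 4, 6), (9, 7, 5, 5), (10, 2, 6, 4)]
ys = [1, 1, 1, 5, 5, 5]
model = random_forest(rows, ys)
print(model((0.5, 4, 1.5, 8.5)) < model((9.5, 4, 5.5, 4.5)))
```

Setting `mtry = p` in this sketch recovers plain bagging, which makes the single code path a convenient way to compare the two.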

Gradient Boosting: use errors to improve the model
Boosting steps:

  1. Predict the response with a simple tree: y0hat
  2. Compute the error of this prediction: e0 = y - y0hat
  3. Fit a new tree to predict the error made by the previous tree: e0hat
  4. Set y1hat = y0hat + shrinkage parameter * e0hat to get an improved predicted response
  5. Compute the new error e1 from the actual response: e1 = y - y1hat
  6. Repeat steps 2-5 many times (e.g., 100 iterations)
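The steps above can be sketched in plain Python. Illustrative only: a decision stump stands in for each tree fit to the errors, and the initial "simple tree" is just the mean response:

```python
def fit_stump(xs, ys):
    """Depth-1 regression tree: one split threshold, two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=100, shrinkage=0.1):
    base = sum(ys) / len(ys)                 # step 1: simple initial prediction y0hat
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):                  # step 6: repeat steps 2-5
        errors = [y - p for y, p in zip(ys, preds)]      # steps 2/5: e = y - yhat
        stump = fit_stump(xs, errors)                    # step 3: predict the error, ehat
        stumps.append(stump)
        preds = [p + shrinkage * stump(x)                # step 4: yhat += shrinkage * ehat
                 for x, p in zip(xs, preds)]
    return lambda x: base + shrinkage * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
print(abs(model(1.5) - 1.0) < 0.01, abs(model(5.5) - 5.0) < 0.01)
```

Each round shrinks the remaining error by a small step, which is why many slow iterations with a small shrinkage parameter beat one aggressive fit.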
