Overview: Statistical Machine Learning
Terminology and abbreviations
MSE (mean squared error)
$\mathrm{MSE}(f) = E[L(Y, f(X))] = E[(Y - f(X))^2]$
MCE (misclassification error)
$\mathrm{MCE}(f) = E[L(Y, f(X))] = E[I(Y \neq f(X))]$
$\mathrm{Bias}(\hat{f}(X)) = E[\hat{f}(X)] - f(X)$
$\mathrm{Var}(\hat{f}(X)) = E[(\hat{f}(X) - E[\hat{f}(X)])^2]$
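The bias and variance definitions can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a fixed point where the true value is f(x0) = 2.0, and a hypothetical estimator that shrinks the mean of n noisy observations by a factor of 0.8. All names and parameter values are illustrative assumptions.

```python
import random
import statistics

def simulate_bias_variance(f_x0=2.0, shrink=0.8, n=20, sigma=1.0,
                           reps=5000, seed=0):
    """Monte Carlo estimate of the bias and variance of a shrunken mean.

    Hypothetical setup: each replication draws n observations
    f_x0 + Gaussian noise and estimates f_x0 by shrink * sample mean.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = [f_x0 + rng.gauss(0.0, sigma) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))
    mean_estimate = statistics.fmean(estimates)
    # Bias: E[f_hat] - f, with E[f_hat] approximated by the average estimate
    bias = mean_estimate - f_x0
    # Variance: E[(f_hat - E[f_hat])^2], approximated the same way
    variance = statistics.fmean([(e - mean_estimate) ** 2 for e in estimates])
    return bias, variance

bias, variance = simulate_bias_variance()
print(bias, variance)
```

Under these assumptions the exact values are bias = (0.8 − 1) × 2.0 = −0.4 and variance = 0.8² × 1.0² / 20 = 0.032, so the simulated numbers should land close to them.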
Statistics and machine learning
“Different” terminologies:
| Machine Learning | Statistics |
|---|---|
| Supervised learning | Classification/regression |
| Unsupervised learning | Clustering |
| Semisupervised learning | Class’n/reg’n with missing responses |
| Manifold learning | (Nonlinear) dimension reduction |
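Supervised learning:
the data are pairs $(x, y)$ with $x \in \mathbb{R}^p$ (x has dimension p) and $y \in \mathbb{R}$;
a model is trained on these pairs to perform classification or regression.
Unsupervised learning:
the data are only $x \in \mathbb{R}^p$ (x has dimension p), with no response y;
training on them supports tasks such as clustering.
Semisupervised learning:
some observations in the dataset include the response y,
but most are just x without y;
for example, a large amount of data is collected with a Python crawler and only some of it is labeled by hand.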
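Manifold learning:
corresponds to (nonlinear) dimension reduction in statistical terminology, as listed in the table above.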
| Parametric models | Nonparametric models |
|---|---|
| Linear/polynomial regression model | Local smoothing |
| Generalized linear regression model | Smoothing splines |
| Fisher’s discriminant analysis | Classification and regression trees; random forest; boosting |
| Logistic regression | Support vector machines |
| Deep learning models | |
prediction and inference
Classification
Two approaches to classifying the example data:
1 Linear regression
2 Nearest neighbors
Left panel shows the result of 15-NN classifier; a few training
data are misclassified, and the decision boundary adapts to the
local density of the classes
Right panel shows the result of 1-NN classifier; none of the
training data is misclassified
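The contrast between the two panels can be reproduced on synthetic data. The sketch below is an illustration only: the two-class Gaussian data, the class centers, and the function names are assumptions, not the data behind the figure. It shows why a 1-NN classifier has zero training error: each training point is its own nearest neighbor.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))
    nearest = sorted(range(len(train_x)), key=sq_dist)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def training_error(train_x, train_y, k):
    """Fraction of training points misclassified by the k-NN rule."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y
                for x, y in zip(train_x, train_y))
    return wrong / len(train_x)

# Synthetic two-class data in the plane (an assumption for illustration)
rng = random.Random(0)
xs, ys = [], []
for label, center in ((0, 0.0), (1, 1.5)):
    for _ in range(50):
        xs.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
        ys.append(label)

err_1nn = training_error(xs, ys, k=1)
err_15nn = training_error(xs, ys, k=15)
print(err_1nn, err_15nn)
```

A zero training error for 1-NN says nothing about its test error; the 15-NN rule typically misclassifies a few training points because it averages over a neighborhood.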
Model assessment for regression
MSE (mean squared error)
$\mathrm{MSE}(f) = E[L(Y, f(X))] = E[(Y - f(X))^2]$
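training error: the average loss over the data used to fit the model
test error: the average loss over new data not used in fitting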
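Training and test error under squared-error loss can be computed as follows. This is a minimal sketch under stated assumptions: the data come from a hypothetical model y = 1 + 2x + noise with noise standard deviation 0.5, and the fitted model is a simple least-squares line. None of these choices come from the notes.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mse(xs, ys, intercept, slope):
    """Average squared difference between responses and predictions."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    """Hypothetical data-generating process: y = 1 + 2x + N(0, 0.5^2)."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)   # independent draw, not used in fitting

intercept, slope = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, intercept, slope)
test_mse = mse(test_x, test_y, intercept, slope)
print(train_mse, test_mse)
```

Both values should be near the noise variance 0.25 here, because the fitted model family contains the true regression function.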
Model assessment for classification
MCE (misclassification error)
$\mathrm{MCE}(f) = E[L(Y, f(X))] = E[I(Y \neq f(X))]$
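training error: the fraction of misclassified observations in the data used to fit the classifier
test error: the fraction of misclassified observations in new data not used in fitting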
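The empirical version of the misclassification error is the fraction of labels that differ from the predictions. A small sketch, with made-up labels and a hypothetical function name:

```python
def misclassification_error(y_true, y_pred):
    """Average of the 0-1 loss I(y != f(x)) over the given observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions for illustration
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]
mce = misclassification_error(y_true, y_pred)
print(mce)  # 2 of the 5 labels differ, so 0.4
```

Applied to the training data this gives the training error; applied to held-out data it gives an estimate of the test error.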
Validation set approach
If we have a large training set, we can estimate the test error by randomly splitting the data into a training part and a validation part. Use the training part to build the model, and then assess the model by applying it to the validation part.
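The validation set approach can be sketched as follows. The data-generating model (y = 2x + noise), the 25% validation fraction, and the through-the-origin least-squares fit are all assumptions chosen to keep the example short.

```python
import random

def train_validation_split(pairs, validation_fraction, rng):
    """Randomly split (x, y) pairs into training and validation parts."""
    shuffled = pairs[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical data: y = 2x + N(0, 0.5^2)
rng = random.Random(2)
data = []
for _ in range(400):
    x = rng.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

train, validation = train_validation_split(data, 0.25, rng)

# Build the model on the training part only:
# least-squares slope for a line through the origin
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# Assess the model on the validation part
validation_mse = sum((y - slope * x) ** 2
                     for x, y in validation) / len(validation)
print(len(train), len(validation), validation_mse)
```

The validation MSE depends on which observations land in each part, so a different random split gives a somewhat different estimate.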
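LOOCV (leave-one-out cross-validation)
Split the data set of size n into
a training set of size n − 1
a validation set of size 1
Repeat this process n times, holding out each observation once, and average the n validation errors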
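The LOOCV procedure can be written out directly. As a sketch, the model here is deliberately trivial (predict the mean of the training responses), which is an assumption made so the result can be verified by hand; any fitting procedure could take its place.

```python
def loocv_mse(ys):
    """Leave-one-out cross-validation MSE for the sample-mean predictor."""
    n = len(ys)
    total = sum(ys)
    squared_errors = []
    for y in ys:
        # Training set: the other n - 1 observations; model: their mean
        prediction = (total - y) / (n - 1)
        # Validation set: the single held-out observation
        squared_errors.append((y - prediction) ** 2)
    # Average the n validation errors
    return sum(squared_errors) / n

ys = [1.0, 2.0, 3.0, 4.0, 5.0]
print(loocv_mse(ys))  # 3.125
```

By hand: the held-out predictions are 3.5, 3.25, 3.0, 2.75, 2.5, giving squared errors 6.25, 1.5625, 0, 1.5625, 6.25, whose average is 3.125.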