What are the advantages of different classification algorithms?

For instance, if we have a large training data set with more than 10,000 instances and more than 100,000 features, which classifier would be the best choice?
12 ANSWERS

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, your choice of classification algorithm might not really matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
 
And if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize and Middle Earth, just use an ensemble method to choose them all!
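A minimal sketch of "try a bunch of classifiers and select by cross-validation", assuming scikit-learn is available; the candidate models and the synthetic dataset here are purely illustrative:

```python
# Pick the best of several classifiers by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for your real data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "rf": RandomForestClassifier(random_state=0),
}

# Mean cross-validated accuracy for each candidate.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On real data you would use your own feature matrix and, ideally, a metric appropriate to your problem instead of plain accuracy.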
  
 
 
There are a number of dimensions you can look at to give you a sense of what will be a reasonable algorithm to start with, namely:
  • Number of training examples
  • Dimensionality of the feature space
  • Do I expect the problem to be linearly separable?
  • Are features independent?
  • Are features expected to be in a linear scale?
  • Is overfitting expected to be a problem?
  • What are the system's requirements in terms of speed/performance/memory usage...?
  • ...

This list may seem a bit daunting because many of these questions are not straightforward to answer. The good news, though, is that as with many problems in life, you can address this question by following Occam's Razor: use the least complicated algorithm that can address your needs, and only go for something more complicated if strictly necessary.

Logistic Regression

As a general rule of thumb, I would recommend starting with Logistic Regression. Logistic regression is a pretty well-behaved classification algorithm that you can expect to work well as long as your features are roughly linear and the problem is linearly separable. You can do some feature engineering to turn most non-linear features into linear ones fairly easily. It is also pretty robust to noise, and you can avoid overfitting, and even do feature selection, by using l2 or l1 regularization. Logistic regression can also be used in Big Data scenarios since it is pretty efficient and can be distributed using, for example, ADMM. A final advantage of LR is that the output can be interpreted as a probability. This is a nice side effect, since you can use it, for example, for ranking instead of classification.

Even in a case where you would not expect Logistic Regression to work 100%, do yourself a favor and run a simple l2-regularized LR to come up with a baseline before you go into using "fancier" approaches.
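A minimal sketch of such an l2-regularized LR baseline, assuming scikit-learn; the synthetic dataset stands in for your own:

```python
# l2-regularized Logistic Regression as a baseline classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty="l2" is the default; C is the inverse regularization strength.
baseline = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
baseline.fit(X_train, y_train)
acc = baseline.score(X_test, y_test)
print(round(acc, 3))

# The outputs of predict_proba can be read as probabilities,
# e.g. for ranking instead of hard classification.
probs = baseline.predict_proba(X_test)[:, 1]
```

Whatever fancier approach you try later, it has to beat this number to be worth the extra complexity.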

Ok, so now that you have set your baseline with Logistic Regression, what should your next step be? I would basically recommend two possible directions: (1) SVMs, or (2) Tree Ensembles. If I knew nothing about your problem, I would definitely go for (2), but I will start by describing why SVMs might be worth considering.

Support Vector Machines

Support Vector Machines (SVMs) use a different loss function (hinge) from LR. They are also interpreted differently (maximum-margin). However, in practice, an SVM with a linear kernel is not very different from a Logistic Regression (if you are curious, you can see how Andrew Ng derives SVMs from Logistic Regression in his Coursera Machine Learning course). The main reason you would want to use an SVM instead of a Logistic Regression is that your problem might not be linearly separable. In that case, you will have to use an SVM with a non-linear kernel (e.g. RBF). The truth is that a Logistic Regression can also be used with a different kernel, but at that point you might be better off going for SVMs for practical reasons. Another related reason to use SVMs is if you are in a high-dimensional space. For example, SVMs have been reported to work better for text classification.

Unfortunately, the major downside of SVMs is that they can be painfully slow to train: fitting a non-linear kernel SVM scales poorly (roughly quadratically or worse) with the number of training examples. So I would not recommend them for any problem where you have many training examples. I would actually go even further and say that I would not recommend SVMs for most "industry scale" applications. Anything beyond a toy/lab problem might be better approached with a different algorithm.
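A sketch of the non-linearly-separable case, assuming scikit-learn; the two-moons dataset is a standard toy problem that a linear classifier cannot separate but an RBF-kernel SVM handles easily:

```python
# RBF-kernel SVM on a non-linearly-separable toy problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="rbf" lets the SVM learn a non-linear decision boundary;
# C and gamma are the two knobs you would tune in practice.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")
rbf_svm.fit(X_train, y_train)
acc = rbf_svm.score(X_test, y_test)
print(round(acc, 3))
```

On a few hundred points this is instant; the scaling problems described above only bite once the training set grows into the tens or hundreds of thousands.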

Tree Ensembles

This gets me to the third family of algorithms: Tree Ensembles. This basically covers two distinct algorithms: Random Forests and Gradient Boosted Trees. I will talk about the differences later, but for now let me treat them as one for the purpose of comparing them to Logistic Regression.

Tree Ensembles have different advantages over LR. One main advantage is that they do not expect linear features, or even features that interact linearly. Something I did not mention about LR is that it can hardly handle categorical features without encoding them (e.g. as binary indicators). Tree Ensembles, because they are nothing more than a bunch of Decision Trees combined, handle this very well. The other main advantage is that, because of how they are constructed (using bagging or boosting), these algorithms handle high-dimensional spaces and large numbers of training examples very well.

As for the difference between Random Forests (RF) and Gradient Boosted Decision Trees (GBDT), I won't go into many details, but one easy way to understand it is that GBDTs will usually perform better, but they are harder to get right. More concretely, GBDTs have more hyper-parameters to tune and are also more prone to overfitting. RFs almost work "out of the box", and that is one reason why they are very popular.
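A sketch of the trade-off, assuming scikit-learn; the GBDT hyper-parameters spelled out below are exactly the ones that usually need tuning, while the RF is left close to its defaults:

```python
# Random Forest "out of the box" vs. Gradient Boosted Trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
gbdt = GradientBoostingClassifier(
    n_estimators=200,   # number of boosting stages
    learning_rate=0.1,  # shrinkage; interacts with n_estimators
    max_depth=3,        # shallow trees are typical for boosting
    random_state=0,
)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
gbdt_score = cross_val_score(gbdt, X, y, cv=5).mean()
print(round(rf_score, 3), round(gbdt_score, 3))
```

Which one wins depends on the data and on how much tuning effort you invest in the GBDT.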

Deep Learning

Last but not least, this answer would not be complete without at least a minor reference to Deep Learning. I would definitely not recommend it as a general-purpose technique for classification. But you have probably heard how well these methods perform in some cases, such as image classification. If you have gone through the previous steps and still feel you can squeeze something out of your problem, you might want to try a Deep Learning approach. The truth is that if you use an open-source implementation such as Theano, you can get an idea of how some of these approaches perform on your dataset pretty quickly.

Summary

So, recapping, start with something simple like Logistic Regression to set a baseline and only make it more complicated if you need to. At that point, tree ensembles, and in particular Random Forests since they are easy to tune, might be the right way to go. If you feel there is still room for improvement, try GBDT or get even fancier and go for Deep Learning.

You can also take a look at the Kaggle Competitions. If you search for the keyword "classification" and select those that are completed, you will get a good sense of what people used to win competitions that might be similar to your problem at hand. At that point you will probably realize that using an ensemble is always likely to make things better. The only problem with ensembles, of course, is that they require maintaining all the independent methods working in parallel. That might be your final step to get as fancy as it gets.
  
 
 
 
 
A few important criteria should be addressed:

  • Does it require variables to be normally distributed?
  • Does it suffer from multicollinearity?
  • Does it do as well with categorical variables as with continuous variables?
  • Does it calculate confidence intervals (CI) without cross-validation (CV)?
  • Does it perform variable selection without stepwise procedures?
  • Does it apply to sparse data?

Here is the comparison:

Logistic regression: no distributional requirement, performs well with categorical variables that have few categories, computes the logistic distribution, easy to interpret, computes CIs, suffers from multicollinearity

Decision Trees: no distributional requirement, heuristic, good for categorical variables with few categories, do not suffer from multicollinearity (by choosing one of the correlated variables)

Naive Bayes (NB): generally no requirements, good for categorical variables with few categories, computes the product of independent distributions, suffers from multicollinearity

LDA (Linear Discriminant Analysis, not Latent Dirichlet Allocation): requires normality, not good for categorical variables with few categories, computes the sum over a multivariate distribution, computes CIs, suffers from multicollinearity

SVM: no distributional requirement, computes hinge loss, flexible selection of kernels for non-linear correlation, does not suffer from multicollinearity, hard to interpret

Lasso: no distributional requirement, computes L1 loss, performs variable selection, suffers from multicollinearity

Ridge: no distributional requirement, computes L2 loss, no variable selection, does not suffer from multicollinearity

Bagging, boosting, and ensemble methods (RF, AdaBoost, etc.): generally outperform the single algorithms listed above.



Above all, Logistic Regression is still the most widely used for its good properties. But if the variables are normally distributed and the categorical variables all have 5+ categories, you may be surprised by the performance of LDA; if the correlations are mostly non-linear, you can't beat a good SVM;
and if sparsity and multicollinearity are a concern, I would recommend Adaptive Lasso with Ridge (weights) + Lasso. This should suffice for most scenarios without much tuning.

And in the end if you need one fine tuned model, go for ensemble methods.

PS: I just saw the sub-question. With 10,000 instances and more than 100,000 features, the quick answer would be Lasso.
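A sketch of that regime (features vastly outnumbering instances), assuming scikit-learn and using l1-regularized logistic regression as the Lasso-style classifier; the data is synthetic:

```python
# l1 regularization when features >> samples: most coefficients
# are driven to exactly zero, so the model selects variables.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Far more features than samples, with only a few informative ones.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=10, random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

n_selected = int(np.count_nonzero(clf.coef_))
print(n_selected, "of", clf.coef_.size, "features kept")
```

In practice you would tune C (the inverse regularization strength) by cross-validation to balance sparsity against accuracy.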
  
 
 
 
 
Random forests - classification description
  
 
 
http://www.amsta.leeds.ac.uk/~ch...  

D. Michie, D.J. Spiegelhalter, C.C. Taylor (eds). Machine Learning, Neural and Statistical Classification
  
 
 
  • Decision Trees are fast to train and easy to evaluate and interpret.
  • Support Vector Machines give good accuracy and the flexibility that comes with kernels.
  • Neural networks are slow to converge and hard to set parameters for, but done with care they work well.
  • Bayesian classifiers are easy to understand.
  
 
 
 
 
I'm currently working on these kinds of algorithms for newswire classification into about 10 categories. I'm comparing kNN, a tweaked Naive Bayes, and Rocchio's algorithm. I wanted very simple algorithms, since my dataset is effectively unbounded and because SVM, for instance, seems a pain to implement myself.

- kNN should be avoided in my case, since evaluation is quite heavy if your training dataset contains several thousand elements, although it gives really good results.

- Naive Bayes is very simple and quick to evaluate, but I had to tweak it to handle unbalanced classes.

- Rocchio seems very naive, but it works surprisingly well and is very efficient.

Finally, I use a combination of Naive Bayes and Rocchio to gain accuracy, on the same principle as boosting (a linear mixture obtained by cross-validation). You can also use EM on NB or Rocchio, since the formulation is very simple in these cases. This could help.
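A rough sketch of such a linear mixture, assuming scikit-learn and numpy; the Rocchio-style scores below are a hand-rolled nearest-centroid approximation, and the fixed mixing weight alpha stands in for one chosen by cross-validation:

```python
# Linearly mix Naive Bayes probabilities with Rocchio-style
# (nearest-centroid) scores for a binary problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive Bayes scores (class probabilities).
nb = GaussianNB().fit(X_train, y_train)
nb_scores = nb.predict_proba(X_test)

# Rocchio-style scores: distance to each class centroid,
# turned into pseudo-probabilities.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
rocchio_scores = np.exp(-dists) / np.exp(-dists).sum(axis=1, keepdims=True)

alpha = 0.5  # mixing weight; pick by cross-validation in practice
mixed = alpha * nb_scores + (1 - alpha) * rocchio_scores
pred = mixed.argmax(axis=1)
accuracy = (pred == y_test).mean()
print(round(accuracy, 3))
```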

All in all, I'd say that this is very data-dependent. The best technique depends on the data and on what accuracy/efficiency trade-off you are expecting.
  
 
 
Decision trees and rule-based algorithms are good because you can understand the model that was built for classifying, unlike with neural networks.
Support Vector Machines work very well in many circumstances and perform very well with large amounts of data.
Association-rule algorithms such as Apriori have excellent performance, due to how the algorithm is built, and always reach the proper solution.
The Naive Bayes mechanism is very simple to understand; it also performs well and is easy to implement.
  
 
 
 
 
Page on gputechconf.com

and there are distributed-memory, parallel implementations running on up to 1000 nodes.

Interpretation:
Linear SVMs are very easy to interpret; it is the non-linear case that is a bit tricky. Any non-linear problem is going to require some work to interpret, in that you would need to find a compact basis set to represent the non-linearities. Heck, if you know the basis set a priori, you can just project your data onto this basis, run a linear SVM, and the problem becomes trivial.
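A sketch of what "easy to interpret" means for a linear SVM, assuming scikit-learn: the model is just one weight per feature, which you can rank directly (feature indices here are synthetic):

```python
# Interpret a linear SVM by inspecting its learned weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# One weight per feature: the sign gives the direction of influence,
# the magnitude (for comparably scaled features) gives the importance.
weights = clf.coef_.ravel()
ranking = np.argsort(-np.abs(weights))
for i in ranking[:3]:
    print(f"feature_{i}: weight={weights[i]:.3f}")
```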
  
 
 
 
 
Having tried linear/polynomial regression, logistic regression, neural nets, genetic programming, and SVMs on a recent large/noisy data project, I agree with most of the above. I'd add GP (Genetic Programming) to the mix when you can reduce the data via a set of (possibly parameterized) feature detectors with some simple combination rules. GP is often slow to converge, and the results are usually hard to understand (unless you have some parsimony built in), but it gives more insight than SVMs and neural nets.
  
 
 
 
 
1. What's the business problem?

What you need to do, transparency and governance, regulatory needs, business interpretation issues, and other considerations will outweigh the maths in a business setting.

2. Data determines the result more than the algorithm

If you have lots of good data, all algorithms are good. Multicollinearity is less of a problem with large datasets.
  
