From: Out-of-core classification of text documents
Code:
""" ====================================================== Out-of-core classification of text documents ====================================================== This is an example showing how scikit-learn can be used for classification using an out-of-core approach: learning from data that doesn't fit into main memory. We make use of an online classifier, i.e., one that supports the partial_fit method, that will be fed with batches of examples. To guarantee that the features space remains the same over time we leverage a HashingVectorizer that will project each example into the same feature space. This is especially useful in the case of text classification where new features (words) may appear in each batch. The dataset used in this example is Reuters-21578 as provided by the UCI ML repository. It will be automatically downloaded and uncompressed on first run. The plot represents the learning curve of the classifier: the evolution of classification accuracy over the course of the mini-batches. Accuracy is measured on the first 1000 samples, held out as a validation set. To limit the memory consumption, we queue examples up to a fixed amount before feeding them to the learner. """ # Authors: Eustache Diemert <eustache@diemert.fr> # @FedericoV <https://github.com/FedericoV/> # License: BSD 3 clause from __future__ import print_function from glob import glob import itertools import os.path import re import tarfile import time import numpy as np import matplotlib.pyplot as plt from matplotlib import rcParams from sklearn.externals.six.moves import html_parser from sklearn.externals.six.moves import urllib from sklearn.datasets import get_data_home from sklearn.feature_extraction.text import HashingVectorizer from sklearn.linear_model import SGDClassifier from sklearn.linear_model import PassiveAggressiveClassifier from sklearn.linear_model import Perceptron from sklearn.naive_bayes import MultinomialNB def _not_in_sphinx(): # Hack to detect whether we are running by the sphinx builder return '__file__' in globals() ############################################################################### # Reuters Dataset related routines # -------------------------------- #