Hands-On Machine Learning with Scikit-Learn & TensorFlow Exercise Q&A Chapter02

For some unknown reason, the Scikit-Learn package on my computer cannot use ColumnTransformer, so I never managed to produce the housing_prepared data. The answers and results below are therefore the official ones.
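One possible cause is an old installation: ColumnTransformer lives in sklearn.compose, which was added in Scikit-Learn 0.20. The following is a minimal sketch for checking this, not a diagnosis of the actual problem; the helper name column_transformer_available is made up for illustration.

```python
import importlib


def column_transformer_available():
    """Return (sklearn_version, available); hypothetical helper for illustration."""
    try:
        sklearn = importlib.import_module("sklearn")
    except ImportError:
        # Scikit-Learn is not installed at all
        return None, False
    try:
        # sklearn.compose (and ColumnTransformer) was added in Scikit-Learn 0.20
        compose = importlib.import_module("sklearn.compose")
        return sklearn.__version__, hasattr(compose, "ColumnTransformer")
    except ImportError:
        return sklearn.__version__, False


version, available = column_transformer_available()
print("Scikit-Learn version:", version)
print("ColumnTransformer available:", available)
```

If the second line prints False, upgrading Scikit-Learn is the first thing to try.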

Q1. Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best SVR predictor perform?

A1:

model:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Two sub-grids: only C for the linear kernel, C and gamma for the RBF kernel
param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]

svm_reg = SVR()
grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search.fit(housing_prepared, housing_labels)

evaluate:

negative_mse = grid_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
grid_search.best_params_
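The sign flip in the snippet above is needed because GridSearchCV always maximizes its score, so error metrics are reported negated. Here is a small sketch of the conversion with made-up labels and predictions (the numbers are illustrative, not results from the housing data):

```python
import math

# Made-up values, for illustration only
y_true = [200000.0, 150000.0, 300000.0]
y_pred = [210000.0, 140000.0, 330000.0]

# Mean squared error of the predictions
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# A 'neg_mean_squared_error' scorer reports the negated MSE (higher is better)
negative_mse = -mse

# Undo the negation before taking the square root to get the RMSE
rmse = math.sqrt(-negative_mse)
print(round(rmse, 2))
```

Note that best_score_ is the mean of the per-fold scores, so the RMSE computed from it is the square root of the mean MSE, which is not quite the same as the mean of the per-fold RMSEs.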

result: (shown as an image in the original post)

Q2. Try replacing GridSearchCV with RandomizedSearchCV.

A2:

model:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

# Note: gamma is ignored when kernel is 'linear'
param_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(20, 200000),
    'gamma': expon(scale=1.0),
}

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, n_jobs=4, random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

evaluate:

negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
rnd_search.best_params_

result: (shown as an image in the original post)

Visualizing the two distributions used above for gamma and C:

exponential distribution:

import matplotlib.pyplot as plt

expon_distrib = expon(scale=1.)
samples = expon_distrib.rvs(10000, random_state=42)
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Exponential distribution (scale=1.0)")
plt.hist(samples, bins=50)
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)
plt.show()

reciprocal distribution:

reciprocal_distrib = reciprocal(20, 200000)
samples = reciprocal_distrib.rvs(10000, random_state=42)
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Reciprocal distribution (a=20, b=200000)")
plt.hist(samples, bins=50)
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)
plt.show()
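The reciprocal distribution is useful when you do not know the right scale for a hyperparameter: its logarithm is uniform, so every order of magnitude in the range is equally likely. The sketch below checks this numerically using NumPy only; it draws log-uniform samples by hand instead of calling scipy.stats.reciprocal, and the decade edges are chosen just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = 20, 200000

# Log-uniform sampling: draw uniformly in log-space, then exponentiate
samples = np.exp(rng.uniform(np.log(low), np.log(high), size=10000))

# Count how many samples fall in each of the four decades of the range
edges = [20, 200, 2000, 20000, 200000]
counts, _ = np.histogram(samples, bins=edges)
fractions = counts / counts.sum()

# Each decade should hold roughly a quarter of the samples
print(fractions)
```

By contrast, a plain uniform distribution over the same range would put about 90% of the samples in the last decade alone.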

Q3. Try adding a transformer in the preparation pipeline to select only the most important attributes.

A3:

First we need a feature selector:

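import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    # argpartition moves the k largest values to the end without a full sort;
    # the final sort puts their indices back in column order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        # The importances are precomputed, so fit() only picks the top k indices
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X):
        # Keep only the selected columns (X must be a NumPy array)
        return X[:, self.feature_indices_]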

This feature selector assumes that you have already computed the feature importances somehow (for example, using a RandomForestRegressor). You may be tempted to compute them directly in TopFeatureSelector's fit() method, but this would likely slow down grid or randomized search, because the feature importances would have to be recomputed for every hyperparameter combination (unless you implement some sort of cache).
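To see what the selection does, here is a tiny sketch with made-up importances for five features. The values and the toy matrix X are illustrative only and do not come from the housing data.

```python
import numpy as np


def indices_of_top_k(arr, k):
    # Same helper as above: indices of the k largest values, in ascending order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])


# Made-up importances for five features (illustration only)
importances = [0.10, 0.50, 0.05, 0.30, 0.20]
top = indices_of_top_k(importances, 3)
print(top)  # [1 3 4]

# Toy data with two rows and five columns: [[0..4], [5..9]]
X = np.arange(10).reshape(2, 5)

# This is what TopFeatureSelector.transform() does with the fitted indices
X_selected = X[:, top]
print(X_selected)  # [[1 3 4], [6 8 9]]
```

The indices come back sorted by column position, not by importance, so the selected columns keep their original order.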

Secondly, we get the indices of the top k features:

k = 5

top_k_feature_indices = indices_of_top_k(feature_importances, k)
top_k_feature_indices

np.array(attributes)[top_k_feature_indices]

sorted(zip(feature_importances, attributes), reverse=True)[:k]

Then, we build a new pipeline that runs the previously defined preparation pipeline and adds top-k feature selection:

from sklearn.pipeline import Pipeline

preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k))
])

housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)

Lastly, we check the results (the two outputs should be identical):

housing_prepared_top_k_features[0:3]

housing_prepared[0:3, top_k_feature_indices]

Q4. Try creating a single pipeline that does the full data preparation plus the final prediction.

A4:

Firstly, we combine full_pipeline, TopFeatureSelector and SVR into a new pipeline:

prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('svm_reg', SVR(**rnd_search.best_params_))
])

prepare_select_and_predict_pipeline.fit(housing, housing_labels)

Then we can try this pipeline on a few instances:

some_data = housing.iloc[:4]
some_labels = housing_labels.iloc[:4]

print("Predictions:\t", prepare_select_and_predict_pipeline.predict(some_data))
print("Labels:\t\t", list(some_labels))

Q5. Automatically explore some preparation options using GridSearchCV.

A5:

Firstly, we run a GridSearchCV over the pipeline we built in Q4:

param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search_prep.fit(housing, housing_labels)

Then we check the best model:

grid_search_prep.best_params_
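The keys in param_grid use Scikit-Learn's double-underscore convention for nested parameters: each segment names a step, and the last segment names the parameter. The name 'preparation__num__imputer__strategy' therefore reads as step 'preparation', then transformer 'num', then step 'imputer', then parameter 'strategy'. This assumes full_pipeline has that structure, which is not shown here. The sketch below shows the same mechanism on a smaller stand-in pipeline whose step names are made up for this example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in for the real pipeline: a preparation pipeline nested inside an outer one
inner = Pipeline([('scaler', StandardScaler())])
outer = Pipeline([('preparation', inner), ('svm_reg', SVR())])

# get_params() lists every nested parameter under its double-underscore path
nested_names = sorted(name for name in outer.get_params()
                      if name.startswith('preparation__scaler__'))
print(nested_names)

# set_params() accepts the same paths; this is how GridSearchCV applies each combination
outer.set_params(preparation__scaler__with_mean=False, svm_reg__C=10.0)
print(outer.named_steps['preparation'].named_steps['scaler'].with_mean)
print(outer.named_steps['svm_reg'].C)
```

Printing sorted(pipeline.get_params().keys()) on your own pipeline is a quick way to find the exact names that can go into param_grid.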