Loading the classic MNIST handwritten-digit dataset with the interface provided by sklearn

import numpy as np
from sklearn.datasets import fetch_openml

mnist_data = fetch_openml("mnist_784", version=1, as_frame=False)   # as_frame=False returns NumPy arrays instead of a pandas DataFrame

X, y = mnist_data['data'], mnist_data['target']     # X.shape = (70000, 784)
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)
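
A quick sanity check (a minimal sketch): the slicing above reproduces the official 60000/10000 MNIST split, and the explicit dtype=float conversions are needed because fetch_openml returns the labels as strings.

print(X_train.shape, X_test.shape)   # (60000, 784) (10000, 784)
print(y[:5])                         # labels arrive as strings such as '5', '0', ...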

Classifying MNIST with the KNN algorithm

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train)   # Wall time: 1min 3s

%time knn_clf.score(X_test, y_test)   # Wall time: 15min 43s, score = 0.9688

Note 1:
Why does fit take so long for the KNN algorithm?
Because when the training set is fairly large, sklearn's KNN builds a tree structure (a KD-tree or ball tree) over the samples during fit to speed up later neighbor queries, rather than simply storing the raw data; constructing that index is what costs the time.
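
To see that the fit cost is index construction, you can force the storage strategy explicitly through the algorithm parameter (a minimal sketch; timings vary by machine):

from sklearn.neighbors import KNeighborsClassifier

# 'brute' skips index construction, so fit is nearly instant,
# but every later query must scan all 60000 training samples
knn_brute = KNeighborsClassifier(algorithm='brute')
%time knn_brute.fit(X_train, y_train)

# 'kd_tree' pays the tree-building cost up front to make queries cheaper
knn_tree = KNeighborsClassifier(algorithm='kd_tree')
%time knn_tree.fit(X_train, y_train)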

Note 2:
KNN normally requires the data to be normalized first; why is there no normalization here?
Because in this dataset every feature is a pixel intensity of the image, so all features already live on one common scale (0 to 255) and no normalization is needed.
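
For contrast, this is what the scaling step would look like when features do live on different scales; it is unnecessary for MNIST, as the note explains, so treat it as a sketch of the general recipe:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # fit mean/std on the training set only
X_test_std = scaler.transform(X_test)         # reuse the training statistics on the test set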

Note 3:
When the training set is very large, the KNN algorithm is extremely time-consuming.
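
The bottleneck sits on the prediction side, since every query is matched against the whole training set; timing a small slice of the test set makes the linear scaling visible (a rough sketch, absolute times depend on the machine):

%time knn_clf.score(X_test[:100], y_test[:100])   # roughly 1/100 of the full test-set time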

PCA + KNN + MNIST

from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)   # keep enough components to explain 90% of the variance
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)   # X_train_reduction.shape = (60000, pca.n_components_), far fewer than 784 columns
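
Before refitting KNN it is worth checking how aggressively PCA compressed the data (a minimal sketch; the exact component count depends on the dataset):

print(pca.n_components_)                     # number of components kept for 90% variance
print(pca.explained_variance_ratio_.sum())   # total variance retained, just above 0.9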

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train)   # Wall time: 15.6 s

X_test_reduction = pca.transform(X_test)
%time knn_clf.score(X_test_reduction, y_test)   # Wall time: 2min 20s, score = 0.9728

The running time drops, and at the same time the prediction accuracy actually goes up.
Besides reducing dimensionality, PCA also removes noise, which is why the reduced data can classify better than the raw pixels.
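
The denoising effect can be made visible by mapping the reduced data back to pixel space with inverse_transform and plotting a few digits side by side (a minimal sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# round-trip through the 90%-variance representation;
# fine-grained pixel noise is discarded along the way
X_restore = pca.inverse_transform(X_train_reduction)

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    src = X_train if i < 5 else X_restore   # top row: original, bottom row: reconstructed
    ax.imshow(src[i % 5].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()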