Load the classic MNIST handwritten digit dataset via the interface provided by sklearn
import numpy as np
from sklearn.datasets import fetch_openml
mnist_data = fetch_openml("mnist_784", as_frame=False)  # as_frame=False returns NumPy arrays instead of a DataFrame
X, y = mnist_data['data'], mnist_data['target']  # X.shape = (70000, 784)
# canonical MNIST split: first 60000 samples for training, last 10000 for testing
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)
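Each row of X is a flattened 28×28 grayscale image. As a quick sanity check on the load, a minimal matplotlib sketch (the display code is an addition here, not part of the original pipeline):
import matplotlib.pyplot as plt
some_digit = X_train[0].reshape(28, 28)  # 784 pixels back into a 28x28 image
plt.imshow(some_digit, cmap='gray')
plt.title(f"label = {int(y_train[0])}")
plt.show()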
Classifying MNIST with the KNN algorithm
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train) # Wall time: 1min 3s
%time knn_clf.score(X_test, y_test) # Wall time: 15min 43s, score = 0.9688
Note 1:
Why does fit take so long for KNN?
Because when the training set is large, sklearn's KNN builds a tree structure (a KD-tree or ball tree) over the data instead of simply storing the raw samples, and constructing that tree is where the fit time goes.
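The tree behavior can be controlled explicitly through KNeighborsClassifier's algorithm parameter; a minimal sketch of the available options:
from sklearn.neighbors import KNeighborsClassifier
# 'auto' (the default) picks a strategy based on the data;
# 'kd_tree' / 'ball_tree' build a tree at fit time, 'brute' just stores the raw samples
knn_tree = KNeighborsClassifier(algorithm='ball_tree')   # slow fit, tree-based queries
knn_brute = KNeighborsClassifier(algorithm='brute')      # near-instant fit, all cost moves to predict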
Note 2:
KNN usually requires the data to be normalized; why was no normalization done here?
Because every feature in these samples is a pixel intensity of the image, so all features already lie on the same 0-255 scale and normalization is unnecessary.
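For contrast, when features sit on very different scales (say, age in years versus income in dollars), the larger-scale feature dominates the distance. The usual fix is StandardScaler; a minimal sketch:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)                      # learn per-feature mean and std from the training set only
X_train_std = scaler.transform(X_train)  # each feature now has mean 0 and variance 1
X_test_std = scaler.transform(X_test)    # reuse the training-set statistics on the test set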
Note 3:
When the training set is very large, KNN is very time-consuming, since every prediction has to search the entire training set.
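One mitigation that does not change the predictions is parallelizing the neighbor search via the n_jobs parameter; -1 uses all CPU cores (whether the timings above were single-threaded is an assumption):
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=-1)  # parallel distance computation; same results, less wall time
knn_clf.fit(X_train, y_train)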
PCA + KNN + MNIST
from sklearn.decomposition import PCA
pca = PCA(0.9)  # keep the smallest number of components that explains 90% of the variance
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)  # X_train_reduction.shape = (60000, pca.n_components_), far fewer than 784
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train) # Wall time: 15.6 s
X_test_reduction = pca.transform(X_test)
%time knn_clf.score(X_test_reduction, y_test) # Wall time: 2min 20s, score = 0.9728
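To see how aggressive the reduction is, inspect the fitted PCA object (both attributes below are standard sklearn attributes):
print(pca.n_components_)                    # number of components kept to reach 90% variance
print(pca.explained_variance_ratio_.sum())  # just above 0.9 by construction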
While the running time drops, the prediction accuracy actually goes up.
PCA not only reduces dimensionality but also removes noise; the discarded variance is largely noise, which is why the classifier can do better in the reduced space.
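A quick way to visualize the denoising is to project a digit down and reconstruct it with inverse_transform (a standard PCA method); the plotting code here is a sketch:
import matplotlib.pyplot as plt
# reconstruct 784-dimensional images from the reduced representation;
# the reconstruction keeps the dominant strokes and smooths out pixel-level noise
X_train_restored = pca.inverse_transform(X_train_reduction)
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(X_train[0].reshape(28, 28), cmap='gray')           # original digit
ax2.imshow(X_train_restored[0].reshape(28, 28), cmap='gray')  # reconstructed, denoised digit
plt.show()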