Evaluating the Performance of a Machine Learning Algorithm

Improvement: separate the data into a training set and a test set (train test split).

This approach has problems of its own, which later sections will cover.

Preparing the iris data

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

X.shape is (150, 4) and y.shape is (150,).

train_test_split

Note 1: in this dataset the samples are ordered by label, with all the class-0 samples first, then class 1, then class 2. Simply taking the first portion of the data in order as the training set would therefore work poorly; the data must be shuffled first.
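
A quick check makes this ordering visible. The sketch below simply prints y and counts each class, assuming the X and y loaded above:

print(y)                                  # labels come out as 0...0, 1...1, 2...2
print(np.unique(y, return_counts=True))   # (array([0, 1, 2]), array([50, 50, 50]))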

Note 2: here X and y are stored as two separate arrays, so they cannot each be shuffled independently; the shuffle has to keep every sample aligned with its label. One way to do this is to shuffle a single array of indices and use it to index both X and y, as illustrated below.
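
As a minimal illustration of the idea with a tiny made-up array (X_demo and y_demo are hypothetical names, not part of the iris example), one permutation of the indices is generated and applied to both arrays, so each shuffled row still sits next to its own label:

import numpy as np

X_demo = np.arange(10).reshape(5, 2)        # 5 samples, 2 features
y_demo = np.array([0, 0, 1, 1, 1])          # one label per sample

perm = np.random.permutation(len(X_demo))   # a single permutation for both arrays
X_shuffled = X_demo[perm]
y_shuffled = y_demo[perm]                   # still aligned: y_shuffled[i] labels X_shuffled[i]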

Implementation in the notebook

shuffle_indexes = np.random.permutation(len(X))   # random permutation of the sample indices
test_ratio = 0.2
test_size = int(len(X) * test_ratio)

test_indexes = shuffle_indexes[:test_size]        # first 20% of the shuffled indices -> test set
train_indexes = shuffle_indexes[test_size:]       # remaining 80% -> training set

X_train = X[train_indexes]
y_train = y[train_indexes]
X_test = X[test_indexes]
y_test = y[test_indexes]
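
With test_ratio = 0.2 on the 150 iris samples, the split should leave 120 training samples and 30 test samples; a quick shape check on the variables created above confirms it (the shapes in the comments are the expected values, not captured output):

print(X_train.shape, y_train.shape)   # expected: (120, 4) (120,)
print(X_test.shape, y_test.shape)     # expected: (30, 4) (30,)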

Wrapping train_test_split into a function

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"

    if seed is not None:          # allow a fixed seed for reproducible splits
        np.random.seed(seed)

    shuffle_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test
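
A usage sketch of the wrapped function on the iris data, with an arbitrary seed chosen for reproducibility:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2, seed=666)

print(X_train.shape, X_test.shape)    # expected: (120, 4) (30, 4)
print(y_train.shape, y_test.shape)    # expected: (120,) (30,)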

Computing classification accuracy with kNN and train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

my_Knn_clf = kNNClassifier(k=3)        # kNNClassifier was implemented in the previous section
my_Knn_clf.fit(X_train, y_train)
y_predict = my_Knn_clf.predict(X_test)

accuracy = sum(y_predict == y_test) / len(y_test)   # fraction of test samples predicted correctly
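
The accuracy here is just the fraction of test samples predicted correctly. An equivalent and slightly more idiomatic NumPy form, sketched below under the assumption that y_predict and y_test come from the cell above, uses np.mean on the boolean comparison:

import numpy as np

accuracy = np.mean(y_predict == y_test)   # mean of a boolean array = fraction of True values
print(accuracy)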

train_test_split in sklearn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
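
For comparison, the whole experiment can also be run with scikit-learn's own kNN classifier and accuracy helper. This is a sketch, not part of the original notebook; n_neighbors=3 mirrors the k=3 used earlier:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
print(accuracy_score(y_test, y_predict))   # classification accuracy on the test set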