ROC:Receiver Operation Characteristic Curve
ROC曲线描述TPR和FPR之间的关系。

TPR = recall = TP / (TP + FN) # true positive rate FPR = FP / (FP + TN) # false positive rate

TPR和FPR的关系如图:

TPR和FRP呈现相一致的趋势

代码

回顾前面学过的代码

def TN(y_true, y_predict): assert len(y_true) == len(y_predict) return np.sum((y_true == 0) & (y_predict==0)) # 注意这里是一个‘&’ def FP(y_true, y_predict): assert len(y_true) == len(y_predict) return np.sum((y_true == 0) & (y_predict==1)) def FN(y_true, y_predict): assert len(y_true) == len(y_predict) return np.sum((y_true == 1) & (y_predict==0)) def TP(y_true, y_predict): assert len(y_true) == len(y_predict) return np.sum((y_true == 1) & (y_predict==1)) def confusion_matrix(y_true, y_predict): return np.array([ [TN(y_true, y_predict), FP(y_true, y_predict)], [FN(y_true, y_predict), TP(y_true, y_predict)] ]) def precision_score(y_true, y_predict): tp = TP(y_true, y_predict) fp = FP(y_true, y_predict) try: return tp / (tp + fp) except: # 处理分母为0的情况 return 0.0 def recall_score(y_true, y_predict): tp = TP(y_true, y_predict) fn = FN(y_true, y_predict) try: return tp / (tp + fn) except: return 0.0

TPR和FPR

def TPR(y_true, y_predict): tp = TP(y_true, y_predict) fn = FN(y_true, y_predict) try: return tp / (tp + fn) except: return 0.0 def FPR(y_true, y_predict): fp = FP(y_true, y_predict) tn = TN(y_true, y_predict) try: return fp / (fp + tn) except: return 0.0

加载测试数据

import numpy as np from sklearn import datasets digits = datasets.load_digits() X = digits.data y = digits.target.copy() y[digits.target==9] = 1 y[digits.target!=9] = 0 from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666) from sklearn.linear_model import LogisticRegression log_reg = LogisticRegression() log_reg.fit(X_train, y_train)

绘制TFP和FRP的曲线,即ROC

decision_scores = log_reg.decision_function(X_test) import matplotlib.pyplot as plt fprs = [] tprs = [] thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), step=0.1) for threshold in thresholds: y_predict = np.array(decision_scores >= threshold, dtype='int') fprs.append(FPR(y_test, y_predict)) tprs.append(TPR(y_test, y_predict)) plt.plot(fprs, tprs) plt.show()

sklearn中的ROC曲线

from sklearn.metrics import roc_curve fprs, tprs, thresholds = roc_curve(y_test, decision_scores) plt.plot(fprs, tprs) plt.show()

我们通常关注这条曲线下面的面积。面积越大,说明分类的效果越好。

ROC score

ROC score代码曲线下面的面积。
auc = area under curl

from sklearn.metrics import roc_auc_score roc_auc_score(y_test, decision_scores)

输出:0.9830452674897119

总结

对于有偏数据,观察它的精准率和召回率是非常有必要的。
但是ROC曲线对有偏数据并不敏感,它主要用于比较两个模型的孰优孰劣。

如果两根曲线分别代码两个模型的ROC曲线,在这种情况下我们会选择外面那根曲线对应模型。