ROC:Receiver Operation Characteristic Curve
ROC曲线描述TPR和FPR之间的关系。
TPR = recall = TP / (TP + FN) # true positive rate FPR = FP / (FP + TN) # false positive rate
TPR和FPR的关系如图:
TPR和FRP呈现相一致的趋势
代码
回顾前面学过的代码
def TN(y_true, y_predict):
assert len(y_true) == len(y_predict)
return np.sum((y_true == 0) & (y_predict==0)) # 注意这里是一个‘&’
def FP(y_true, y_predict):
assert len(y_true) == len(y_predict)
return np.sum((y_true == 0) & (y_predict==1))
def FN(y_true, y_predict):
assert len(y_true) == len(y_predict)
return np.sum((y_true == 1) & (y_predict==0))
def TP(y_true, y_predict):
assert len(y_true) == len(y_predict)
return np.sum((y_true == 1) & (y_predict==1))
def confusion_matrix(y_true, y_predict):
return np.array([
[TN(y_true, y_predict), FP(y_true, y_predict)],
[FN(y_true, y_predict), TP(y_true, y_predict)]
])
def precision_score(y_true, y_predict):
tp = TP(y_true, y_predict)
fp = FP(y_true, y_predict)
try:
return tp / (tp + fp)
except: # 处理分母为0的情况
return 0.0
def recall_score(y_true, y_predict):
tp = TP(y_true, y_predict)
fn = FN(y_true, y_predict)
try:
return tp / (tp + fn)
except:
return 0.0
TPR和FPR
def TPR(y_true, y_predict):
tp = TP(y_true, y_predict)
fn = FN(y_true, y_predict)
try:
return tp / (tp + fn)
except:
return 0.0
def FPR(y_true, y_predict):
fp = FP(y_true, y_predict)
tn = TN(y_true, y_predict)
try:
return fp / (fp + tn)
except:
return 0.0
加载测试数据
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
绘制TFP和FRP的曲线,即ROC
decision_scores = log_reg.decision_function(X_test)
import matplotlib.pyplot as plt
fprs = []
tprs = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), step=0.1)
for threshold in thresholds:
y_predict = np.array(decision_scores >= threshold, dtype='int')
fprs.append(FPR(y_test, y_predict))
tprs.append(TPR(y_test, y_predict))
plt.plot(fprs, tprs)
plt.show()
sklearn中的ROC曲线
from sklearn.metrics import roc_curve
fprs, tprs, thresholds = roc_curve(y_test, decision_scores)
plt.plot(fprs, tprs)
plt.show()
我们通常关注这条曲线下面的面积。面积越大,说明分类的效果越好。
ROC score
ROC score代码曲线下面的面积。
auc = area under curl
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, decision_scores)
输出:0.9830452674897119
总结
对于有偏数据,观察它的精准率和召回率是非常有必要的。
但是ROC曲线对有偏数据并不敏感,它主要用于比较两个模型的孰优孰劣。
如果两根曲线分别代码两个模型的ROC曲线,在这种情况下我们会选择外面那根曲线对应模型。