This interactive application is designed to help you explore and understand key evaluation metrics and visualizations for classification models, particularly in healthcare.
By experimenting with parameters and thresholds, you can observe how confusion matrices and derived metrics (such as sensitivity, specificity, and F1 score), as well as ROC and Precision–Recall curves, behave under varying conditions,
gaining deeper insight into model behavior and decision-making trade-offs.
Learning goals
- Analyze confusion matrix–derived metrics, including sensitivity (recall), specificity, precision (positive predictive value), accuracy, and F1 score.
- Explain the principles of Receiver Operating Characteristic (ROC) and Precision–Recall curves, and the Area Under the ROC Curve (AUC) and Area Under the Precision–Recall Curve (AUPRC) metrics.
- Evaluate how decision thresholds and model factors such as class balance, sample size, or baseline discriminative ability affect evaluation metrics.
- Compare the suitability of different measures, particularly ROC AUC and PR AUC, as performance indicators under varying conditions.
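The confusion-matrix metrics listed above can be sketched in a few lines of Python. This is an illustrative example, not the app's code; the counts are hypothetical.

```python
# Minimal sketch: metrics derived from a 2x2 confusion matrix.
# The counts below are hypothetical, chosen only for illustration.
def classification_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # recall, true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    precision = tp / (tp + fp)                  # positive predictive value (PPV)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

# Hypothetical screening result: 100 diseased, 900 healthy subjects.
m = classification_metrics(tp=80, fp=30, fn=20, tn=870)
print(round(m["sensitivity"], 3))  # 0.8
print(round(m["specificity"], 3))  # 0.967
print(round(m["accuracy"], 3))     # 0.95
```

Note how accuracy (0.95) can look reassuring even when sensitivity is noticeably lower, which is why the app reports the metrics separately.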
Instructions
- Select the 'Interactive App' tab and review the evaluation results under the starting conditions.
- Use the left panel to adjust the decision threshold and analyze how the metrics and the threshold point in the curves change. Compare the current threshold with Youden’s optimal point.
- Use the left panel to modify class balance, sample size, and baseline discriminative ability individually (moving these sliders will regenerate the data), and analyze the effect first on the confusion matrix and then on the metrics and curves.
- Try to simulate the behavior of a real screening test. Look for realistic performance values, consider class imbalance, and analyze the effects of varying parameters and decision thresholds.
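The threshold exploration in the instructions above can be mimicked outside the app. The sketch below sweeps a few candidate thresholds over hypothetical scores and picks the one maximizing Youden's J statistic (J = sensitivity + specificity − 1); the scores, labels, and thresholds are invented for illustration.

```python
# Sketch of a decision-threshold sweep using Youden's J statistic.
# Scores, labels, and candidate thresholds are hypothetical examples.
def youden_optimal_threshold(scores, labels, thresholds):
    """Return the threshold (and its J value) maximizing sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

scores = [0.10, 0.20, 0.35, 0.45, 0.40, 0.65, 0.80, 0.90]
labels = [0,    0,    0,    0,    1,    1,    1,    1]
t, j = youden_optimal_threshold(scores, labels, [0.3, 0.5, 0.7])
print(t, round(j, 2))  # 0.5 0.75
```

Lowering the threshold below the optimum trades specificity for sensitivity, and raising it does the opposite, which is exactly the trade-off the slider in the app exposes.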
Conclusions
- While ROC curves remain largely unaffected by class imbalance because they are based on normalized rates (sensitivity and specificity), Precision–Recall curves can provide a more informative picture in settings with imbalanced classes, such as screening tests. In these contexts, both specificity and precision (PPV) are critical, since even a highly specific test may yield a low PPV when disease prevalence is low.
- There is no single universally best threshold; the optimal cutoff depends on clinical context, such as the relative consequences of false negatives and false positives. For example, in cancer screening, sensitivity may be prioritized to avoid missing cases, whereas in confirmatory testing, specificity may be prioritized to reduce unnecessary follow-up procedures.
- AUC and AUPRC values summarize overall model discrimination, but they do not indicate how a test performs at a specific threshold. In practice, threshold-dependent metrics such as sensitivity, specificity, PPV, and NPV are often more relevant for decision-making in clinical workflows.
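The prevalence effect described in the first conclusion follows directly from Bayes' theorem: PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). A brief sketch, using illustrative sensitivity and specificity values of 0.90 and 0.95:

```python
# Sketch: PPV as a function of prevalence for a fixed test.
# Sensitivity/specificity values are hypothetical, for illustration only.
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(round(ppv(0.90, 0.95, 0.50), 3))  # balanced classes: 0.947
print(round(ppv(0.90, 0.95, 0.01), 3))  # 1% prevalence:   0.154
```

The same test that looks excellent at 50% prevalence yields a PPV of about 15% at 1% prevalence, illustrating why PR curves and PPV matter in low-prevalence screening.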
Self-assessment questions
- What does the Youden index represent in ROC analysis, and why might it be useful when choosing a threshold?
- Why can AUPRC be more informative than ROC AUC in low-prevalence (rare disease) settings?
- How does changing the decision threshold shift the balance between sensitivity (recall) and specificity?
- When only one class is present (class balance is set to 0 or 1), which evaluation metrics cannot be calculated, and why?
- Why do ROC and PR curves appear jagged or unstable when the sample size is small?
- Why can the Precision–Recall curve be non-monotonic, unlike the ROC curve?
- When training an AI risk prediction model, what factors could improve or worsen its discriminative ability?
© 2025 Carlos Sáez Silvestre. Department of Applied Physics.
School of Industrial Engineering, School of Informatics.
Universitat Politècnica de València, Spain.
carsaesi@upv.es