External benchmark scores across all research sessions, weighted as Novelty (40%), Mechanistic Validity (40%), and Falsifiability (20%). Each external score is compared against the self-assigned Process Score for the same session to measure calibration.
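As a rough illustration of how the weighting and the calibration comparison could be computed, here is a minimal sketch. Only the three weights come from the description above. The function names, the dictionary keys, the assumption that all scores share one numeric scale, and the use of a simple signed difference as the calibration measure are all hypothetical.

```python
# Weights taken from the description: 40% / 40% / 20%.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "novelty": 0.40,
    "mechanistic_validity": 0.40,
    "falsifiability": 0.20,
}


def external_score(scores):
    """Weighted sum of the three external benchmark components.

    Assumes every component is on the same numeric scale.
    Raises KeyError if a component is missing from `scores`.
    """
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def calibration_gap(process_score, scores):
    """Signed difference: self-assigned Process Score minus external score.

    Assumption for this sketch: a positive gap means the self-assessment
    was higher than the external benchmark, a negative gap means lower.
    """
    return process_score - external_score(scores)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    session = {"novelty": 80, "mechanistic_validity": 60, "falsifiability": 70}
    print(external_score(session))
    print(calibration_gap(85, session))
```

A signed gap keeps the direction of miscalibration visible. Averaging the absolute gap across sessions would be one possible way to summarize overall calibration, though the source does not say which statistic is actually used.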