Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces
Accuracy weights every answer equally, but deployed systems often act mainly on the outputs a model is most confident about. The Filtered Reasoning Score (FRS) evaluates reasoning quality on exactly those traces, conditioning the metric on the model's own confidence. The result is a view of reliability that accuracy alone cannot provide: two models with identical accuracy can behave very differently when only their most confident outputs are trusted, and FRS makes that difference measurable.
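To make the idea concrete, here is a minimal sketch of a confidence-filtered score. The text above does not specify the filtering rule or how reasoning quality is scored, so everything here is an assumption for illustration: the function name `filtered_reasoning_score`, the `keep_fraction` parameter (keep the top fraction of traces ranked by confidence), and the use of a plain mean over per-trace quality scores in [0, 1].

```python
import math
from typing import Sequence


def filtered_reasoning_score(
    confidences: Sequence[float],
    quality_scores: Sequence[float],
    keep_fraction: float = 0.1,
) -> float:
    """Mean reasoning-quality score over the most-confident traces.

    Hypothetical illustration: the top-`keep_fraction` rule and the plain
    mean are assumptions, not a definition taken from the source.
    """
    if len(confidences) != len(quality_scores):
        raise ValueError("confidences and quality_scores must have equal length")
    if not confidences:
        raise ValueError("at least one trace is required")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Number of traces to keep; always at least one.
    k = max(1, math.ceil(len(confidences) * keep_fraction))

    # Indices of traces, most confident first (ties keep their input order).
    ranked = sorted(
        range(len(confidences)), key=lambda i: confidences[i], reverse=True
    )
    top = ranked[:k]

    return sum(quality_scores[i] for i in top) / k


# Two toy models with the same average quality over all traces (0.5 each),
# but opposite behavior on the traces they are most confident about.
conf = [0.9, 0.8, 0.2, 0.1]
quality_a = [1.0, 1.0, 0.0, 0.0]  # good reasoning where confident
quality_b = [0.0, 0.0, 1.0, 1.0]  # good reasoning only where unsure

frs_a = filtered_reasoning_score(conf, quality_a, keep_fraction=0.5)  # 1.0
frs_b = filtered_reasoning_score(conf, quality_b, keep_fraction=0.5)  # 0.0
```

In this toy setup the two models are indistinguishable by their unfiltered average, while the filtered score separates them completely. A threshold on confidence instead of a top fraction would be an equally plausible filtering rule under these assumptions.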