Model Summary

| model | n | parse_success_rate | parse_success_ci95 | parse_fallback_rate | parse_failure_rate | avg_latency_ms | p95_latency_ms | center_selection_rate | center_selection_ci95 | instability_score | label_sensitivity_rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2.5:7b | 1344 | 99.93% | [99.78%, 100.00%] | 0.00% | 0.07% | 11668 | 14992 | 38.35% | [35.82%, 40.88%] | 0.628 | 49.55% |
| qwen3:8b | 1344 | 99.93% | [99.78%, 100.00%] | 0.00% | 0.07% | 9494 | 11139 | 30.68% | [28.29%, 33.06%] | 0.610 | 40.48% |
| gemma4:latest | 1248 | 99.92% | [99.76%, 100.00%] | 0.00% | 0.08% | 2471 | 2837 | 41.30% | [38.65%, 43.95%] | 0.558 | 42.12% |
| mistral:latest | 1248 | 99.60% | [99.28%, 99.92%] | 0.00% | 0.40% | 3036 | 3576 | 49.32% | [46.50%, 52.13%] | 0.593 | 52.58% |
| gemma3:4b | 1344 | 99.33% | [98.88%, 99.70%] | 0.00% | 0.67% | 5964 | 6831 | 40.97% | [38.50%, 43.45%] | 0.613 | 50.30% |
| phi4-mini:3.8b | 1344 | 96.58% | [95.61%, 97.54%] | 2.38% | 1.04% | 2189 | 2717 | 37.22% | [34.73%, 40.00%] | 0.619 | 41.52% |
| llama3.2:3b | 1344 | 92.93% | [91.67%, 94.27%] | 0.00% | 7.07% | 1572 | 1993 | 41.15% | [38.59%, 43.96%] | 0.625 | 52.68% |
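
As a quick sanity check on the table above, the three parse columns can be tested for internal consistency. The sketch below assumes that parse_success_rate, parse_fallback_rate and parse_failure_rate partition all n trials for a model. The report does not state this, although the figures are consistent with it. The helper names `check_partition` and `implied_counts` are hypothetical, and the counts they return are approximations recovered from rounded percentages, not values taken from the report.

```python
# Parse-outcome figures copied from the Model Summary table.
# Tuple layout: (n, success %, fallback %, failure %)
ROWS = {
    "qwen2.5:7b":     (1344, 99.93, 0.00, 0.07),
    "qwen3:8b":       (1344, 99.93, 0.00, 0.07),
    "gemma4:latest":  (1248, 99.92, 0.00, 0.08),
    "mistral:latest": (1248, 99.60, 0.00, 0.40),
    "gemma3:4b":      (1344, 99.33, 0.00, 0.67),
    "phi4-mini:3.8b": (1344, 96.58, 2.38, 1.04),
    "llama3.2:3b":    (1344, 92.93, 0.00, 7.07),
}


def check_partition(rows, tol=0.02):
    """Return the models whose three parse rates do not sum to 100% within tol.

    Assumption: success, fallback and failure are mutually exclusive and
    exhaustive outcomes. The tolerance absorbs two-decimal rounding.
    """
    return [
        model
        for model, (_, success, fallback, failure) in rows.items()
        if abs(success + fallback + failure - 100.0) > tol
    ]


def implied_counts(rows):
    """Approximate (success, fallback, failure) trial counts per model.

    Each count is n * rate / 100 rounded to the nearest integer, so it is
    only an estimate of the underlying tally.
    """
    return {
        model: tuple(round(n * rate / 100) for rate in rates)
        for model, (n, *rates) in rows.items()
    }


if __name__ == "__main__":
    print("inconsistent rows:", check_partition(ROWS))
    for model, counts in implied_counts(ROWS).items():
        print(model, counts)
```

Under that assumption every row sums to 100%, and the implied counts add back up to each model's n.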