AI Model Performance Metrics 2025
Report type: QA evaluation summary
Date: 2025-11-30
Summary
- Internal evaluations show 95%+ accuracy on curated customer-support and medical FAQ sets.
- Accuracy measured using blinded grading and consensus review.
- Report documents the datasets, grading rubric, and QA workflow used for scoring.
- Results are summarized per dataset and rolled up into an overall figure for reporting (see the rollup sketch after this list).
- Findings inform QA thresholds and release readiness reviews.
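The following is a minimal sketch of the per-dataset rollup, assuming overall accuracy is a sample-weighted average of per-dataset accuracy; the dataset names and counts are illustrative placeholders, not figures from this report.

    # Minimal rollup sketch (illustrative names and counts only; real
    # figures come from the QA evaluation harness, not this example).
    datasets = {
        "customer_support_faq": {"correct": 960, "graded": 1000},
        "medical_faq":          {"correct": 475, "graded": 500},
    }

    # Per-dataset accuracy, reported individually.
    per_dataset = {
        name: counts["correct"] / counts["graded"]
        for name, counts in datasets.items()
    }

    # Overall accuracy rolled up as a sample-weighted average
    # (assumption; the report does not define the exact weighting).
    total_correct = sum(c["correct"] for c in datasets.values())
    total_graded = sum(c["graded"] for c in datasets.values())
    overall = total_correct / total_graded

    for name, acc in per_dataset.items():
        print(f"{name}: {acc:.1%}")
    print(f"overall: {overall:.1%}")

A simple unweighted average would instead weight each dataset equally; whichever rollup is used for official reporting should follow the QA evaluation protocol.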
Key Data
- 95%+ accuracy on curated customer-support and medical FAQ sets.
- Blinded grading with consensus review for final scores (a consensus sketch follows this list).
- Dataset summaries include customer-support intents and medical FAQ categories.
- Evaluation references include rubric definitions and review workflow notes.
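The snippet below is a minimal sketch of how blinded grades could feed consensus review, assuming consensus means majority agreement among graders; the function name, grade labels, and review flag are hypothetical and are not taken from the rubric.

    from collections import Counter

    def consensus_grade(blinded_grades):
        """Return the majority label, or flag the item for consensus review.

        blinded_grades: labels from graders who do not see model identity
        or each other's scores (hypothetical labels for illustration).
        """
        counts = Counter(blinded_grades)
        label, votes = counts.most_common(1)[0]
        if votes > len(blinded_grades) / 2:
            return label
        return "needs_consensus_review"

    # Example: three blinded graders score one response.
    print(consensus_grade(["correct", "correct", "incorrect"]))   # -> correct
    print(consensus_grade(["correct", "incorrect", "partial"]))   # -> needs_consensus_review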
Sources
- Chat Data QA evaluation protocol 2025.
- Blinded grading rubric and consensus review guide (internal).
- Chat Data internal QA evaluation harness.
- NIST AI Risk Management Framework (AI RMF 1.0).