Benchmark and scoring methodology
Evidence for local setup decisions.
InferGrade measures specific model, quant, runtime, hardware, and benchmark slices so users can choose what to run next. It keeps capability, deployment fitness, quant fidelity, provenance, and claim boundaries separate.
What InferGrade measures
A result is not a generic model review. It is evidence for a declared setup on declared hardware under declared checks.
| Surface | Examples | What it can support |
|---|---|---|
| Deployment fitness | TTFT, output tokens/sec, load time, memory, runtime stability. | Whether a setup fits and behaves acceptably on the recorded machine. |
| Assistant capability | Instruction following, structured output, short conversational retention. | Scoped assistant behavior under the selected benchmark lane and scoring policy. |
| Coding capability | Static repair tasks, EvalPlus HumanEval+, EvalPlus MBPP+ when intentionally selected. | Scoped code generation or repair evidence, not broad agentic software engineering proof. |
| Reasoning capability | Exact-answer checks and sampled MMLU-Pro reference evidence. | Scoped reasoning or knowledge evidence for the declared prompt set. |
| Quant fidelity | Perplexity reference checks with a pinned corpus and protocol. | Same-family quant comparison only when family, checkpoint, tokenizer, corpus, and protocol match. |
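
The quant fidelity row implies a comparability check before two perplexity results are compared. A minimal sketch of that check follows, assuming a hypothetical `PerplexityRun` record; the field names simply mirror the match conditions listed in the table and are not InferGrade's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PerplexityRun:
    # Hypothetical record; field names are assumptions for illustration.
    family: str
    checkpoint: str
    tokenizer: str
    corpus: str
    protocol: str
    quant: str
    perplexity: float


# Fields that must match before two quants are compared (from the table above).
MATCH_FIELDS = ("family", "checkpoint", "tokenizer", "corpus", "protocol")


def mismatched_fields(a: PerplexityRun, b: PerplexityRun) -> List[str]:
    """List the match fields on which the two runs differ."""
    return [name for name in MATCH_FIELDS if getattr(a, name) != getattr(b, name)]


def perplexity_delta(baseline: PerplexityRun, candidate: PerplexityRun) -> Optional[float]:
    """Return candidate minus baseline perplexity, or None when not comparable."""
    if mismatched_fields(baseline, candidate):
        return None
    return candidate.perplexity - baseline.perplexity
```

Returning `None` rather than a number keeps a not-comparable pair from being read as a measured difference.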
Evidence lanes
Lanes describe how much benchmark support a result carries. A high score inside a thin lane does not promote the result into a stronger lane.
| Lane | Role | Claim boundary |
|---|---|---|
| Smoke | Minimal setup execution check. | Shows that a setup can run. It does not support capability conclusions. |
| Decision | Small local sample for one use case on one machine. | Can help choose a setup inside the scoped recommendation slice. |
| Reference | Deeper lane with pinned fixtures or dataset revisions, preserved artifacts, and reviewed scoring policy. | Supports stronger scoped comparisons, still not a broad leaderboard claim. |
| Gold | Reserved for stronger dataset, sandbox, provenance, access, and maintainer-review controls. | Only appears after the stronger controls exist for that lane. |
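
One way to picture the lane rule is to treat the lane as a property of how a run was produced, never of the score it earned. The sketch below assumes the lanes can be ordered Smoke < Decision < Reference < Gold and uses a hypothetical `LaneResult` type with a normalized score; both are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Lane(IntEnum):
    # Assumed ordering, weakest to strongest, based on the table above.
    SMOKE = 0
    DECISION = 1
    REFERENCE = 2
    GOLD = 3


@dataclass(frozen=True)
class LaneResult:
    lane: Lane
    score: float  # hypothetical normalized score in [0, 1]


def supports_capability_claim(result: LaneResult) -> bool:
    """Smoke only shows that a setup can run, so it never supports capability."""
    return result.lane > Lane.SMOKE


def meets_lane(result: LaneResult, required: Lane) -> bool:
    """Check the lane only; the score is deliberately ignored."""
    return result.lane >= required
```

Under these assumptions, a Smoke result with a near-perfect score still fails `meets_lane(result, Lane.DECISION)`.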
Capability scoring
- Scores are scoped to the benchmark lane, task family, scoring policy, model artifact, runtime, hardware, and generation preset that produced them.
- Deployment fitness remains separate from capability. A fast setup is not automatically more capable, and a capable setup may still be impractical on a given machine.
- Quant fidelity remains separate from assistant, coding, and reasoning capability. Perplexity evidence is mainly useful for same-family quant comparison.
- Failed, partial, skipped, not-yet-benchmarked, and not-comparable states are preserved instead of silently becoming low scores.
- InferGrade does not combine all surfaces into a single overall model rating.
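
The scoring rules above can be sketched as a small data model in which non-completed states are stored as states rather than as zeros, and surfaces are reported side by side without an aggregate. The type names, the status strings, and the choice that only completed results carry a score are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Union


class Status(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIAL = "partial"
    SKIPPED = "skipped"
    NOT_YET_BENCHMARKED = "not_yet_benchmarked"
    NOT_COMPARABLE = "not_comparable"


@dataclass(frozen=True)
class SurfaceResult:
    surface: str  # e.g. "coding", "reasoning", "deployment"
    status: Status
    score: Optional[float] = None

    def __post_init__(self) -> None:
        # Assumption for this sketch: only completed results carry a score,
        # so a failure can never masquerade as a low score.
        if self.status is Status.COMPLETED and self.score is None:
            raise ValueError("completed result needs a score")
        if self.status is not Status.COMPLETED and self.score is not None:
            raise ValueError("non-completed result must not carry a score")


def summarize(results: List[SurfaceResult]) -> Dict[str, Union[float, str]]:
    """Report each surface separately; no overall rating is computed."""
    return {
        r.surface: (r.score if r.status is Status.COMPLETED else r.status.value)
        for r in results
    }
```

Because `summarize` returns one entry per surface, a reader sees `"failed"` next to the reasoning surface instead of a misleading `0.0`.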
Current benchmark map
| Use case | Current checks | Status |
|---|---|---|
| Chat and instruction following | IFEval, multi-turn chat memory, reasoning exact answer, sampled MMLU-Pro reference, deployment chat/batch/long-context telemetry. | Local decision checks plus selected reference lanes. |
| Coding and code editing | Coding static repair, EvalPlus HumanEval+, EvalPlus MBPP+, deployment telemetry. | Decision and executable reference evidence. Not SWE-bench or repo-edit proof. |
| Quant fidelity | Perplexity reference with a pinned InferGrade corpus and protocol. | Reference evidence for same-family quant comparison only. |
| Planned heavier lanes | GPQA, LiveCodeBench, SWE-bench Verified. | Not rendered as default runnable checks until harness, scoring, data, sandbox, and claim gates are satisfied. |
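
The gating of planned lanes can be read as an all-gates-must-pass check. The sketch below assumes each gate is tracked as a boolean under the hypothetical keys shown; how InferGrade records gate status is not specified here.

```python
from typing import Dict, List

# Gate names taken from the table row above; the key spellings are assumptions.
REQUIRED_GATES = ("harness", "scoring", "data", "sandbox", "claim")


def missing_gates(gates: Dict[str, bool]) -> List[str]:
    """Return the gates that are absent or not yet satisfied."""
    return [name for name in REQUIRED_GATES if not gates.get(name, False)]


def is_default_runnable(gates: Dict[str, bool]) -> bool:
    """A planned lane becomes a default runnable check only when no gate is missing."""
    return not missing_gates(gates)
```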
Recommendation methodology
- Start from the user's task, hardware, runtime, memory, trust, and benchmark scope.
- Prefer exact or closely comparable evidence before broader family evidence.
- Rank decision-grade candidates ahead of informational evidence when the candidate clears the comparison floor.
- Treat agent dogfood and native first-run evidence as labeled context only, unless it clears the same comparability and evidence gates as other results.
- Show what the current evidence supports, what it does not support, and the next benchmark that would most improve confidence.
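
The ordering rules in this list can be sketched as a sort key. The `Candidate` type, the three match levels, and the precedence of the comparison floor over match closeness are assumptions chosen for illustration, not the actual ranking implementation.

```python
from dataclasses import dataclass
from typing import List

# Assumed match levels: exact setup, closely comparable setup, broader family.
MATCH_ORDER = {"exact": 0, "close": 1, "family": 2}


@dataclass(frozen=True)
class Candidate:
    name: str
    match: str          # one of the MATCH_ORDER keys
    clears_floor: bool  # True when the evidence clears the comparison floor
    score: float        # scoped score within the user's benchmark slice


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates: floor-clearing first, then closer matches, then score."""
    return sorted(
        candidates,
        key=lambda c: (not c.clears_floor, MATCH_ORDER[c.match], -c.score),
    )
```

In this sketch, a candidate that does not clear the floor stays behind every decision-grade candidate, even if its raw score is higher.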
Agent dogfood runs are real public InferGrade development runs. They help stress-test recommendations, plots, tables, and runner paths. They do not imply broad community consensus or an official ranking.
Evidence labels users will see
- Evidence that clears the current comparison floor for a scoped recommendation.
- Evidence from a deeper lane with stronger benchmark controls and preserved artifacts.
- Public development runs from labeled agent runners. Useful context, not broad validation.
- Stored evidence that can guide inspection but does not currently drive the main ranking.
- Local telemetry that shows a setup executed and uploaded. It is not enough by itself for a capability claim.
- Execution truth that explains a gap or failure without pretending it is a completed low score.
How user runs improve the system
- A paired Runner emits provenance-rich result bundles with model, quant, runtime, hardware, task, scoring, and artifact metadata.
- Published results add evidence for exact setup matches and nearby family or hardware comparisons.
- Repeated and reference-lane runs can close thin-evidence gaps for future users.
- Private results remain private unless the owner publishes them.
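
As a rough illustration of the points above, a result bundle can be modeled as a record that carries the listed metadata and is private by default. The `ResultBundle` type, its field names, and the `published` flag are hypothetical; the real bundle format is not described in this section.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class ResultBundle:
    # Metadata categories from the list above; field names are assumptions.
    model: str
    quant: str
    runtime: str
    hardware: str
    task: str
    scoring: str
    artifacts: Dict[str, str] = field(default_factory=dict)
    published: bool = False  # private unless the owner publishes


def publish(bundle: ResultBundle) -> ResultBundle:
    """Return a published copy; the original bundle is left unchanged."""
    return replace(bundle, published=True)


def public_bundles(bundles: List[ResultBundle]) -> List[ResultBundle]:
    """Only published bundles contribute evidence for other users."""
    return [b for b in bundles if b.published]


def to_json(bundle: ResultBundle) -> str:
    """Serialize the bundle with stable key order for storage or upload."""
    return json.dumps(asdict(bundle), sort_keys=True)
```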