Benchmark and scoring methodology

Evidence for local setup decisions.

InferGrade measures specific slices, each a combination of model, quant, runtime, hardware, and benchmark, so users can choose what to run next. It keeps capability, deployment fitness, quant fidelity, provenance, and claim boundaries separate.

What InferGrade measures

A result is not a generic model review. It is evidence for a declared setup on declared hardware under declared checks.

Surface | Examples | What it can support
Deployment fitness | TTFT, output tokens/sec, load time, memory, runtime stability. | Whether a setup fits and behaves acceptably on the recorded machine.
Assistant capability | Instruction following, structured output, short conversational retention. | Scoped assistant behavior under the selected benchmark lane and scoring policy.
Coding capability | Static repair tasks, EvalPlus HumanEval+, EvalPlus MBPP+ when intentionally selected. | Scoped code generation or repair evidence, not broad agentic software engineering proof.
Reasoning capability | Exact-answer checks and sampled MMLU-Pro reference evidence. | Scoped reasoning or knowledge evidence for the declared prompt set.
Quant fidelity | Perplexity reference checks with a pinned corpus and protocol. | Same-family quant comparison only when family, checkpoint, tokenizer, corpus, and protocol match.
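
The quant fidelity boundary can be read as a gate: two perplexity results are compared only when family, checkpoint, tokenizer, corpus, and protocol all match. The sketch below is illustrative only. `PerplexityRun`, its field names, and both helper functions are assumptions made for this example, not InferGrade's actual schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PerplexityRun:
    """Hypothetical record of one perplexity reference check."""
    family: str
    checkpoint: str
    tokenizer: str
    corpus: str
    protocol: str
    quant: str
    perplexity: float


# The five fields that must match before two quants are compared.
MATCH_FIELDS = ("family", "checkpoint", "tokenizer", "corpus", "protocol")


def mismatched_fields(a: PerplexityRun, b: PerplexityRun) -> List[str]:
    """Return the names of the fields that block a same-family comparison."""
    return [name for name in MATCH_FIELDS if getattr(a, name) != getattr(b, name)]


def perplexity_delta(baseline: PerplexityRun, candidate: PerplexityRun) -> Optional[float]:
    """Return candidate minus baseline perplexity, or None when not comparable."""
    if mismatched_fields(baseline, candidate):
        return None  # refuse to compare instead of reporting a misleading number
    return candidate.perplexity - baseline.perplexity
```

Returning `None` rather than a number keeps a non-comparable pair visibly distinct from a comparable pair with a small delta.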

Evidence lanes

Lanes describe how much benchmark support a result carries. A high score inside a thin lane does not promote the result into a stronger lane.

Lane | Role | Claim boundary
Smoke | Minimal setup execution check. | Shows that a setup can run. It does not support capability conclusions.
Decision | Small local sample for one use case on one machine. | Can help choose a setup inside the scoped recommendation slice.
Reference | Deeper lane with pinned fixtures or dataset revisions, preserved artifacts, and reviewed scoring policy. | Supports stronger scoped comparisons, still not a broad leaderboard claim.
Gold | Reserved for stronger dataset, sandbox, provenance, access, and maintainer-review controls. | Only appears after the stronger controls exist for that lane.
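
One way to picture the no-promotion rule is to make the lane a property of the result that the score cannot change. This is a minimal sketch under stated assumptions: the `Lane` ordering, the `LaneResult` type, and the normalized `score` field are hypothetical and chosen only to illustrate the rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Lane(IntEnum):
    """Lanes ordered from thinnest to strongest support (assumed ordering)."""
    SMOKE = 0
    DECISION = 1
    REFERENCE = 2
    GOLD = 3


@dataclass(frozen=True)
class LaneResult:
    lane: Lane
    score: float  # hypothetical normalized score in [0, 1]


def supports_capability_claim(result: LaneResult) -> bool:
    """Smoke only shows that a setup can run, whatever the score."""
    return result.lane > Lane.SMOKE


def meets_lane(result: LaneResult, required: Lane) -> bool:
    """Check the lane alone; the score is deliberately ignored."""
    return result.lane >= required
```

With these definitions, a Smoke result scoring 0.99 still fails `meets_lane(result, Lane.DECISION)`, while a Reference result scoring 0.40 passes it.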

Capability scoring

Current benchmark map

Use case | Current checks | Status
Chat and instruction following | IFEval, multi-turn chat memory, reasoning exact answer, sampled MMLU-Pro reference, deployment chat/batch/long-context telemetry. | Local decision checks plus selected reference lanes.
Coding and code editing | Coding static repair, EvalPlus HumanEval+, EvalPlus MBPP+, deployment telemetry. | Decision and executable reference evidence. Not SWE-bench or repo-edit proof.
Quant fidelity | Perplexity reference with a pinned InferGrade corpus and protocol. | Reference evidence for same-family quant comparison only.
Planned heavier lanes | GPQA, LiveCodeBench, SWE-bench Verified. | Not rendered as default runnable checks until harness, scoring, data, sandbox, and claim gates are satisfied.

Recommendation methodology

  1. Start from the user's task, hardware, runtime, memory, trust, and benchmark scope.
  2. Prefer exact or closely comparable evidence before broader family evidence.
  3. Rank decision-grade candidates ahead of informational evidence when the candidate clears the comparison floor.
  4. Use agent dogfood and native first-run evidence as labeled context unless it clears the same comparability and evidence gates as other results.
  5. Show what the current evidence supports, what it does not support, and the next benchmark that would most improve confidence.
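
Steps 2 through 4 can be sketched as a labeling function plus a sort key. Everything here is an assumption for illustration: the `Candidate` fields, the match levels, the label strings, and in particular the tie-break order (decision-grade first, then closeness of match, then score) are not taken from InferGrade's implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical match levels for step 2, best first.
MATCH_ORDER = {"exact": 0, "close": 1, "family": 2}


@dataclass(frozen=True)
class Candidate:
    name: str
    match: str          # "exact", "close", or "family"
    clears_floor: bool  # whether the evidence clears the comparison floor
    source: str         # "user", "agent_dogfood", or "native_first_run"
    score: float        # hypothetical scoped score, higher is better


def label(c: Candidate) -> str:
    """Assign an evidence label; dogfood and first-run stay context unless gated in."""
    if c.clears_floor:
        return "decision-grade"
    if c.source in ("agent_dogfood", "native_first_run"):
        return "context"
    return "informational"


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Sort decision-grade first, then closer matches, then higher scores."""
    def key(c: Candidate):
        return (label(c) != "decision-grade", MATCH_ORDER[c.match], -c.score)
    return sorted(candidates, key=key)
```

In this sketch an exact-match dogfood run that misses the floor ranks below a family-level candidate that clears it, which mirrors step 3 taking priority over a raw score.
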

Dogfood evidence is useful, but bounded.

Agent dogfood runs are real public InferGrade development runs. They help stress-test recommendations, plots, tables, and runner paths. They do not imply broad community consensus or an official ranking.

Evidence labels users will see

Decision-grade evidence

Evidence that clears the current comparison floor for a scoped recommendation.

Reference evidence

Evidence from a deeper lane with stronger benchmark controls and preserved artifacts.

Agent dogfood evidence

Public development runs from labeled agent runners. Useful context, not broad validation.

Informational evidence

Stored evidence that can guide inspection but does not currently drive the main ranking.

Native first-run evidence

Local telemetry that shows a setup executed and uploaded. It is not enough by itself for a capability claim.

Failed, partial, or missing

A record of what actually happened during execution. It explains a gap or failure without presenting it as a completed low score.
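
The distinction between a failed run and a low score can be made concrete by keeping status separate from score, so that an aggregate never counts a failure as zero. This is a minimal sketch; `RunOutcome`, its status strings, and both helpers are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RunOutcome:
    status: str             # "completed", "failed", "partial", or "missing"
    score: Optional[float]  # set only when the run completed
    note: str = ""          # short explanation of the gap or failure


def mean_completed_score(outcomes: List[RunOutcome]) -> Optional[float]:
    """Average completed runs only; failures are not counted as zeros."""
    scores = [o.score for o in outcomes
              if o.status == "completed" and o.score is not None]
    return sum(scores) / len(scores) if scores else None


def gaps(outcomes: List[RunOutcome]) -> List[Tuple[str, str]]:
    """List the non-completed runs so they can be shown next to the score."""
    return [(o.status, o.note) for o in outcomes if o.status != "completed"]
```

Reporting `gaps` next to the mean keeps the failure visible without letting it lower the capability number.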

How user runs improve the system