InferGrade Methodology

What InferGrade measures

A result is not a generic model review. It is evidence for a declared setup on declared hardware under declared checks.

Surface	Examples	What it can support
Deployment fitness	Time to first token (TTFT), output tokens/sec, load time, memory, runtime stability.	Whether a setup fits and behaves acceptably on the recorded machine.
Assistant capability	Instruction following, structured output, short conversational retention.	Scoped assistant behavior under the selected benchmark lane and scoring policy.
Coding capability	Static repair tasks, EvalPlus HumanEval+, EvalPlus MBPP+ when intentionally selected.	Scoped code generation or repair evidence, not broad agentic software engineering proof.
Reasoning capability	Exact-answer checks and sampled MMLU-Pro reference evidence.	Scoped reasoning or knowledge evidence for the declared prompt set.
Quant fidelity	Perplexity reference checks with a pinned corpus and protocol.	Same-family quant comparison only when family, checkpoint, tokenizer, corpus, and protocol match.
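The quant fidelity rule in the last row can be read as a comparability gate: two perplexity runs are compared only if every listed field matches. The following is a minimal sketch under that reading. The `PerplexityRun` record, its field names, and the `perplexity_delta` helper are hypothetical and are not taken from InferGrade's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerplexityRun:
    # Hypothetical record; field names are assumptions for illustration.
    family: str
    checkpoint: str
    tokenizer: str
    corpus: str
    protocol: str
    quant: str
    perplexity: float


# The five fields the methodology says must match before comparing quants.
MATCH_FIELDS = ("family", "checkpoint", "tokenizer", "corpus", "protocol")


def comparable(a: PerplexityRun, b: PerplexityRun) -> bool:
    """True only when every required field is identical."""
    return all(getattr(a, f) == getattr(b, f) for f in MATCH_FIELDS)


def perplexity_delta(a: PerplexityRun, b: PerplexityRun) -> Optional[float]:
    """Return b.perplexity - a.perplexity, or None if the runs are not comparable."""
    if not comparable(a, b):
        return None
    return b.perplexity - a.perplexity
```

Returning `None` rather than a number keeps a mismatched pair visibly not comparable instead of producing a misleading difference.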

Evidence lanes

Lanes describe how much benchmark support a result carries. A high score inside a thin lane does not promote the result into a stronger lane.

Lane	Role	Claim boundary
Smoke	Minimal setup execution check.	Shows that a setup can run. It does not support capability conclusions.
Decision	Small local sample for one use case on one machine.	Can help choose a setup inside the scoped recommendation slice.
Reference	Deeper lane with pinned fixtures or dataset revisions, preserved artifacts, and reviewed scoring policy.	Supports stronger scoped comparisons, still not a broad leaderboard claim.
Gold	Reserved for stronger dataset, sandbox, provenance, access, and maintainer-review controls.	Only appears after the stronger controls exist for that lane.

Capability scoring

InferGrade publishes simple 0–100 local scores, one per named task surface, and never a single global intelligence score. A score stays unpublished until at least 50% of its weighted benchmark coverage is present.

Scores are scoped to the benchmark lane, task family, scoring policy, model artifact, runtime, hardware, and generation preset that produced them.
Deployment fitness remains separate from capability. A fast setup is not automatically more capable, and a capable setup may still be impractical on a given machine.
Quant fidelity remains separate from assistant, coding, and reasoning capability. Perplexity evidence is mainly useful for same-family quant comparison.
Failed, partial, skipped, not-yet-benchmarked, and not-comparable states are preserved instead of silently becoming low scores.
InferGrade does not combine all surfaces into a single overall model rating.
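One way to keep non-completed states from collapsing into low scores is to carry the state as its own value next to the score. The sketch below assumes that design. The `ResultState` enum, the `display` helper, and the `NN/100` output format are hypothetical illustrations, not InferGrade's actual data model.

```python
from enum import Enum
from typing import Optional


class ResultState(Enum):
    # State names follow the list above; the enum itself is an assumption.
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIAL = "partial"
    SKIPPED = "skipped"
    NOT_YET_BENCHMARKED = "not-yet-benchmarked"
    NOT_COMPARABLE = "not-comparable"


def display(state: ResultState, score: Optional[float] = None) -> str:
    """Render a score only for completed results; otherwise show the state."""
    if state is ResultState.COMPLETED and score is not None:
        return f"{score:.0f}/100"
    # Any other state is shown by name, never as a number.
    return state.value
```

With this shape, a partial run that happened to record a number is still rendered as `partial`, so it cannot be mistaken for a completed low score.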

Score	Version 1 components	What it means
Local assistant score	IFEval 75% · multi-turn memory 25%	Instruction following and conversational retention for the measured local setup.
Local coding score	HumanEval+ 55% · MBPP+ 30% · static repair 15%	Scoped executable and static coding evidence, not repository-level agent performance.
Local reasoning score	MMLU-Pro 80% · exact-answer check 20%	Scoped sampled reasoning evidence, not a general intelligence claim.
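The weights in the table and the 50% coverage rule can be sketched as a small function. The weights are taken from the table. Everything else is an assumption made for illustration: the component keys, the 0–100 scale of each component result, and in particular the renormalization over present components when coverage is partial, which this document does not specify.

```python
from typing import Dict, Optional

# Version 1 weights from the table above, expressed as fractions.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "assistant": {"ifeval": 0.75, "multi_turn_memory": 0.25},
    "coding": {"humaneval_plus": 0.55, "mbpp_plus": 0.30, "static_repair": 0.15},
    "reasoning": {"mmlu_pro": 0.80, "exact_answer": 0.20},
}
COVERAGE_FLOOR = 0.50  # publish only at or above 50% weighted coverage


def local_score(surface: str, results: Dict[str, Optional[float]]) -> Optional[float]:
    """Return a 0-100 score for one surface, or None if coverage is too thin.

    `results` maps component name to a 0-100 result (None means missing).
    """
    weights = WEIGHTS[surface]
    present = {k: v for k, v in results.items() if k in weights and v is not None}
    coverage = sum(weights[k] for k in present)
    if coverage < COVERAGE_FLOOR:
        return None  # stays unpublished
    # ASSUMPTION: partial coverage is renormalized over present components.
    weighted = sum(weights[k] * present[k] for k in present)
    return round(weighted / coverage, 1)
```

For example, a coding result with only static repair present has 15% coverage and returns `None`, while an assistant result with only IFEval present has 75% coverage and is scored on that component alone under the renormalization assumption.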

Speed, output length, and time per task

Decode tokens per second shows generation throughput on the recorded hardware and runtime.
Output tokens per task keeps terse and verbose completions visible instead of rewarding throughput alone.
Median time per task combines the two in practice: how long the measured setup took to produce a complete answer.
InferGrade only shows these task measurements when the backend reported timing or token counts; missing telemetry is not estimated.
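The three measurements above can be sketched as a summary over per-task telemetry that reports nothing when the inputs are missing. The `TaskRecord` fields and the choice of the median as the aggregate for throughput and output length are assumptions for illustration; only the median time per task is named in the text.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    # Hypothetical per-task telemetry; None means the backend did not report it.
    output_tokens: Optional[int]
    decode_seconds: Optional[float]
    total_seconds: Optional[float]


def summarize(records: List[TaskRecord]) -> Dict[str, Optional[float]]:
    """Summarize task telemetry without estimating missing values."""
    # Throughput needs both a token count and a non-zero decode time.
    tps = [
        r.output_tokens / r.decode_seconds
        for r in records
        if r.output_tokens is not None and r.decode_seconds
    ]
    tokens = [r.output_tokens for r in records if r.output_tokens is not None]
    times = [r.total_seconds for r in records if r.total_seconds is not None]
    return {
        "decode_tokens_per_second": median(tps) if tps else None,
        "output_tokens_per_task": median(tokens) if tokens else None,
        "median_seconds_per_task": median(times) if times else None,
    }
```

A metric whose inputs were never reported comes back as `None`, which a display layer can omit rather than fill with a guess.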

Current benchmark map

Use case	Current checks	Status
Chat and instruction following	IFEval, multi-turn chat memory, reasoning exact answer, sampled MMLU-Pro reference, deployment chat/batch/long-context telemetry.	Local decision checks plus selected reference lanes.
Coding and code editing	Coding static repair, EvalPlus HumanEval+, EvalPlus MBPP+, deployment telemetry.	Decision and executable reference evidence. Not SWE-bench or repo-edit proof.
Quant fidelity	Perplexity reference with a pinned InferGrade corpus and protocol.	Reference evidence for same-family quant comparison only.
Planned heavier lanes	GPQA, LiveCodeBench, SWE-bench Verified.	Not rendered as default runnable checks until harness, scoring, data, sandbox, and claim gates are satisfied.

Recommendation methodology

Start from the user's task, hardware, runtime, memory, trust, and benchmark scope.
Prefer exact or closely comparable evidence before broader family evidence.
Rank decision-grade candidates ahead of informational evidence when the candidate clears the comparison floor.
Use agent dogfood and native first-run evidence as labeled context unless it clears the same comparability and evidence gates as other results.
Show what the current evidence supports, what it does not support, and the next benchmark that would most improve confidence.
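The ordering rules above can be sketched as a sort key. This is an illustration only: the `Candidate` record, the three match tiers, and the relative priority of the rules (decision-grade status first, then closeness of match) are assumptions, since the methodology does not state how the rules interact.

```python
from dataclasses import dataclass
from typing import List

# Lower rank means closer evidence. Tier names are illustrative.
MATCH_RANK = {"exact": 0, "close": 1, "family": 2}


@dataclass
class Candidate:
    setup: str
    match: str            # "exact", "close", or "family"
    decision_grade: bool  # carries decision-grade evidence
    clears_floor: bool    # clears the comparison floor


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates: floor-clearing decision-grade first, then by match closeness."""
    def key(c: Candidate):
        drives_ranking = c.decision_grade and c.clears_floor
        # ASSUMPTION: evidence grade outranks match closeness.
        return (0 if drives_ranking else 1, MATCH_RANK[c.match])

    return sorted(candidates, key=key)
```

Under these assumptions, an exact-match candidate with only informational evidence sorts after a family-level candidate whose decision-grade evidence clears the floor.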

Dogfood evidence is useful, but bounded.

Agent dogfood runs are real public InferGrade development runs. They help stress-test recommendations, plots, tables, and runner paths. They do not imply broad community consensus or an official ranking.

Evidence labels users will see

Decision-grade evidence

Evidence that clears the current comparison floor for a scoped recommendation.

Reference evidence

Evidence from a deeper lane with stronger benchmark controls and preserved artifacts.

Agent dogfood evidence

Public development runs from labeled agent runners. Useful context, not broad validation.

Informational evidence

Stored evidence that can guide inspection but does not currently drive the main ranking.

Native first-run evidence

Local telemetry that shows a setup executed and uploaded. It is not enough by itself for a capability claim.

Failed, partial, or missing

Execution truth that explains a gap or failure without pretending it is a completed low score.

How user runs improve the system

A paired Runner emits provenance-rich result bundles with model, quant, runtime, hardware, task, scoring, and artifact metadata.
Published results add evidence for exact setup matches and nearby family or hardware comparisons.
Repeated and reference-lane runs can close thin-evidence gaps for future users.
Private results remain private unless the owner publishes them.

