Local LLM Benchmark — Star Map Comparison

Model Capability Star Map

How do small local models perform on real-world tasks?

Quality assessed by Claude Sonnet 4.5 (LLM-as-judge) · 126 tasks per model · GDPval dataset by OpenAI
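To make the LLM-as-judge setup concrete, here is a minimal sketch of how a grading step like this could work. The rubric wording, the 0–100 scale, and the `SCORE:` reply format are all assumptions for illustration; the actual prompt used with Claude Sonnet 4.5 is not shown in this report.

```python
import re

def build_judge_prompt(task: str, answer: str) -> str:
    """Assemble a grading prompt for the judge model (hypothetical rubric)."""
    return (
        "You are grading a model's answer to a real-world task.\n\n"
        f"Task:\n{task}\n\n"
        f"Answer:\n{answer}\n\n"
        "Rate the answer from 0 to 100 for correctness and usefulness.\n"
        "Reply with 'SCORE: <number>' on the last line."
    )

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from the judge's reply; 0.0 if none found."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", judge_reply)
    return float(match.group(1)) if match else 0.0
```

In a real run, `build_judge_prompt` output would be sent to the judge model once per task, and `parse_score` applied to each reply before averaging across the 126 tasks.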

Qwen2.5-3B: 67.4%   vs   Qwen3.5-2B: 75.0%   vs   Qwen3.5-0.8B: 48.4%
Avg Score: 67.4% across 9 categories
Strongest: Structured Output (92.9%)
Weakest: Reasoning (51.5%)
Tasks: 31 (20 passed)
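The summary cards above can be derived from per-category scores. In this sketch, only the Structured Output (92.9) and Reasoning (51.5) figures come from the star map; the other category names and values are hypothetical stand-ins, so the computed average will not match the reported 67.4%.

```python
# Per-category scores for one model. Only Structured Output and Reasoning
# are real figures from the dashboard; the rest are invented for illustration.
category_scores = {
    "Structured Output": 92.9,  # real figure
    "Reasoning": 51.5,          # real figure
    "Summarization": 71.0,      # hypothetical
    "Code Generation": 68.0,    # hypothetical
    "Extraction": 74.5,         # hypothetical
}

avg = sum(category_scores.values()) / len(category_scores)
strongest = max(category_scores, key=category_scores.get)
weakest = min(category_scores, key=category_scores.get)

passed, total = 20, 31  # task counts from the stat card
pass_rate = passed / total

print(f"Avg Score: {avg:.1f}% across {len(category_scores)} categories")
print(f"Strongest: {strongest} ({category_scores[strongest]}%)")
print(f"Weakest: {weakest} ({category_scores[weakest]}%)")
print(f"Tasks: {total} ({passed} passed, {pass_rate:.0%})")
```

The strongest/weakest cards are just the max and min over the category dictionary; the average treats all categories equally rather than weighting by task count, which is one possible design choice.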
Category Breakdown
THE HEADLINE
The newer Qwen3.5-2B scores 75% on practical tasks — beating the larger Qwen2.5-3B at 67%.
On professional GDPval tasks, all three converge: 52%, 52%, and 45%.
Running locally on a $300 GPU