Parameter counts used to be a bragging contest; 2024 flipped the script. Models under 15 B parameters now match or beat 70–180 B giants on many benchmarks, yet they run on a laptop NPU.
1 | Microsoft’s Phi family rewrites the scaling law
Medium’s deep dive shows Phi‑4‑Reasoning topping models 50× larger on Olympiad math while fitting on a consumer GPU/NPU.
2 | Mixtral & Mistral: MoE on a Diet
- Mixtral 8×7B uses sparse routing so only 2 of its 8 experts run per token: roughly 13 B active parameters out of ~47 B total capacity (see the routing sketch below).
- At equal latency, it beats many dense 34–70 B models.
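To make the routing idea concrete, here is a minimal, illustrative top‑2 mixture‑of‑experts layer in PyTorch. It is a sketch of the general technique, not Mixtral’s actual implementation; the dimensions, gating scheme, and SiLU feed‑forward experts are placeholder choices.

```python
# Minimal sketch of top-2 sparse MoE routing (illustrative, not Mixtral's real code).
# A router scores all experts per token, but only the 2 highest-scoring experts run,
# so active parameters stay far below total capacity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.router(x)                               # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)        # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                           # run only the selected experts
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 64)
print(Top2MoE()(tokens).shape)    # torch.Size([16, 64])
```

The key point is that capacity scales with the number of experts while per-token compute scales only with `top_k`, which is why the sparse model can hold its own against much larger dense ones at similar latency.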
3 | Gemma 7B & Llama 3 8B: Open‑source goes head‑to‑head
- Google’s Gemma 7B and Meta’s Llama 3 8B both surpass GPT‑3.5 on MMLU when fine‑tuned. (Exploding Topics; Klu)
4 | Qwen 2.5 & community quantization
r/LocalLLaMA users now run Qwen 2.5 32B quantized to around 4–5 bits on 24 GB GPUs, balancing output quality with local privacy. (Reddit)
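A hedged sketch of what that local setup typically looks like with Hugging Face transformers and bitsandbytes 4‑bit loading; the checkpoint name and generation settings below are assumptions to verify against the Qwen model card.

```python
# Sketch: load a Qwen 2.5 32B checkpoint with 4-bit quantization so it fits on a
# single 24 GB GPU. Model id and prompt are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-32B-Instruct"   # assumed checkpoint name
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                   # ~0.5 bytes/param -> roughly 18-20 GB of weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

prompt = "Summarize why small LLMs matter for on-prem deployment."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```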
5 | What makes a small LLM good in 2024?
Ingredient | Example | Impact
---|---|---
Curated, textbook‑style data | Phi series | data quality beats raw size
Synthetic “self‑play” data | Phi, Gemma | covers rare reasoning patterns
Sparse MoE | Mixtral | capacity grows while latency stays low
Long‑context attention (sliding window, RoPE scaling) | Mistral 7B, Llama 3.1 8B (128 k) | enterprise‑length documents
Quantization (4–8 bit) | Qwen 2.5 | cheap local inference (see sketch below)
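As a toy illustration of the quantization row above, the snippet below round‑trips weights through block‑wise 4‑bit quantization and measures the reconstruction error. Real formats (GGUF, GPTQ, AWQ) add zero‑points, outlier handling, and calibration data, so treat this purely as intuition for why quality survives.

```python
# Toy block-wise 4-bit weight quantization: one scale per group of 64 weights.
import numpy as np

def quantize_dequantize(w, bits=4, group=64):
    """Quantize weights per group to signed ints, then reconstruct them."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit signed
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)   # stored as int4 plus one scale/group
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=4096 * 64).astype(np.float32)
recon = quantize_dequantize(weights)
err = np.abs(weights - recon).mean() / np.abs(weights).mean()
print(f"mean relative error: {err:.3%}")                # typically a few percent
```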
6 | Deployment math: why clients care
- Cloud cost: Phi‑3 Mini (3.8 B) quantized to 4 bits fits in roughly 2–2.5 GB of VRAM, so it serves from the cheapest single‑GPU tiers (cents per GPU‑hour) instead of the multi‑GPU nodes a GPT‑4‑class model needs; see the back‑of‑the‑envelope script after this list.
- Carbon: a sub‑10 B model can cut CO₂ emissions per 1 M queries by roughly 80 % compared with a 70 B model.
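The arithmetic behind those numbers fits in a few lines. The GPU prices and throughput below are hypothetical placeholders; plug in your own provider’s figures.

```python
# Back-of-the-envelope deployment math (all inputs are assumptions to adjust).
def vram_gb(params_b, bits, overhead=1.2):
    """Approximate VRAM: weights at bits/8 bytes per param, plus ~20% for KV cache."""
    return params_b * bits / 8 * overhead

def cost_per_1m_queries(gpu_usd_per_hr, queries_per_sec):
    """Cost to serve 1M queries at a sustained throughput (hypothetical values)."""
    hours = 1_000_000 / queries_per_sec / 3600
    return gpu_usd_per_hr * hours

print(f"Phi-3 Mini (3.8B) @ 4-bit : ~{vram_gb(3.8, 4):.1f} GB")    # ~2.3 GB
print(f"70B dense model  @ 16-bit: ~{vram_gb(70, 16):.1f} GB")     # ~168 GB (multi-GPU)

# Hypothetical serving costs: $0.30/hr single GPU vs $16/hr 8-GPU node, both at 20 req/s.
print(f"small model: ${cost_per_1m_queries(0.30, 20):.2f} per 1M queries")
print(f"70B model  : ${cost_per_1m_queries(16.0, 20):.2f} per 1M queries")
```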
7 | Slim ML Playbook
- Model audit: choose the smallest model that meets your quality metrics (a selection sketch follows below).
- Data distillation: fine‑tune on refined domain‑specific and synthetic datasets.
- Edge or micro‑cloud deploy: Docker + GGUF for on‑prem or IoT targets.
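A minimal sketch of the audit step, assuming you already have an evaluation harness; the model names, sizes, and scores here are placeholders.

```python
# Pick the smallest candidate model that clears a quality threshold on your eval set.
from typing import Callable

def pick_smallest_model(candidates: list[tuple[str, float]],
                        eval_fn: Callable[[str], float],
                        threshold: float) -> str | None:
    """candidates: (model_name, size_in_billions); eval_fn returns a metric in [0, 1]."""
    for name, size in sorted(candidates, key=lambda c: c[1]):   # smallest first
        score = eval_fn(name)
        print(f"{name} ({size}B): {score:.3f}")
        if score >= threshold:
            return name                                         # smallest model that passes
    return None

# Hypothetical usage with a stubbed evaluator:
candidates = [("phi-3-mini", 3.8), ("llama-3-8b", 8.0), ("qwen-2.5-32b", 32.0)]
fake_scores = {"phi-3-mini": 0.71, "llama-3-8b": 0.78, "qwen-2.5-32b": 0.83}
best = pick_smallest_model(candidates, lambda m: fake_scores[m], threshold=0.75)
print("selected:", best)    # -> llama-3-8b
```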