Small Language Model
A compact AI language model — typically under 10 billion parameters — designed to run efficiently on edge devices and single GPUs while delivering strong task-specific performance.
Why It Matters
The AI industry is splitting into two lanes: massive frontier models for maximum capability and small, efficient models for practical deployment. For most production use cases — customer support, document processing, code completion, on-device assistants — an SLM fine-tuned for the task is cheaper, faster, and more private than calling a frontier API. Understanding this trade-off is essential for anyone making build-vs-buy decisions around AI.
What Makes a Model "Small"
There is no official threshold, but the industry generally considers models under roughly 10 billion parameters to be "small." The distinction is practical rather than theoretical: SLMs are defined by their hardware profile. A small model can run on a single consumer GPU, a laptop, or even a mobile phone — without requiring the multi-node clusters that frontier models demand.
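The hardware profile can be made concrete with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. A minimal sketch (the dtype sizes are standard; the GPU capacities in the comments are illustrative assumptions):

```python
# Approximate bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(params_billions: float, dtype: str) -> float:
    """Weight memory only; activations and the KV cache add more on top."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# A 7B model at fp16 needs ~14 GB, borderline on a 16 GB consumer GPU;
# quantized to int4 it drops to ~3.5 GB and fits a laptop GPU with room to spare.
assert model_memory_gb(7, "fp16") == 14.0
assert model_memory_gb(7, "int4") == 3.5
```

This is why the ~10B line is where it is: below it, a quantized model fits in the memory of a single consumer device.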
This constraint drives architectural decisions. SLMs use aggressive knowledge distillation, pruning, and quantization to pack as much capability as possible into fewer parameters. The result is a model that sacrifices some generality but often matches or exceeds larger models on specific tasks it has been optimized for.
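Of the techniques above, quantization is the easiest to see in miniature. The sketch below shows symmetric per-tensor int8 quantization, one common variant among many: each float weight is mapped to an 8-bit integer via a single scale factor, cutting storage 4x versus fp32 at the cost of bounded rounding error.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: floats -> int8 plus one scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Reconstruction error is bounded by half the quantization step (scale / 2).
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Production schemes add refinements (per-channel scales, calibration data, 4-bit formats), but the core trade of precision for footprint is the same.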
Key Examples
- Phi-4 (Microsoft) — a 14B-parameter model (just above the informal ~10B threshold, but commonly grouped with SLMs) that benchmarks competitively with models several times its size, trained on carefully curated synthetic data
- Gemma (Google) — available in 2B and 7B variants, optimized for on-device deployment and responsible AI research
- Llama 3.2 (Meta) — 1B and 3B versions designed for edge and mobile use cases, with multimodal capabilities at compact scale
- Ministral (Mistral) — purpose-built for edge computing, emphasizing low latency and offline operation
Trade-offs vs. Large Models
The case for SLMs comes down to three advantages and one honest limitation:
- Cost — inference is dramatically cheaper, often 10-50x less per token than frontier models, making high-volume production workloads feasible
- Privacy — running locally means data never leaves the device, which matters for healthcare, legal, and enterprise use cases with strict compliance requirements
- Latency — on-device inference eliminates network round-trips, enabling real-time applications like autocomplete, on-device agents, and embedded assistants
- Breadth vs. depth — SLMs trade generalist knowledge for task-specific strength. They handle focused workflows well but struggle with open-ended reasoning, multi-step planning, and tasks requiring broad world knowledge
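The cost advantage is easy to work through. A minimal sketch, using hypothetical per-token prices chosen only to land inside the 10-50x range cited above (not vendor quotes):

```python
def monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Dollar cost for a given monthly token volume at a flat per-token price."""
    return tokens_per_month / 1_000_000 * price_per_million

# Illustrative assumptions: $10.00 / 1M tokens for a frontier API,
# $0.25 / 1M tokens amortized for a self-hosted SLM.
frontier = monthly_cost(500_000_000, 10.00)  # 500M tokens/month
slm = monthly_cost(500_000_000, 0.25)

ratio = frontier / slm  # 40x, within the 10-50x range cited above
```

At high volumes the absolute gap, thousands of dollars per month in this sketch, is what makes otherwise marginal production workloads feasible.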