Custom SLM
Distillation, quantization, and pruning — a practical enterprise guide
Enterprise language model programs rarely fail because teams cannot train. They fail because serving economics do not match the story told during the pilot. Distillation, quantization, and pruning are three distinct families of techniques that can shrink latency and cost—but they trade off against different risks. Used in the wrong order, they waste weeks: you quantize a bloated teacher, discover accuracy cliffs, then attempt an emergency distillation under release pressure. Used in the right order, they produce a traceable compression trail your platform team can defend.
This article assumes you already have a task definition, a baseline model, and a frozen evaluation harness. If you do not, pause. Compression amplifies mistakes. A smaller model will memorize your evaluation leaks faster than a large one, and quantized weights will punish outliers you ignored in float32.
What problem does distillation actually solve?
Distillation trains a student to mimic a teacher on inputs drawn from (or close to) production. The student can be smaller because it does not need to model the entire internet—only the decision boundaries your application cares about. Distillation is not magic: if the teacher systematically fails on a slice of traffic, the student inherits that blind spot unless you fix labels or add targeted supervision.
Enterprise distillation projects work best when the teacher is already instrumented. You want logged prompts (scrubbed for PII according to policy), teacher outputs, and optional human edits that become silver-standard targets. Without that loop, students overfit to static datasets and fall apart under drift. Pair distillation with a routing policy: keep a larger private model on the slow path for low-confidence cases, as described in hybrid routing engagements.
| Stage | Primary goal | Typical artifacts | Watchouts |
|---|---|---|---|
| Teacher selection | Establish quality ceiling | Baseline evals, refusal policy | Teacher too large to run in CI |
| Data generation | Cover real distribution | Logged prompts, augmented sets | PII leakage into training stores |
| Student training | Match teacher on targets | Checkpoints, loss curves | Collapse into generic answers |
| Validation | Prove regressions bounded | Side-by-side reports, red-team | Overfitting to eval prompts |
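The student-training stage above typically optimizes a blended objective: match the teacher's softened output distribution while still learning from hard labels (including any human-edited silver targets). A framework-agnostic sketch in NumPy, where the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not recommendations:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: cross-entropy against the teacher's temperature-softened
    # distribution, scaled by T^2 so gradient magnitudes stay comparable
    # across temperatures.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * (T * T)
    # Hard term: standard cross-entropy on (possibly human-edited) labels.
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard
```

A student whose logits track the teacher scores strictly lower than one that diverges, which is exactly what the go/no-go gates later in the pipeline should measure.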
When is post-training quantization the right next step?
Quantization maps high-precision weights and activations into lower-bit formats (for example INT8 or INT4 schemes) to reduce memory bandwidth and improve throughput. It is attractive because it can be applied without retraining in some cases, especially on well-behaved models and GPUs with fast kernels. The failure mode is calibration sensitivity: outliers in activations can dominate scales and destroy accuracy on rare tokens that matter legally or financially.
Treat quantization as a measurement exercise first. Run accuracy and latency benchmarks per layer type, and keep rollback artifacts. If your organization requires reproducible builds, pin kernel versions and record which quantization recipe produced each release. For teams operating dedicated hardware, align quantization choices with what your SLM infrastructure vendor or internal platform actually accelerates—there is little value in INT4 if your serving stack falls back to slow paths.
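The calibration sensitivity described above is easy to demonstrate. A minimal sketch of symmetric per-channel INT8 quantization, where deriving scales from a percentile of the absolute values (instead of the max) limits how much a single outlier degrades every other value in its channel; the percentile choice is an illustrative knob, not a recommendation:

```python
import numpy as np

def quantize_int8(weights, clip_percentile=None):
    """Symmetric per-channel INT8 quantization (rows = channels).

    With clip_percentile set (e.g. 99.5), scales come from a percentile
    of |w| rather than the max, so one outlier cannot dominate a channel.
    """
    absw = np.abs(weights)
    if clip_percentile is None:
        maxima = absw.max(axis=1, keepdims=True)
    else:
        maxima = np.percentile(absw, clip_percentile, axis=1, keepdims=True)
    scales = np.maximum(maxima, 1e-8) / 127.0
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales
```

Injecting one large activation outlier into a channel and comparing reconstruction error with and without clipping shows why calibration sets must look like production traffic: the scheme that wins on clean microbenchmarks can lose badly on real distributions.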
How does pruning differ from quantization?
Pruning removes parameters or entire structures (attention heads, channels, layers) based on saliency or structured sparsity patterns. Unstructured pruning can yield high sparsity but may not translate to speedups unless hardware and kernels exploit it. Structured pruning—thinning layers to predictable shapes—often plays better with real accelerators and predictable batching.
Pruning interacts with distillation: a common pattern is train → distill → prune → quantize, but the optimal order depends on your accuracy budget. Some teams prune early to reduce training cost; others prune late to avoid destabilizing the student during distillation. The enterprise decision driver is risk management: each step should have a go/no-go gate tied to eval thresholds, not intuition.
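Structured pruning can be made concrete with a small sketch: rank attention heads by a saliency proxy (here, the L2 norm of each head's slice of the output projection) and keep the top-k, which leaves predictable tensor shapes that real kernels can exploit. This is a toy illustration; production pipelines would prune Q/K/V consistently with the output projection and fine-tune afterwards:

```python
import numpy as np

def prune_heads(W_o, num_heads, keep):
    """Keep the `keep` highest-saliency heads of an output projection.

    W_o: array of shape (num_heads * head_dim, d_model).
    Saliency = Frobenius norm of each head's slice (a simple proxy).
    Returns the pruned matrix and the indices of the kept heads.
    """
    head_dim = W_o.shape[0] // num_heads
    slices = W_o.reshape(num_heads, head_dim, -1)
    saliency = np.linalg.norm(slices, axis=(1, 2))
    kept = np.sort(np.argsort(saliency)[-keep:])  # top-k, original order
    return slices[kept].reshape(keep * head_dim, -1), kept
```

Because the output is a dense, smaller matrix rather than a sparse mask, throughput gains do not depend on sparse-kernel support—the point of the structured-versus-unstructured distinction above.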
How should evaluation change during compression?
Compression changes the error surface. You need three layers of tests:
- Regression suite on frozen prompts with known-good references (allow small stylistic drift, not factual drift).
- Stress suite for long contexts, multilingual snippets, and malformed inputs—exactly where quantization bites.
- Safety suite for refusals, PII handling, and policy violations, especially if distillation accidentally rewards overly agreeable answers.
Report metrics in business terms where possible: defect rate per thousand tickets, percentage of clauses correctly flagged, or human review hours saved—whatever your workflow actually values. Token-level perplexity is a diagnostic, not a KPI.
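The go/no-go gates mentioned throughout can be as simple as a thresholded comparison against the frozen baseline. A hypothetical sketch—metric names and tolerances are illustrative, and metrics are assumed higher-is-better:

```python
def compression_gate(baseline, candidate, max_deltas):
    """Return (passed, failures) for a candidate vs. the frozen baseline.

    max_deltas maps metric name -> largest tolerated regression
    (a positive number; metrics are assumed higher-is-better).
    """
    failures = []
    for metric, tolerance in max_deltas.items():
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance:
            failures.append((metric, round(drop, 4)))
    return len(failures) == 0, failures

# Hypothetical business-facing metrics from the regression suite:
baseline = {"clause_flag_rate": 0.92, "refusal_accuracy": 0.98}
candidate = {"clause_flag_rate": 0.90, "refusal_accuracy": 0.95}
ok, fails = compression_gate(
    baseline, candidate,
    {"clause_flag_rate": 0.03, "refusal_accuracy": 0.01},
)
```

The output of a gate like this—not a reviewer's intuition—is what should decide whether a quantized or pruned candidate proceeds to the next stage.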
What does a sensible timeline look like?
- Weeks 1–2: freeze the task, baseline the teacher, build the eval harness.
- Weeks 3–5: generate distillation data, train student candidates, run side-by-side reviews with domain experts.
- Week 6: select the student; run quantization experiments with calibration sets drawn from production-shaped traffic (not from the training set).
- Week 7: load tests, incident runbooks, and rollback drills.

This is illustrative—regulated environments may stretch gates—but the sequencing matters more than the calendar.
If you are deciding whether to rent capacity versus own it, pair this timeline with the economic discussion in on-prem SLM inference vs rented GPU cloud. Compression changes monthly burn more dramatically than most fine-tunes.
Which roles need to be in the room?
Machine learning engineers own recipes, but platform engineering owns kernels and batching, security owns data handling for logged prompts, and finance should see cost-per-inference projections that include failover to larger models. Without finance, teams optimize accuracy; without security, teams create shadow datasets; without platform, models never hit the predicted throughput.
How should you document compression for auditors?
Auditors and customers increasingly ask for model lineage: which teacher produced which student, which dataset slices were used, and which evaluation gates passed. Treat compression like a regulated build pipeline. Store immutable checkpoint identifiers, link them to evaluation reports, and record quantization recipes (calibration sample sizes, outlier handling, and kernel versions). If you cannot reproduce a release, you cannot defend it.
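One way to make that lineage concrete is an immutable manifest written at each release. The field names below are illustrative, not a standard schema; the point is that a content-derived release identifier changes whenever any lineage-relevant input changes:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CompressionManifest:
    teacher_checkpoint: str   # immutable identifier, e.g. a content hash
    student_checkpoint: str
    dataset_slices: tuple     # which logged-prompt slices were used
    quant_recipe: str         # e.g. "int8-perchannel-p99.5"
    calibration_samples: int
    kernel_versions: tuple    # pinned serving-kernel versions
    eval_report_id: str       # links to the gate results that passed

    def release_id(self) -> str:
        # Deterministic digest of every field: change any input,
        # get a different release identifier.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:16]
```

Stored alongside the checkpoint and eval report, a record like this is what lets you answer "which teacher produced this student, under which recipe?" months later without archaeology.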
Documentation should also cover data minimization: what was logged for distillation, how long it is retained, and who can access it. Compression projects often increase logging volume because teams crave more teacher/student pairs—coordinate with DPOs early so logging defaults do not violate retention policies.
Why do tooling gaps cause the most rework?
The fastest way to burn a quarter is to discover—after shipping—that your serving stack does not support the fused kernels you assumed, or that your observability tool cannot attribute latency to individual model versions. Before you lock a compression plan, validate end-to-end on production-shaped batches: authentication overhead, JSON serialization, retrieval calls, and GPU queueing all eat the savings quantization promised on microbenchmarks.
If your organization is still maturing MLOps, bias toward fewer moving parts: one student, one quantization scheme, one routing rule. Prove stability, then expand. Prematurely stacking techniques creates incidents where nobody knows which layer caused the regression.
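"One routing rule" really can be one rule. A minimal sketch, assuming your serving layer exposes a per-response confidence score (for example, mean token log-probability); the threshold and model names are placeholders:

```python
def route(confidence: float, threshold: float = -0.35) -> str:
    """Single-rule router: send low-confidence requests to the larger
    fallback model on the slow path; everything else stays on the
    quantized student. Threshold must be tuned on held-out traffic.
    """
    return "student-int8" if confidence >= threshold else "fallback-llm"
```

Starting this simple makes incident attribution tractable: if quality regresses, there is exactly one student, one quantization scheme, and one threshold to inspect.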
How do you set guardrails so students stay “boring”?
Enterprises usually want models that are reliably dull: correct, cautious, and consistent with policy. Distillation can accidentally reward fluency over correctness if targets are noisy. Mitigate by mixing silver labels with constraint losses—penalties for disallowed claims—and by keeping a living “do not imitate” set for toxic teacher behaviors. Red-team the student specifically for overconfidence on out-of-domain prompts; students often extrapolate more aggressively than teachers when logits are sharpened.
Also watch length bias. If teachers ramble, students learn to ramble, which increases latency and cost downstream. Add length-aware rewards or post-process templates so outputs match operational expectations.
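A length-aware reward can be as blunt as subtracting a penalty for tokens beyond the operational target, so verbose students stop being rewarded for rambling. The target and weight below are illustrative knobs, not recommendations:

```python
def length_penalized_score(quality: float, n_tokens: int,
                           target: int = 150, weight: float = 0.002) -> float:
    """Subtract a linear penalty for every token beyond `target`.

    Outputs at or under the target are scored on quality alone;
    longer outputs trade quality credit against latency cost.
    """
    return quality - weight * max(0, n_tokens - target)
```

Applied during candidate selection or as a reward signal, this keeps the stylistic preferences of a rambling teacher from becoming a permanent latency tax.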
Finally, plan for teacher upgrades. When your teacher model jumps a version, student behavior can shift even if student weights are unchanged—because the world around the student changed. Maintain compatibility tests whenever teachers update, and keep at least one frozen teacher checkpoint for regression comparisons until the student is revalidated.
If you compress multiple students for different locales or business units, namespace their artifacts aggressively. Nothing is more painful than discovering two teams quantized “the same” student with different calibration sets and shipped them under identical version labels.
Keep a compression changelog alongside your model changelog: which techniques landed, which eval deltas were accepted, and who signed the risk acceptance. Future you—and future auditors—will thank present you.
When budgets tighten, compression projects are tempting because they look technical instead of political. Anchor them to finance-visible dashboards so they compete fairly against other initiatives—not as a magic wand owned solely by ML.
Key takeaways
- Distillation shapes behavior; quantization fits numbers into efficient formats; pruning changes which weights exist—sequence them with gates, not hope.
- Build evals before you compress; compression exposes sloppy baselines.
- Treat quantization calibration as part of release engineering, not a one-off script.
- Pair compressed models with routing and observability—see also SLM vs LLM decision notes.
Ready to map compression to your residency and latency constraints? Start a PoC conversation and we will propose a measurable student model scope.
Related articles
- Why fine-tuning alone is not enough for enterprise SLMs
Fine-tuning moves the loss curve, but production SLMs need latency, cost, and governance properties that training alone rarely delivers.
- On-prem SLM inference vs rented GPU cloud: how to choose
The decision is not ideological—it is a bundle of networking, procurement, incident response, and unit economics that changes with your traffic shape.
- SLM vs LLM in the enterprise: a practical decision framework
Use a scorecard—not slogans—to decide when a specialized small model should own a workflow versus when a larger private LLM must stay in the loop.