# Private AI infrastructure: designing boundaries that survive audits
“Private AI” means different things to different buyers. For some, it is no third-party inference. For others, it is EU-only processing with named subprocessors. For defense customers, it may mean air-gapped training and serving with portable artifacts. Infrastructure architects must translate marketing language into concrete controls: VPC boundaries, key management, backup locations, admin access paths, and logging scopes.
This guide helps you align technical design with what Legal commits in DPAs. If your program also needs model specialization, read the enterprise SLM guide for delivery sequencing and on-prem vs rented GPU for hosting economics.
## What are the non-negotiable building blocks?
- Identity: machine-to-machine auth for every caller; no shared API keys in source control.
- Encryption: TLS in transit, KMS-backed keys at rest, clear rotation owners.
- Segmentation: inference subnets, separate admin planes, egress filtering.
- Observability: metrics and traces that exclude sensitive payloads by default.
- Change management: versioned infrastructure-as-code and auditable promotions for models.
Skip any one block and your “private” story develops holes auditors love to widen.
## How should data classification drive architecture?
Classify inputs and outputs before you sketch networks. Public marketing copy, internal operational notes, confidential customer payloads, and restricted regulated data each imply different storage, retention, and cross-border rules. Map classes to perimeters: which subnets, buckets, and KMS keys apply.
For generative workloads, remember outputs are data. Logs that capture full prompts may reclassify a system from “internal AI helper” to “database of customer secrets.” Prefer hashed identifiers, truncated samples, and configurable redaction pipelines.
| Data class | Typical controls | Common mistakes |
|---|---|---|
| Public | CDN caching allowed | Accidentally mixing in user uploads |
| Internal | SSO + VPC | Overly broad engineer access |
| Confidential | Field-level encryption, strict RBAC | Debug dumps in tickets |
| Restricted | Air-gap or dedicated enclave | Shadow copies on laptops |
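The hashed-identifier, truncation, and redaction approach described above can be sketched in Python; the patterns, salt, and placeholder format here are illustrative, not a prescribed schema:

```python
import hashlib
import re

# Illustrative patterns; a real deployment would maintain these per data class.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b\d{13,16}\b"),
}

def hash_identifier(value: str, salt: str = "rotate-me") -> str:
    """Replace a raw identifier with a stable hash usable for log correlation."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def redact(text: str) -> str:
    """Apply each pattern, replacing matches with a hashed placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(lambda m: f"<{name}:{hash_identifier(m.group())}>", text)
    return text

def truncate_sample(text: str, limit: int = 64) -> str:
    """Keep only a short prefix of a payload for debugging logs."""
    return text if len(text) <= limit else text[:limit] + "…[truncated]"
```

The salt should itself be a managed secret with a rotation owner; hashes stay correlatable within a rotation window but cannot be reversed into the original identifier.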
## What network patterns work for SLM serving?
A pragmatic pattern is API gateway → auth → policy → inference workers → optional tool plane. The gateway terminates TLS and applies rate limits. The policy layer enforces tenant scoping and content rules before tokens hit GPUs. Tool calls (retrieval, CRM lookups) traverse separate service accounts with least privilege.
For multi-tenant SaaS, enforce hard isolation at the data layer—row-level security, separate schemas, or separate databases per tier depending on risk. Soft isolation with only application-level checks has failed enough times to be a case study genre.
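A minimal sketch of the policy layer described above, enforcing tenant scoping and data-class rules before a request reaches inference workers; the scope names and class labels are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caller:
    service_id: str       # machine identity from mTLS / workload identity
    tenant_id: str
    scopes: frozenset     # e.g. {"inference:invoke"}

@dataclass(frozen=True)
class Request:
    tenant_id: str
    data_class: str       # "public" | "internal" | "confidential" | "restricted"

# Illustrative policy: which data classes each scope may touch.
ALLOWED = {
    "inference:invoke": {"public", "internal", "confidential"},
}

class PolicyDenied(Exception):
    pass

def authorize(caller: Caller, request: Request) -> None:
    """Enforce tenant scoping and data-class rules before tokens hit GPUs."""
    if caller.tenant_id != request.tenant_id:
        raise PolicyDenied("cross-tenant access")
    if not any(request.data_class in ALLOWED.get(s, set()) for s in caller.scopes):
        raise PolicyDenied(f"scope does not cover data class {request.data_class}")
```

Note that "restricted" is absent from the default scope's allow-set, so restricted traffic is denied unless a policy explicitly grants it, which matches the deny-by-default posture.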
## How do keys and secrets actually get managed?
Use a cloud KMS or HSM with customer-managed keys when contracts require it. Separate keys for model artifacts, application data, and backups. Document who can invoke decrypt operations and how break-glass works. Emergency access should be time-boxed, multi-party, and logged aggressively.
Rotate keys on a schedule and after incidents. Models may need re-downloading after rotation—practice that path.
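Break-glass access as described above (time-boxed, multi-party, logged aggressively) can be sketched as follows; the TTL and approval count are illustrative defaults, not recommendations:

```python
import time

class BreakGlassGrant:
    """Time-boxed, multi-party emergency access with an append-only audit trail."""

    def __init__(self, requester, reason, ttl_seconds=3600, approvals_required=2):
        self.requester = requester
        self.reason = reason
        self.expires_at = time.time() + ttl_seconds
        self.approvals_required = approvals_required
        self.approvers = set()
        self.audit_log = [("requested", requester, reason)]

    def approve(self, approver):
        """Record an approval; separation of duties forbids self-approval."""
        if approver == self.requester:
            raise ValueError("requester cannot self-approve")
        self.approvers.add(approver)
        self.audit_log.append(("approved", approver, self.reason))

    def is_active(self):
        """Access exists only while quorum holds and the time box has not expired."""
        return (len(self.approvers) >= self.approvals_required
                and time.time() < self.expires_at)
```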
## What should logging and monitoring exclude?
Default to structured logs without bodies. Store request IDs, latency histograms, token counts, routing decisions, and error classes. When debugging requires payloads, use temporary elevated logging with automatic expiry and explicit approvals.
Tracing tools can accidentally capture secrets if engineers annotate spans carelessly. Provide libraries that sanitize by default and train teams on safe instrumentation.
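A sanitize-by-default library of the kind suggested above might look like this sketch, where only an explicit allow-list of fields ever reaches the log sink; the field names are assumptions, not a standard:

```python
import json

# Deny-by-default: only these keys ever reach the log sink.
SAFE_FIELDS = {"request_id", "latency_ms", "token_count", "route", "error_class"}

def safe_log(event: dict) -> str:
    """Emit a structured log line, dropping anything not explicitly allowed."""
    dropped = sorted(set(event) - SAFE_FIELDS)
    record = {k: v for k, v in event.items() if k in SAFE_FIELDS}
    if dropped:
        # Visible signal that data was withheld, without leaking its contents.
        record["dropped_fields"] = dropped
    return json.dumps(record, sort_keys=True)
```

Recording which field names were dropped (but never their values) gives engineers a debugging breadcrumb while keeping payloads out of storage.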
## How do backups and DR respect privacy?
Backups are copies—subject to the same residency rules as primaries. If you replicate to another region for DR, update DPIAs. Test restores regularly; encrypted backups you cannot restore are indistinguishable from data loss.
For model weights, keep checksum manifests so restores cannot silently drift. Pair manifests with evaluation snapshots so you know what behavior a restored artifact implies.
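A checksum manifest for restore verification can be sketched as below; file contents are represented as a mapping of relative path to bytes for clarity, where a real implementation would read them from the artifact directory:

```python
import hashlib

def digest(data: bytes) -> str:
    """sha256 hex digest of one artifact file's contents."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(files: dict) -> dict:
    """Record a digest per file so restores cannot silently drift.

    `files` maps relative path -> raw bytes (e.g. read from the artifact dir).
    """
    return {name: digest(data) for name, data in sorted(files.items())}

def drifted_files(files: dict, manifest: dict) -> list:
    """Return paths whose contents differ from the manifest, plus any
    files that are missing from either side."""
    current = build_manifest(files)
    return sorted(
        name for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```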
## How does human access fit the threat model?
Vendor support, platform admins, and ML engineers are all insider threats in the compliance sense—even when benevolent. Use just-in-time access, session recording where appropriate, and separation of duties for production changes. Admin VPNs should not double as developer convenience paths.
## How do SLMs change the footprint?
Smaller models enable edge and on-prem deployments that large models cannot fit into—see compression guidance in distillation and quantization. Smaller footprints reduce attack surface but increase version-skew risk across sites. Centralize policy distribution and maintain compatibility tests for edge packages.
## How do you apply zero trust to model endpoints?
Assume breach. Every inference request should present cryptographic identity, be authorized against a policy that includes tenant and data class, and be rate-limited per caller to contain abuse. Mutual TLS between internal services beats long-lived bearer tokens stuffed into environment variables.
Publish threat models for prompt injection, model denial-of-service (massive contexts), and unauthorized tool invocation. Mitigations combine input sanitation, output filtering, tool sandboxing, and human approvals for high-impact actions. The model is not the security boundary—your orchestration layer is.
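Per-caller rate limiting as mentioned above is commonly implemented as a token bucket keyed on the caller's cryptographic identity. A minimal single-threaded sketch; a production version would need locking, eviction, and tuned rates:

```python
import time

class TokenBucket:
    """Per-caller rate limit to contain abuse at the inference endpoint."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)       # start full: allow an initial burst
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill proportionally to elapsed time, then try to spend `cost`."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per cryptographic caller identity, not per IP.
buckets: dict = {}

def check(caller_id: str) -> bool:
    bucket = buckets.setdefault(caller_id, TokenBucket(rate_per_sec=5, burst=10))
    return bucket.allow()
```

Keying on identity rather than IP matters: behind NAT or a service mesh, IP-based limits punish the wrong callers.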
## What about supply chain security for weights and containers?
Treat base weights like third-party binaries: verify checksums, scan containers, and pin dependencies. Prefer internal registries that cache approved artifacts. When fine-tuning or distilling, record provenance links from dataset versions to checkpoint IDs so you can reconstruct lineage after the fact.
CI/CD pipelines that build serving images should use immutable tags and signed attestations where possible. If someone can push a replacement image without detection, your private network is theater.
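Digest pinning for weights and images, as described above, can be sketched as follows; the artifact name and pinned value are placeholders for entries that would come from a signed manifest in your internal registry:

```python
import hashlib

# Placeholder pins; real values would be distributed via a signed manifest.
PINNED = {
    "base-weights-v3.safetensors": "sha256:" + "0" * 64,
}

class SupplyChainError(Exception):
    pass

def verify_artifact(name: str, payload: bytes) -> None:
    """Refuse to load any artifact whose digest does not match the pin."""
    expected = PINNED.get(name)
    if expected is None:
        raise SupplyChainError(f"{name} has no pinned digest; refusing to load")
    actual = "sha256:" + hashlib.sha256(payload).hexdigest()
    if actual != expected:
        raise SupplyChainError(f"digest mismatch for {name}")
```

The important property is fail-closed behavior: an unpinned or mismatched artifact is an error, never a warning.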
## How should egress be controlled when models call tools?
Tooling is where “private AI” leaks. Retrieval may call open web indexes; code agents may hit package registries. Define allow-lists per environment: production SLMs might only reach internal vector stores and approved APIs, while research sandboxes are looser with different data rules.
Log tool outcomes, not necessarily full payloads. Alert on new domains, unusual error spikes, and sudden increases in token usage—often the first sign of runaway agents or abuse.
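A per-environment egress allow-list check for tool calls might look like this sketch; the hostnames are hypothetical:

```python
from urllib.parse import urlparse

# Per-environment allow-lists; production is strict, research is looser.
ALLOW_LISTS = {
    "production": {"vectors.internal.example", "crm.internal.example"},
    "research": {"vectors.internal.example", "pypi.org", "files.pythonhosted.org"},
}

def egress_allowed(environment: str, url: str) -> bool:
    """Tool calls may only reach hosts on the environment's allow-list."""
    host = urlparse(url).hostname or ""
    return host in ALLOW_LISTS.get(environment, set())
```

An application-layer check like this complements, and never replaces, network-level egress filtering; denied hosts are exactly the "new domains" worth alerting on.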
## What breaks when drivers and firmware drift?
GPU stacks are fragile. A silent driver upgrade can change numerics or kernel fusion behavior, shifting latency distributions without a “model change.” Pin driver stacks in production, test upgrades in staging with shadow traffic, and keep rollback images handy.
Patching policy must reconcile security urgency with ML stability. Document who approves exceptions and for how long.
## How do hybrid and multi-cloud designs complicate residency?
Hybrid is attractive—train in one footprint, serve in another—but data movement must match legal narratives. If embeddings replicate across clouds, your DPIA must say so. Prefer explicit replication jobs with monitoring over implicit caches nobody documented.
For multi-cloud, standardize identity (e.g., workload identities) and avoid divergent secret stories per provider. Complexity is the enemy of auditability.
## How do SIEM and SOC teams consume AI telemetry?
Give them signals, not raw prompts. Useful events: authentication failures, policy denials, abnormal token usage, tool egress blocks, model version promotions, and configuration drift. Map events to MITRE-style tactics where helpful so existing playbooks extend naturally.
Run joint tabletop exercises: “Model endpoint keys leaked—what rotates first?” and “Ransomware in training subnet—how do we isolate without deleting weeks of work?”
## What is the relationship between compliance cost and incident cost?
| Investment area | Upfront cost | Failure mode if skipped |
|---|---|---|
| Classification + DPIAs | Legal time | Regulatory sanctions, contract breaches |
| Segmentation + IAM | Engineering time | Lateral movement, data exfiltration |
| Observability | Ongoing storage | Blind incidents, slow MTTR |
| DR testing | Ops time | Real data loss during outages |
| Training for staff | Enablement | Shadow tools bypass controls |
## How do customer-managed keys (CMK) change operations?
CMK shifts responsibility: customers hold key material; you must handle denial of access gracefully. Design inference paths to surface clear errors when KMS policies block decrypt, and avoid caching decrypted weights longer than necessary. Test key rotation with realistic load—some systems freeze when rotation events overlap with peak traffic.
Document shared responsibility: who patches hypervisors, who manages HSMs, who audits access logs. Ambiguity becomes finger-pointing during incidents.
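Graceful handling of a KMS denial with a bounded decrypt cache, as described above, can be sketched as below; `kms_decrypt` stands in for whatever client call your provider exposes and is injected rather than named after any real SDK:

```python
import time

class KmsAccessDenied(Exception):
    """Raised when the customer's KMS policy blocks a decrypt call."""

class WeightLoader:
    """Surface clear errors on KMS denial; never cache decrypted weights
    longer than a short TTL."""

    def __init__(self, kms_decrypt, cache_ttl_seconds=300):
        self.kms_decrypt = kms_decrypt   # injected client call (hypothetical)
        self.cache_ttl = cache_ttl_seconds
        self._cache = {}                 # key_id -> (plaintext, expiry)

    def load(self, key_id, ciphertext):
        cached = self._cache.get(key_id)
        if cached and cached[1] > time.time():
            return cached[0]
        try:
            plaintext = self.kms_decrypt(key_id, ciphertext)
        except KmsAccessDenied:
            # A clear, actionable error instead of a generic 500.
            raise RuntimeError(
                f"customer KMS policy denied decrypt for {key_id}; "
                "inference unavailable until access is restored"
            )
        self._cache[key_id] = (plaintext, time.time() + self.cache_ttl)
        return plaintext
```

The short TTL bounds how long a revoked key keeps serving: once the cache entry expires, the customer's denial takes effect on the next request.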
## What extra scrutiny applies in EU deployments?
EU customers often ask for subprocessor transparency, DPA terms matching Schrems-era expectations, and clarity on transfer mechanisms if any telemetry leaves the region. Even when compute stays in Frankfurt, support tooling headquartered elsewhere may still implicate transfers—map it honestly.
Pair legal answers with technical proof: network diagrams showing packet paths, lists of IP ranges touched, and retention periods per log stream. Lawyers and engineers should review each other’s drafts to catch mismatches.
## How should agent and orchestration layers inherit policies?
Agents compound risk: more tools, more tokens, more opportunities for injection. Inherit the strictest data class handled anywhere in the workflow. If one step touches restricted data, the whole trace should run in the restricted perimeter unless you have formally proven isolation—rarely true on first release.
Orchestration should enforce per-step authorization: which tools exist, which credentials they use, and what constitutes a successful completion. Link to agent orchestration services when multi-step automation is in scope, but never let orchestration bypass network policy.
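Inheriting the strictest data class across a workflow reduces to a maximum over an ordered scale. A sketch, with perimeter names as illustrative labels:

```python
# Ordered from least to most sensitive, matching the classes used earlier.
ORDER = ["public", "internal", "confidential", "restricted"]

def effective_class(step_classes: list) -> str:
    """A workflow inherits the strictest data class handled by any step."""
    return max(step_classes, key=ORDER.index)

def required_perimeter(step_classes: list) -> str:
    """Map the effective class to where the whole trace must run."""
    cls = effective_class(step_classes)
    return "restricted-enclave" if cls == "restricted" else "standard-vpc"
```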
## How does this connect to model tier decisions?
Infrastructure choices interact with which model size you can afford to run privately. If networking or GPU constraints bite, revisit the business scorecard in SLM vs LLM decisions—sometimes a smaller SLM inside a tight perimeter beats a larger model that tempts risky shortcuts.
## How should penetration tests and red teams scope AI systems?
Traditional app pen tests miss model-specific failures: prompt injection, data exfiltration via creative completions, and tool misuse. Scope should include authenticated and unauthenticated surfaces, admin APIs, notebook gateways, and CI tokens that can reach training data. Provide safe harness environments so testers do not accidentally exfiltrate real customer payloads.
Red teams should attempt privilege escalation via prompts—“ignore previous instructions” variants, indirect injection via retrieved documents, and multi-turn jailbreaks. Findings should feed directly into eval suites so regressions are caught pre-release.
## What belongs in a responsible disclosure program?
Publish a security contact, expected response times, and safe harbor language for good-faith research. Internally, route findings to owners with SLA-backed triage. Model vulnerabilities may not have CVEs—track them like product defects with severity, workaround, and fix versions.
Finally, rehearse communications: what you tell customers during an incident, what you omit, and how quickly legal approves statements. The technical response can be perfect while trust erodes from silence.
## What evidence bundle should you prepare for customers?
Prepare a security pack: architecture diagram, data flow narrative, subprocessors, encryption specs, logging policy, incident response summary, and pen-test status. Update it when infrastructure changes—not quarterly slide decks nobody trusts.
## Key takeaways
- Private AI is controls + evidence, not a single “private checkbox” on a vendor form.
- Classify data first; architecture follows.
- Logging defaults should be privacy-preserving; escalate intentionally.
- DR and backups must honor residency and restore reality, not wishful thinking.
If you want an external architecture review against your residency commitments, contact SLM-Works with your current diagrams—we will map gaps before they become customer escalations.
## Related articles
- On-prem SLM inference vs rented GPU cloud: how to choose
  The decision is not ideological—it is a bundle of networking, procurement, incident response, and unit economics that changes with your traffic shape.
- SLM vs LLM in the enterprise: a practical decision framework
  Use a scorecard—not slogans—to decide when a specialized small model should own a workflow versus when a larger private LLM must stay in the loop.
- Distillation, quantization, and pruning — a practical enterprise guide
  Compression is not a single knob. Here is how distillation, quantization, and pruning interact when you need smaller models without wrecking production metrics.