Running an LLM inside your VPC without making it your full-time job
When private AI actually pays for itself, what the realistic stack looks like (LibreChat / OpenWebUI / Ollama on AWS), and the two failure modes that kill these projects.
Almost every team we talk to about "private AI" is asking the same underlying question: can we get the productivity of ChatGPT without sending our data to OpenAI? The honest answer is "mostly, with caveats", and the cost-benefit only swings clearly in your favour for a small set of workload shapes.
This post is the version we wish more buyers heard before they bought a $300k self-hosted LLM contract.
When private AI actually pays for itself
Three honest reasons to run inference inside your own VPC:
- Data sensitivity that contractually cannot leave your boundary. Healthcare, financial services, classified-adjacent work, regulated personal data with strong residency clauses.
- Concentration risk reduction. You already pay six figures a year to one model vendor and your CTO does not want a single-vendor dependency on the application's critical path.
- Workloads dominated by structured retrieval. RAG over your internal docs where the model is mostly stitching context together, not doing frontier reasoning. Smaller open-weight models become genuinely competitive here.
Reasons that sound like good ones but usually are not:
- "We want to save money on the API bill." For most teams the OpenAI / Anthropic bill is well below the fully-loaded cost of running and operating GPU inference yourself. Run the maths before you commit.
- "Our developers will be more productive with their own LLM." They will usually be less productive than they would be on a frontier model, and that productivity gap is real.
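A minimal sketch of what "run the maths" might look like, assuming you can estimate a GPU hourly price, the share of an engineer's time that day-2 operations will consume, and a blended API price per million tokens. Every number in the example is a placeholder rather than a quote from any price list, and the names (SelfHostedCost, break_even_tokens) are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a calendar month


@dataclass
class SelfHostedCost:
    """Fully-loaded monthly cost of running GPU inference yourself (all inputs assumed)."""
    gpu_hourly_usd: float    # hourly price of one GPU instance (placeholder)
    instances: int           # number of always-on instances
    ops_fte_fraction: float  # share of one engineer spent on day-2 work
    fte_monthly_usd: float   # fully-loaded monthly cost of that engineer

    def monthly_usd(self) -> float:
        compute = self.gpu_hourly_usd * HOURS_PER_MONTH * self.instances
        people = self.ops_fte_fraction * self.fte_monthly_usd
        return compute + people


def api_monthly_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly bill from a hosted model API at a blended per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(cost: SelfHostedCost, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which the API bill equals the self-hosted cost."""
    return cost.monthly_usd() / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: replace with your own quotes and salary data.
    cost = SelfHostedCost(gpu_hourly_usd=1.0, instances=1,
                          ops_fte_fraction=0.25, fte_monthly_usd=12_000.0)
    print(f"self-hosted: ${cost.monthly_usd():,.0f}/month")
    print(f"break-even:  {break_even_tokens(cost, 5.0):,.0f} tokens/month")
```

If your actual monthly token volume sits well below the break-even figure, the API is the cheaper option on cost alone, and the decision has to rest on one of the three reasons above instead.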
The realistic stack
A realistic private-AI stack for an APAC mid-size org looks roughly like this:
  ┌──────────────────────────────────────────┐
  │ Internal users (SSO via Okta / Entra ID) │
  └─────────────────────┬────────────────────┘
                        │
                        ▼
         ┌─────────────────────────────┐
         │ LibreChat or OpenWebUI      │ ← chat UI, audit log
         │ on ECS/Fargate behind ALB   │
         └──────────────┬──────────────┘
                        │
       ┌────────────────┼──────────────────┐
       ▼                ▼                  ▼
┌──────────────┐ ┌─────────────┐ ┌────────────────────┐
│ Ollama on    │ │ Bedrock     │ │ Vector DB          │
│ GPU EC2      │ │ (managed)   │ │ (pgvector / Qdrant)│
└──────────────┘ └─────────────┘ └────────────────────┘
Concrete choices that have held up well:
- Chat surface: LibreChat (most flexible, plugin model) or OpenWebUI (tighter UX out of the box). Both deploy as containers; we run them on ECS/Fargate behind an ALB with SSO integration.
- Inference: Ollama on a single g6 or g5 GPU instance for genuinely private workloads, or Amazon Bedrock when "on AWS" is enough and you want managed scaling.
- Embeddings + retrieval: pgvector on an existing RDS for teams that already run Postgres; Qdrant on Fargate when you want isolation.
- Logging and audit: structured logs to CloudWatch, with a "last 90 days, who asked what" dashboard the security team can read.
- SSO: Okta or Entra ID via OIDC. We turn on group-based access from day one — it is much harder to retrofit.
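As an illustration of the logging choice above, here is a rough sketch of a structured audit record, assuming the chat service writes one JSON object per line to stdout and the container's log driver forwards it to CloudWatch. The field names, the audit_record helper, and the example model name are all hypothetical; neither LibreChat nor OpenWebUI is claimed to emit this format.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List, Optional


def audit_record(user: str, groups: List[str], backend: str, model: str,
                 prompt: str, now: Optional[datetime] = None) -> str:
    """Build one JSON log line answering "who asked what, when, and where it went"."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "ts": ts,                  # UTC timestamp in ISO 8601
        "event": "chat.prompt",    # hypothetical event name
        "user": user,              # identity asserted by the SSO provider
        "groups": sorted(groups),  # SSO groups at request time
        "backend": backend,        # e.g. "ollama" or "bedrock"
        "model": model,
        "prompt": prompt,          # drop this field if policy forbids storing prompts
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    # json.dumps escapes newlines, so each record stays on a single line.
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(audit_record("alice@example.com", ["eng", "ai-pilot"],
                       "ollama", "example-model", "Summarise the Q3 incident report"))
```

Keeping the record flat and keyed consistently is what makes a "last 90 days" query simple to write. The 90-day window itself is probably better enforced as a retention setting on the log group than in application code.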
The two failure modes that kill these projects
We have unwound enough of these to see the same two deaths.
Failure mode 1: nobody owns the GPU
Self-hosted GPU inference has a non-trivial operational footprint: CUDA drivers, model upgrades, GPU memory accounting, scaling policies, and cost management. If the project ships and nobody owns the day-2 work, you have a $5k/month idle GPU with a model from six months ago that nobody trusts anymore.
Antidote: write the runbook before you launch. Name the on-call roster. Decide what "model upgrade" looks like and who approves it. We require this in the SOW before we start.
Failure mode 2: the chat UI is not actually used
The frontier model your users compare against is roughly a year ahead of what you can self-host. If your private chat UI is slower, has a worse model, no web access, no document attachments, and is not in the place developers already work, they will stay in the public tool, and your data exposure has not actually decreased.
Antidote: be honest about what the private UI is for. Often the right answer is not "replace ChatGPT for everyone" but "stand up a regulated channel for the work that genuinely cannot use the public tool", with SSO-gated access and clear policy on what flows where.
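One way to make "what flows where" concrete is to write the policy down as data. The sketch below assumes three data classifications and one SSO group name, all hypothetical; the mapping of restricted work to self-hosted Ollama and internal work to Bedrock mirrors the stack described earlier, but your own policy may draw the lines differently.

```python
from enum import Enum
from typing import Set


class DataClass(Enum):
    PUBLIC = "public"          # nothing sensitive
    INTERNAL = "internal"      # may stay on AWS managed services
    RESTRICTED = "restricted"  # contractually cannot leave your boundary


class Channel(Enum):
    PUBLIC_TOOL = "public-tool"  # the public chat tool people already use
    BEDROCK = "bedrock"          # managed inference on AWS
    OLLAMA = "ollama"            # self-hosted inference inside the VPC


# Hypothetical policy table: which channel each data class is allowed to use.
POLICY = {
    DataClass.PUBLIC: Channel.PUBLIC_TOOL,
    DataClass.INTERNAL: Channel.BEDROCK,
    DataClass.RESTRICTED: Channel.OLLAMA,
}

# Hypothetical SSO group required before a data class can be handled at all.
REQUIRED_GROUP = {DataClass.RESTRICTED: "regulated-ai-users"}


def route(data_class: DataClass, user_groups: Set[str]) -> Channel:
    """Return the permitted channel, or raise if the user lacks the required group."""
    required = REQUIRED_GROUP.get(data_class)
    if required is not None and required not in user_groups:
        raise PermissionError(f"{data_class.value} data requires group {required!r}")
    return POLICY[data_class]


if __name__ == "__main__":
    print(route(DataClass.RESTRICTED, {"regulated-ai-users"}))
    print(route(DataClass.INTERNAL, {"eng"}))
```

A table like this is short enough for a security team to review line by line, which is arguably the main point of writing it as data rather than burying it in application logic.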
What we ship in two weeks
Our Private AI Starter package deploys this shape inside your AWS account in two weeks: chat UI, inference (Ollama or Bedrock), SSO, structured logging, and a written runbook. That is enough for a credible internal pilot. It is not enough for a 1,000-person rollout; that engagement needs evals, fine-tuning decisions, prompt governance, and a 24x7 support model.
If you are weighing private AI right now, the cheapest first step is a 1-hour conversation about what your data actually requires. We say "you do not need this yet" more often than we say "yes, ship it".