AI, AWS, Security

Running an LLM inside your VPC without making it your full-time job

When private AI actually pays for itself, what the realistic stack looks like (LibreChat / OpenWebUI / Ollama on AWS), and the two failure modes that kill these projects.

2026-05-08

Almost every team we talk to about "private AI" is asking the same underlying question: can we get the productivity of ChatGPT without sending our data to OpenAI? The honest answer is: mostly, with caveats. The cost-benefit only swings clearly in your favour for a small set of situations.

This post is the version we wish more buyers heard before they bought a $300k self-hosted LLM contract.

When private AI actually pays for itself

Three honest reasons to run inference inside your own VPC:

Data sensitivity that contractually cannot leave your boundary. Healthcare, financial services, classified-adjacent work, regulated personal data with strong residency clauses.
Concentration risk reduction. You already pay six figures a year to one model vendor and your CTO does not want a single-vendor dependency on the application's critical path.
Workloads dominated by structured retrieval. RAG over your internal docs where the model is mostly stitching context together, not doing frontier reasoning. Smaller open-weight models become genuinely competitive here.

Reasons that sound like good ones but usually are not:

"We want to save money on the API bill." For most teams the OpenAI / Anthropic bill is well below the fully-loaded cost of running and operating GPU inference yourself. Run the maths before you commit.
"Our developers will be more productive with their own LLM." They will be less productive than they would be on a frontier model, and that productivity gap is real.
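
A minimal sketch of what "run the maths" can look like. Every figure in the usage example is a placeholder we made up for illustration, not a quote or a benchmark; the class and function names are hypothetical too. Substitute your own instance pricing, staffing cost, and API bill.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class SelfHostedCost:
    """Fully-loaded monthly cost of running GPU inference yourself."""
    gpu_hourly_usd: float           # price per GPU instance-hour (assumed)
    gpu_count: int                  # instances kept running 24x7
    ops_fte_fraction: float         # share of one engineer spent on day-2 work
    fte_monthly_usd: float          # fully-loaded monthly cost of that engineer
    other_monthly_usd: float = 0.0  # storage, networking, logging, and so on

    def monthly_total(self) -> float:
        compute = self.gpu_hourly_usd * HOURS_PER_MONTH * self.gpu_count
        people = self.ops_fte_fraction * self.fte_monthly_usd
        return compute + people + self.other_monthly_usd


def self_hosting_saves_money(api_monthly_usd: float, cost: SelfHostedCost) -> bool:
    """True only if the current API bill exceeds the fully-loaded self-hosted cost."""
    return api_monthly_usd > cost.monthly_total()


# Placeholder inputs: one GPU instance, a quarter of an engineer's time.
example = SelfHostedCost(
    gpu_hourly_usd=2.0,
    gpu_count=1,
    ops_fte_fraction=0.25,
    fte_monthly_usd=16_000,
    other_monthly_usd=300,
)
print(example.monthly_total())
print(self_hosting_saves_money(3_000, example))
```

The point of the sketch is that the people line is part of the total. In the placeholder case above it is larger than the compute line, which is the term teams most often leave out.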

The realistic stack

A realistic private-AI stack for an APAC mid-size org looks roughly like this:

                    ┌────────────────────────────────────────┐
                    │ Internal users (SSO via Okta / EntraID) │
                    └─────────────┬──────────────────────────┘
                                  │
                                  ▼
                    ┌─────────────────────────────┐
                    │  LibreChat or OpenWebUI     │  ← chat UI, audit log
                    │  on ECS/Fargate behind ALB  │
                    └─────────────┬───────────────┘
                                  │
                  ┌───────────────┼─────────────────┐
                  ▼               ▼                 ▼
          ┌──────────────┐ ┌─────────────┐ ┌────────────────────┐
          │ Ollama on    │ │ Bedrock     │ │ Vector DB          │
          │ GPU EC2      │ │ (managed)   │ │ (pgvector / Qdrant)│
          └──────────────┘ └─────────────┘ └────────────────────┘

Concrete choices that have held up well:

Chat surface: LibreChat (most flexible, plugin model) or OpenWebUI (tighter UX out of the box). Both deploy as containers; we run them on ECS/Fargate behind an ALB with SSO integration.
Inference: Ollama on a single g6 or g5 GPU instance for genuinely private workloads, or AWS Bedrock when "on AWS" is enough and you want managed scaling.
Embeddings + retrieval: pgvector on an existing RDS for teams that already run Postgres; Qdrant on Fargate when you want isolation.
Logging and audit: structured logs to CloudWatch, with a "last 90 days, who asked what" dashboard the security team can read.
SSO: Okta or Entra ID via OIDC. We turn on group-based access from day one — it is much harder to retrofit.
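
To make the logging and group-based access points concrete, here is a rough sketch of a group check plus a "who asked what" audit record. It assumes the identity provider is configured to send a `groups` claim and an `email` claim in the OIDC token (claim names vary by IdP setup), and that the container's stdout is shipped to CloudWatch. The group name, field names, and model name are hypothetical; neither LibreChat nor OpenWebUI is claimed to work this way internally.

```python
import json
from datetime import datetime, timezone

# Hypothetical IdP group that is allowed to use the private chat UI.
ALLOWED_GROUPS = {"ai-pilot-users"}


def is_authorized(claims: dict) -> bool:
    """Allow the request only if the user belongs to an allowed group."""
    groups = claims.get("groups", [])
    return bool(ALLOWED_GROUPS.intersection(groups))


def audit_record(claims: dict, model: str, prompt: str, allowed: bool) -> str:
    """Build one JSON line answering 'who asked what, and was it allowed'."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": claims.get("email", "unknown"),
        "groups": sorted(claims.get("groups", [])),
        "model": model,
        "prompt": prompt,
        "prompt_chars": len(prompt),
        "allowed": allowed,
    }
    return json.dumps(record, sort_keys=True)


# Example: claims as they might look after the OIDC token has been validated.
claims = {"email": "analyst@example.com", "groups": ["ai-pilot-users", "finance"]}
allowed = is_authorized(claims)
print(audit_record(claims, "example-model", "Summarise the Q3 risk memo", allowed))
```

One JSON object per line keeps the records easy to filter by user or time range when you build the dashboard. Whether to store the full prompt or only a hash or length is a policy decision for your security team, since the audit log then holds the same sensitive data as the chat.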

The two failure modes that kill these projects

We have unwound enough of these to see the same two deaths.

Failure mode 1: nobody owns the GPU

Self-hosted GPU inference has a non-trivial operational footprint: CUDA drivers, model upgrades, GPU memory accounting, scaling policies, and cost management. If the project ships and nobody owns the day-2 work, you have a $5k/month idle GPU with a model from six months ago that nobody trusts anymore.

Antidote: write the runbook before you launch. Name the on-call roster. Decide what "model upgrade" looks like and who approves it. We require this in the SOW before we start.
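
One runbook item can be expressed as a scheduled check. The sketch below flags the two symptoms described above: a stale model and an idle GPU. The thresholds (180 days, 10% average utilisation) are assumptions chosen for illustration, and the function name is hypothetical. Where the utilisation number comes from is up to your monitoring setup.

```python
from datetime import date


def day2_findings(
    model_deployed: date,
    today: date,
    avg_gpu_util_pct: float,
    max_model_age_days: int = 180,  # assumed threshold, roughly six months
    min_util_pct: float = 10.0,     # assumed threshold for "idle"
) -> list[str]:
    """Return human-readable findings for the owner named in the runbook."""
    findings = []
    age_days = (today - model_deployed).days
    if age_days > max_model_age_days:
        findings.append(f"model is {age_days} days old; schedule an upgrade review")
    if avg_gpu_util_pct < min_util_pct:
        findings.append(
            f"average GPU utilisation is {avg_gpu_util_pct:.1f}%; "
            "check whether the instance should be stopped or downsized"
        )
    return findings


for finding in day2_findings(date(2025, 1, 1), date(2025, 9, 1), avg_gpu_util_pct=3.0):
    print(finding)
```

The check is only useful if its output lands with a named person, which is why the on-call roster comes first.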

Failure mode 2: the chat UI is not actually used

The frontier model you compare against is a year ahead. If your private chat UI is slower, has a worse model, no web access, no document attachments, and is not in the place developers already work, they will stay in the public tool — and your data exposure has not actually decreased.

Antidote: be honest about what the private UI is for. Often the right answer is not "replace ChatGPT for everyone" but "stand up a regulated channel for the work that genuinely cannot use the public tool", with SSO-gated access and clear policy on what flows where.
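
A "clear policy on what flows where" can be small enough to fit in a lookup table. The sketch below is one hypothetical shape: the sensitivity levels, channel names, and the mapping itself are illustrative assumptions, and your own policy will differ.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    REGULATED = "regulated"


class Channel(Enum):
    PUBLIC_TOOL = "public tool"
    PRIVATE_UI = "private UI (SSO-gated)"


# Hypothetical policy: only regulated data is forced onto the private channel.
POLICY = {
    Sensitivity.PUBLIC: Channel.PUBLIC_TOOL,
    Sensitivity.INTERNAL: Channel.PUBLIC_TOOL,
    Sensitivity.REGULATED: Channel.PRIVATE_UI,
}


def channel_for(sensitivity: Sensitivity) -> Channel:
    """Look up which channel a piece of work is expected to use."""
    return POLICY[sensitivity]


print(channel_for(Sensitivity.REGULATED).value)
```

Writing the policy down this explicitly makes it obvious how narrow the private channel's job is, which helps set expectations for adoption.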

What we ship in two weeks

Our Private AI Starter package deploys this shape inside your AWS account in two weeks: chat UI, inference (Ollama or Bedrock), SSO, structured logging, and a written runbook. That is enough for a credible internal pilot. It is not enough for a 1,000-person rollout. That engagement needs evals, fine-tuning decisions, prompt governance, and a 24x7 support model.

If you are weighing private AI right now, the cheapest first step is a 1-hour conversation about what your data actually requires. We say "you do not need this yet" more often than we say "yes, ship it".
