The missing deployment gate for AI applications

Most AI prototypes fail in production for boring reasons.

The model is not always the problem. The demo often works. The prompt looks reasonable. The RAG answer is good on the three examples in the notebook. The service returns 200. The dashboard is green.

Then it meets real traffic:

a prompt change breaks an important case
retrieval silently gets worse
token usage doubles
one route starts using the expensive model
traces do not contain the prompt version or model name
the service is up, but users wait too long for the first token
nobody knows how to roll back the prompt, model, retrieval config, or provider

That is not a research problem. It is a production readiness problem.

Software already learned this lesson

Normal software deployments have a mature release discipline.

Before traffic moves, teams run tests, linters, type checks, security scans, smoke tests, health checks, canaries, policy checks, and SLO-based rollout decisions. Kubernetes has readiness probes. CI systems have required checks. Progressive delivery tools can stop a rollout when the new version behaves worse than the old one.

Classical ML learned a related lesson. Mature MLOps stacks have model registries, data validation, model validation, baseline comparison, drift monitoring, model quality monitoring, and approval or blessing workflows.

So the idea of a deployment gate is not new.

The missing part is applying that discipline cleanly to modern AI applications: LLM API calls, RAG systems, prompt changes, model/provider routing, tool calls, token budgets, and AI-specific observability.

The AI-specific gap

An AI feature can pass normal deployment checks and still be unsafe to ship.

The container starts. The HTTP endpoint responds. CPU and memory look fine. The Kubernetes readiness probe is green. The unit tests pass.

But none of that answers:

Did answer quality regress?
Did retrieval quality regress?
Did the prompt change break important examples?
Did the model or provider change behavior?
Is cost per request still inside budget?
Are model, prompt version, token count, latency, cost, request ID, and error type observable?
Can we roll back the model, prompt, retrieval config, or provider route?

That is the gap I care about.
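The observability question is the easiest one to make mechanical. A minimal sketch, assuming each request emits one trace record as a dictionary; the field names below are hypothetical, chosen to mirror the list above, and are not a schema from any particular tracing backend:

```python
# Hypothetical field names mirroring the observability question above.
REQUIRED_FIELDS = (
    "model",
    "prompt_version",
    "token_count",
    "latency_ms",
    "cost_usd",
    "request_id",
    "error_type",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent from one trace record.

    A field counts as present when the key exists, even if the value is
    None (for example, error_type on a successful request).
    """
    return [name for name in REQUIRED_FIELDS if name not in record]


# Illustrative record: everything is there except the prompt version.
trace = {
    "request_id": "req-123",
    "model": "example-model",
    "token_count": 812,
    "latency_ms": 1430,
    "cost_usd": 0.0021,
    "error_type": None,
}
print(missing_fields(trace))  # ['prompt_version']
```

A check like this can run against sampled traces from staging, so a missing prompt version is caught before an incident rather than during one.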

The preflight question

The useful question is not:

Is the service up?

The useful question is:

Is this AI change ready for production traffic?

That question needs one report, not five disconnected dashboards.

Can we ship this AI change?

quality/evals        PASS or FAIL
RAG behavior         PASS or FAIL
latency/TTFT         PASS or FAIL
error rate           PASS or FAIL
cost budget          PASS or FAIL
observability        PASS or WARN or FAIL
rollback/runbook     PASS or WARN or FAIL
overall verdict      PASS or FAIL

This is the layer I am building with aipreflight.
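The roll-up from per-check statuses to one verdict can be sketched in a few lines. This is an illustration, not aipreflight's actual implementation: the Status enum, the overall_verdict function, and the strict flag are hypothetical names, and the rule that WARN blocks only in strict mode is an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    PASS = 0
    WARN = 1
    FAIL = 2


def overall_verdict(checks: dict[str, Status], strict: bool = False) -> Status:
    """Collapse per-check statuses into a single PASS or FAIL.

    Any FAIL blocks the release. WARN blocks only when strict is set
    (an assumption made for this sketch).
    """
    # With no checks at all, fail closed: nothing proved the change is ready.
    worst = max(checks.values(), default=Status.FAIL)
    if worst is Status.FAIL:
        return Status.FAIL
    if worst is Status.WARN and strict:
        return Status.FAIL
    return Status.PASS


checks = {
    "quality/evals": Status.PASS,
    "RAG behavior": Status.PASS,
    "latency/TTFT": Status.PASS,
    "error rate": Status.PASS,
    "cost budget": Status.PASS,
    "observability": Status.WARN,
    "rollback/runbook": Status.PASS,
}
print(overall_verdict(checks).name)               # PASS
print(overall_verdict(checks, strict=True).name)  # FAIL
```

Failing closed on an empty check set is deliberate: a gate that passes when nothing ran is not a gate.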

What aipreflight does

aipreflight is a CI/CD readiness gate for AI applications. It checks eval quality, LLM/RAG behavior, latency, errors, cost budgets, observability, and rollout readiness before traffic is routed to the change.

The command is intentionally small:

aipreflight check --profile profiles/app.yml
aipreflight check --profile profiles/rag.yml
aipreflight check --profile profiles/inference.yml

The output is intentionally blunt:

Verdict: PASS
  cost           PASS  $7.69/mo across 1 call site(s), within budget
  evals          PASS  quality gate passed: pass rate 100% (min 90%)
  observability  PASS  telemetry config present with required fields
  deployment     PASS  rollback runbook present

It writes a machine-readable JSON report for CI and a Markdown report for humans. The important part is the verdict. A deployment gate must be able to block a release without a human interpreting graphs.
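In CI, "block a release" comes down to an exit code. A minimal sketch of that last step, assuming a report with a top-level "verdict" key; that key and the function name are hypothetical and do not describe aipreflight's real report schema:

```python
import json


def exit_code_from_report(report_json: str) -> int:
    """Map a JSON report to a process exit code: 0 ships, 1 blocks.

    Fails closed: an unreadable report or a missing verdict blocks the release.
    """
    try:
        report = json.loads(report_json)
    except json.JSONDecodeError:
        return 1
    if not isinstance(report, dict):
        return 1
    return 0 if report.get("verdict") == "PASS" else 1


# Hypothetical report contents, for illustration only.
sample = json.dumps({"verdict": "FAIL", "checks": {"evals": "FAIL", "cost": "PASS"}})
print(exit_code_from_report(sample))  # 1
```

A pipeline step would pass the result to sys.exit, so the required check goes red without anyone opening a dashboard.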

The profiles

aipreflight has three profiles because “AI app” is not one thing.

The app profile is for hosted-API applications. These teams may not run their own inference infrastructure, but they still need cost budgets, evals, observability fields, and rollback documentation.
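The cost budget part is plain arithmetic once token counts and traffic are estimated. A rough sketch under stated assumptions: the CallSite type, the traffic numbers, and the per-million-token prices are all made up for illustration, and this is not how tokentoll computes its figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSite:
    name: str
    requests_per_month: int
    input_tokens: int           # average prompt tokens per request
    output_tokens: int          # average completion tokens per request
    input_usd_per_mtok: float   # price per million input tokens
    output_usd_per_mtok: float  # price per million output tokens

    def monthly_cost_usd(self) -> float:
        per_request = (
            self.input_tokens * self.input_usd_per_mtok
            + self.output_tokens * self.output_usd_per_mtok
        ) / 1_000_000
        return per_request * self.requests_per_month


def within_budget(sites: list[CallSite], budget_usd: float) -> tuple[bool, float]:
    """Return (ok, total) for the summed monthly cost of all call sites."""
    total = sum(site.monthly_cost_usd() for site in sites)
    return total <= budget_usd, total


# Made-up numbers: 10,000 requests/month at $0.50 and $1.50 per million tokens.
sites = [CallSite("summarize", 10_000, 1_500, 300, 0.50, 1.50)]
ok, total = within_budget(sites, budget_usd=20.0)
print(f"${total:.2f}/mo across {len(sites)} call site(s), within budget: {ok}")
```

With these inputs each request costs $0.0012, so the call site comes to about $12 a month against a $20 budget.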

The rag profile checks the failure modes that infrastructure checks miss: retrieval precision, answer quality, citation rate, hallucination rate, empty retrieval handling, cost, and telemetry.
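Two of those metrics are simple enough to sketch directly. The definitions here are assumptions made for illustration (precision as the share of retrieved documents that are labeled relevant, citation rate as the share of answers carrying at least one citation); an eval runner such as ragas may define them differently.

```python
def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved documents that are labeled relevant.

    An empty retrieval scores 0.0 rather than raising, so it shows up as a
    failed example instead of a crashed eval run.
    """
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def citation_rate(answers: list[dict]) -> float:
    """Fraction of answers that carry at least one citation."""
    if not answers:
        return 0.0
    cited = sum(1 for answer in answers if answer.get("citations"))
    return cited / len(answers)


print(retrieval_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}))  # 0.5

answers = [
    {"text": "...", "citations": ["d1"]},
    {"text": "...", "citations": ["d3", "d4"]},
    {"text": "...", "citations": []},
    {"text": "...", "citations": ["d2"]},
]
print(citation_rate(answers))  # 0.75
```

The gate then compares each number against a threshold from the profile, the same way the eval pass rate is compared against its minimum.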

The inference profile is for self-hosted or OpenAI-compatible inference endpoints. It uses llmprobe for client-side TTFT, latency, throughput, and error probes, then optionally correlates those results with Prometheus/vLLM metrics.
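Client-side TTFT is the time from sending the request to receiving the first non-empty chunk of a streamed response. A minimal sketch of that measurement, not llmprobe's implementation: measure_stream and fake_stream are hypothetical names, and the stream is assumed to be a lazy iterable that only starts the request when iteration begins.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a streamed response and record TTFT and total latency."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for chunk in chunks:
        # The first non-empty chunk marks time to first token.
        if first is None and chunk:
            first = clock()
        count += 1
    end = clock()
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": end - start,
        "chunks": count,
    }


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming client: slow first token, then fast ones."""
    time.sleep(0.05)
    yield "Hello"
    for word in (" from", " a", " fake", " model"):
        time.sleep(0.01)
        yield word


print(measure_stream(fake_stream()))
```

The injectable clock is there for testing; a real probe would wrap the provider's streaming iterator and run many samples to report percentiles rather than a single number.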

Why llmprobe and tokentoll stay separate

I do not want aipreflight to become a giant platform.

The sharper design is:

aipreflight  -> the release decision
llmprobe     -> external user-path probes
tokentoll    -> token and cost budget checks
Prometheus   -> service and inference telemetry
eval runner  -> quality results from pytest, promptfoo, ragas, or custom code

llmprobe answers:

What does the user path actually experience?

tokentoll answers:

What will this code cost when it runs?

aipreflight answers:

Given all of those signals, should this AI change ship?

That separation matters. The value is not in replacing every specialized tool. It is in turning their signals into one release decision.

What this does not replace

This is not a model registry. It is not a vector database. It is not an LLM gateway. It is not a tracing backend. It is not Grafana. It is not Kubernetes. It is not a complete MLOps platform.

Those tools should continue to do what they are good at.

aipreflight sits at the release boundary. It asks the production question before a prompt, model, RAG pipeline, provider route, or inference endpoint gets traffic.

The prototype-to-production pattern

The pattern I want to make repeatable is:

prototype
  -> service
  -> eval suite
  -> cost budget
  -> observability contract
  -> deployment gate
  -> rollback/runbook
  -> production traffic

That is the difference between “we have an AI demo” and “we can operate this AI feature”.

The technical work is not glamorous. It is often YAML, small adapters, fixtures, exit codes, reports, and boring checks. But those are exactly the pieces that turn AI systems from impressive prototypes into software a team can own.

The positioning

The claim is not:

I invented deployment readiness.

The claim is:

Normal software has mature deployment gates. Classical ML has model validation and model blessing. Modern AI applications need equivalent preflight checks for eval quality, RAG behavior, token cost, provider/model behavior, observability, and rollback readiness.

That is the space I want to work in: AI platform and reliability engineering for production AI systems.

The model matters. But after the prototype works, the production layer decides whether the system survives contact with users.