The missing deployment gate for AI applications
Normal software has CI gates, smoke tests, canaries, and SLOs. AI apps need the same discipline for eval quality, token cost, LLM/RAG behavior, observability, and rollback readiness.
Most AI prototypes fail in production for boring reasons.
The model is not always the problem. The demo often works. The prompt looks reasonable. The RAG answer is good on the three examples in the notebook. The service returns 200. The dashboard is green.
Then it meets real traffic:
- a prompt change breaks an important case
- retrieval silently gets worse
- token usage doubles
- one route starts using the expensive model
- traces do not contain the prompt version or model name
- the service is up, but users wait too long for the first token
- nobody knows how to roll back the prompt, model, retrieval config, or provider
That is not a research problem. It is a production readiness problem.
Software already learned this lesson
Normal software deployments have a mature release discipline.
Before traffic moves, teams run tests, linters, type checks, security scans, smoke tests, health checks, canaries, policy checks, and SLO-based rollout decisions. Kubernetes has readiness probes. CI systems have required checks. Progressive delivery tools can stop a rollout when the new version behaves worse than the old one.
Classical ML learned a related lesson. Mature MLOps stacks have model registries, data validation, model validation, baseline comparison, drift monitoring, model quality monitoring, and approval or blessing workflows.
So the idea of a deployment gate is not new.
The missing part is applying that discipline cleanly to modern AI applications: LLM API calls, RAG systems, prompt changes, model/provider routing, tool calls, token budgets, and AI-specific observability.
The AI-specific gap
An AI feature can pass normal deployment checks and still be unsafe to ship.
The container starts. The HTTP endpoint responds. CPU and memory look fine. The Kubernetes readiness probe is green. The unit tests pass.
But none of that answers:
- Did answer quality regress?
- Did retrieval quality regress?
- Did the prompt change break important examples?
- Did the model or provider change behavior?
- Is cost per request still inside budget?
- Are model, prompt version, token count, latency, cost, request ID, and error type observable?
- Can we roll back the model, prompt, retrieval config, or provider route?
That is the gap I care about.
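The observability question in that list is the easiest one to make mechanical. Below is a minimal sketch of a telemetry contract check, assuming each trace or log record is available as a flat dictionary. The field names are illustrative spellings of the fields listed above, not a schema that aipreflight or any tracing backend defines.

```python
# Hypothetical field names mirroring the observability question above.
REQUIRED_FIELDS = (
    "model",
    "prompt_version",
    "token_count",
    "latency_ms",
    "cost_usd",
    "request_id",
    "error_type",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent from one trace record.

    Only key presence is checked: error_type may legitimately be None
    on a successful request, but the key should still be emitted.
    """
    return [name for name in REQUIRED_FIELDS if name not in record]


# Example record that forgot to log the prompt version and the cost.
record = {
    "model": "example-model",
    "token_count": 812,
    "latency_ms": 1430,
    "request_id": "req-123",
    "error_type": None,
}
print(missing_fields(record))
```

A gate could run this over a sample of recent traces and fail when any record comes back with a non-empty list.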
The preflight question
The useful question is not:
Is the service up?
The useful question is:
Is this AI change ready for production traffic?
That question needs one report, not five disconnected dashboards.
Can we ship this AI change?
quality/evals     PASS or FAIL
RAG behavior      PASS or FAIL
latency/TTFT      PASS or FAIL
error rate        PASS or FAIL
cost budget       PASS or FAIL
observability     PASS, WARN, or FAIL
rollback/runbook  PASS, WARN, or FAIL
overall verdict   PASS or FAIL
This is the layer I am building with aipreflight.
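One way to read that report is as a simple reduction: every check yields a status, and the overall verdict is derived from them. The sketch below assumes that any FAIL blocks the release, that WARN is reported but does not block, and that an empty report fails closed. Those rules are assumptions made for illustration, not a description of how aipreflight actually aggregates.

```python
from enum import Enum


class Status(str, Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


def overall_verdict(checks: dict[str, Status]) -> Status:
    """Collapse per-check statuses into a single PASS or FAIL verdict."""
    if not checks:
        # No evidence at all is treated as a failure (fail closed).
        return Status.FAIL
    if any(status is Status.FAIL for status in checks.values()):
        return Status.FAIL
    # Assumption: WARN is surfaced in the report but does not block.
    return Status.PASS


checks = {
    "quality/evals": Status.PASS,
    "cost budget": Status.PASS,
    "observability": Status.WARN,
    "rollback/runbook": Status.PASS,
}
print(overall_verdict(checks).value)
```

Keeping the rule this small is deliberate: a verdict that needs interpretation is a dashboard, not a gate.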
What aipreflight does
aipreflight is a CI/CD readiness gate for AI applications. It checks eval
quality, LLM/RAG behavior, latency, errors, cost budgets, observability, and
rollout readiness before traffic is routed.
The command is intentionally small:
aipreflight check --profile profiles/app.yml
aipreflight check --profile profiles/rag.yml
aipreflight check --profile profiles/inference.yml
The output is intentionally blunt:
Verdict: PASS
cost           PASS  $7.69/mo across 1 call site(s), within budget
evals          PASS  quality gate passed: pass rate 100% (min 90%)
observability  PASS  telemetry config present with required fields
deployment     PASS  rollback runbook present
It writes a machine-readable JSON report for CI and a Markdown report for humans. The important part is the verdict. A deployment gate must be able to block a release without a human interpreting graphs.
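To illustrate what "block a release without a human" means in practice, here is a minimal sketch of a CI step that turns a JSON report into an exit code. The report schema is not shown in this post, so the top-level "verdict" key and the exit code values are assumptions made for the example.

```python
import json


def exit_code_for(report_text: str) -> int:
    """Map a JSON report to a process exit code.

    0 = ship, 1 = blocked by the gate, 2 = report unreadable.
    The "verdict" key is a hypothetical field name.
    """
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return 2
    if not isinstance(report, dict):
        return 2
    # Anything other than an explicit PASS blocks the release.
    return 0 if report.get("verdict") == "PASS" else 1


print(exit_code_for('{"verdict": "PASS"}'))
print(exit_code_for('{"verdict": "FAIL"}'))
print(exit_code_for("not json"))
```

In a pipeline, the result would be passed to `sys.exit`, so a missing or malformed report stops the rollout just as a FAIL does.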
The profiles
aipreflight has three profiles because “AI app” is not one thing.
The app profile is for hosted-API applications. These teams may not run their
own inference infrastructure, but they still need cost budgets, evals,
observability fields, and rollback documentation.
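A cost budget for a hosted-API call site comes down to simple arithmetic over expected traffic and token counts. The sketch below is a back-of-the-envelope version with made-up prices and volumes. It is not how tokentoll or aipreflight computes cost, and real per-token prices should come from the provider's current price list.

```python
def monthly_cost_usd(
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate monthly spend for one call site from average token counts."""
    input_cost = requests_per_month * input_tokens * usd_per_million_input
    output_cost = requests_per_month * output_tokens * usd_per_million_output
    return (input_cost + output_cost) / 1_000_000


def within_budget(cost_usd: float, budget_usd: float) -> bool:
    """True when the estimated cost does not exceed the budget."""
    return cost_usd <= budget_usd


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
cost = monthly_cost_usd(
    requests_per_month=10_000,
    input_tokens=1_000,
    output_tokens=500,
    usd_per_million_input=1.0,
    usd_per_million_output=2.0,
)
print(f"${cost:.2f}/mo, within budget: {within_budget(cost, budget_usd=25.0)}")
```

The useful property is that the estimate changes when the code changes: a longer prompt or a more expensive model moves the number before any traffic does.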
The rag profile checks the failure modes that infrastructure checks miss:
retrieval precision, answer quality, citation rate, hallucination rate, empty
retrieval handling, cost, and telemetry.
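Two of those metrics are easy to compute from a small fixture set. The sketch below shows one common definition of retrieval precision and a simple citation rate, assuming each eval case records the retrieved document IDs, the IDs labeled relevant, and the citations attached to the answer. The data shapes are hypothetical, and an eval framework may define these metrics differently.

```python
def retrieval_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are labeled relevant.

    An empty retrieval scores 0.0 instead of raising, so the
    empty-retrieval case shows up as a failing number in the report.
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def citation_rate(answers: list[dict]) -> float:
    """Fraction of answers that cite at least one source."""
    if not answers:
        return 0.0
    cited = sum(1 for answer in answers if answer.get("citations"))
    return cited / len(answers)


print(retrieval_precision(["a", "b", "c", "d"], {"a", "c"}))
print(citation_rate([
    {"text": "answer one", "citations": ["a"]},
    {"text": "answer two", "citations": []},
]))
```

A gate then compares each number against a minimum from the profile and fails the check when it drops below it.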
The inference profile is for self-hosted or OpenAI-compatible inference
endpoints. It uses llmprobe for client-side TTFT, latency, throughput, and
error probes, then optionally correlates them with Prometheus/vLLM metrics.
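Client-side TTFT is simply the time between starting to consume a streamed response and seeing the first non-empty chunk. The sketch below measures it over any iterable of text chunks. It is a generic illustration, not llmprobe's implementation, and `fake_stream` is a stand-in for a real streaming client.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and record time to first token and total time."""
    start = clock()  # started just before the stream is consumed
    ttft: Optional[float] = None
    count = 0
    for chunk in chunks:
        if ttft is None and chunk:
            # The first non-empty chunk marks the first token.
            ttft = clock() - start
        count += 1
    total = clock() - start
    return {"ttft_s": ttft, "total_s": total, "chunks": count}


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response: slow first token, fast follow-ups."""
    time.sleep(0.05)
    yield "Hello"
    for word in (" from", " a", " stream"):
        time.sleep(0.01)
        yield word


result = measure_stream(fake_stream())
print(f"TTFT {result['ttft_s']:.3f}s, total {result['total_s']:.3f}s")
```

Because the measurement runs on the client side, it includes network and queueing time that server-side metrics alone do not show, which is why the two views are worth correlating.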
Why llmprobe and tokentoll stay separate
I do not want aipreflight to become a giant platform.
The sharper design is:
aipreflight -> the release decision
llmprobe -> external user-path probes
tokentoll -> token and cost budget checks
Prometheus -> service and inference telemetry
eval runner -> quality results from pytest, promptfoo, ragas, or custom code
llmprobe answers:
What does the user path actually experience?
tokentoll answers:
What will this code cost when it runs?
aipreflight answers:
Given all of those signals, should this AI change ship?
That separation matters. The value is not replacing every specialized tool. The value is turning the signals into one release decision.
What this does not replace
This is not a model registry. It is not a vector database. It is not an LLM gateway. It is not a tracing backend. It is not Grafana. It is not Kubernetes. It is not a complete MLOps platform.
Those tools should continue to do what they are good at.
aipreflight sits at the release boundary. It asks the production question
before a prompt, model, RAG pipeline, provider route, or inference endpoint gets
traffic.
The prototype-to-production pattern
The pattern I want to make repeatable is:
prototype
-> service
-> eval suite
-> cost budget
-> observability contract
-> deployment gate
-> rollback/runbook
-> production traffic
That is the difference between “we have an AI demo” and “we can operate this AI feature”.
The technical work is not glamorous. It is often YAML, small adapters, fixtures, exit codes, reports, and boring checks. But those are exactly the pieces that turn AI systems from impressive prototypes into software a team can own.
The positioning
The claim is not:
I invented deployment readiness.
The claim is:
Normal software has mature deployment gates. Classical ML has model validation and model blessing. Modern AI applications need equivalent preflight checks for eval quality, RAG behavior, token cost, provider/model behavior, observability, and rollback readiness.
That is the space I want to work in: AI platform and reliability engineering for production AI systems.
The model matters. But after the prototype works, the production layer decides whether the system survives contact with users.