What I learned contributing to llm-d, a production inference router

Most LLM infrastructure writing is about the model. Very little is about the layer that sits in front of it.

That layer is where production inference actually lives. Requests arrive, get queued, get prioritized, and get routed to a GPU backend that may already be saturated. The model is the easy part. Deciding which request gets served, in what order, under what memory budget, is the hard part.

For the last two months I have been contributing to llm-d, the inference routing layer maintained by engineers from Red Hat, IBM, and Google. This is what I learned doing it.

What llm-d is

llm-d is a smart load balancer for LLM inference on Kubernetes.

It sits between users and the GPU servers running model-serving engines such as vLLM. For every incoming request it decides which backend should handle it, based on factors like GPU memory pressure and cached prefixes. It also has a flow-control layer that queues and prioritizes requests; each priority level gets its own band with its own memory budget.
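To make the routing decision concrete, here is a minimal sketch of scoring backends by memory pressure and prefix-cache hits. It is not llm-d's scorer: the Backend type, the prefixBonus weight, and the pickBackend function are all hypothetical names chosen for illustration, and the real system weighs more signals than these two.

```go
package main

import (
	"fmt"
	"strings"
)

// Backend is a hypothetical view of one model server as the router sees it.
type Backend struct {
	Name           string
	KVCacheUsage   float64  // fraction of GPU KV-cache memory in use, 0.0 to 1.0
	CachedPrefixes []string // prompt prefixes this backend already has cached
}

// prefixBonus is an assumed weight: how much a cache hit is worth
// relative to free memory. The value is arbitrary.
const prefixBonus = 0.5

// hasCachedPrefix reports whether the backend holds a cached prefix of the prompt.
func hasCachedPrefix(b Backend, prompt string) bool {
	for _, p := range b.CachedPrefixes {
		if p != "" && strings.HasPrefix(prompt, p) {
			return true
		}
	}
	return false
}

// score returns a higher value for a more attractive backend:
// more free memory is better, and a prefix-cache hit adds a fixed bonus.
func score(b Backend, prompt string) float64 {
	s := 1.0 - b.KVCacheUsage
	if hasCachedPrefix(b, prompt) {
		s += prefixBonus
	}
	return s
}

// pickBackend returns the highest-scoring backend, or false if there are none.
func pickBackend(backends []Backend, prompt string) (Backend, bool) {
	if len(backends) == 0 {
		return Backend{}, false
	}
	best := backends[0]
	for _, b := range backends[1:] {
		if score(b, prompt) > score(best, prompt) {
			best = b
		}
	}
	return best, true
}

func main() {
	backends := []Backend{
		{Name: "pod-a", KVCacheUsage: 0.30},
		{Name: "pod-b", KVCacheUsage: 0.60, CachedPrefixes: []string{"You are a helpful assistant."}},
	}
	chosen, _ := pickBackend(backends, "You are a helpful assistant. Summarize this text.")
	fmt.Println(chosen.Name) // pod-b: the cache hit outweighs its higher memory usage
}
```

Even in a toy like this, the trade-off is visible: the busier backend wins because reusing a cached prefix is worth more than the free memory on the idle one.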

This is not a demo. It is the kind of component that decides whether your inference cluster stays within SLA under load.

The contribution that taught me the most

The flow-control system assigns each request a numeric priority. Positive means important. Negative means sheddable.

When traffic arrives at a priority level that was not pre-configured, the system creates a band on the fly using a single default template. The problem: there was no way to give negative-priority traffic a smaller memory budget than positive-priority traffic. Both shared the same default.

So you could not express a very ordinary operational wish:

Normal traffic gets 1 GB of queue budget. Sheddable traffic gets zero, and is rejected the moment the system is under pressure.

I added a second template, DefaultNegativePriorityBand, used only when dynamically provisioning bands for priorities below zero. The runtime selects the template by the sign of the priority. If the new template is not set, negative priorities fall back to the original default, so behavior is unchanged and the change is backward compatible.
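The selection rule can be sketched in a few lines of Go. This is an illustration under assumptions, not the code that was merged: DefaultNegativePriorityBand is the name of the real field, but BandTemplate, MaxBytes, DefaultPriorityBand, and templateFor are hypothetical names, and treating a MaxBytes of zero as "no queue budget" is an assumption of this sketch.

```go
package main

import "fmt"

// BandTemplate is a hypothetical template for a dynamically provisioned band.
// Only the field this example needs is modeled.
type BandTemplate struct {
	MaxBytes uint64 // queue memory budget; 0 means "no budget" in this sketch
}

// Config is a hypothetical flow-control config.
type Config struct {
	// DefaultPriorityBand is the original single template (assumed name).
	DefaultPriorityBand BandTemplate
	// DefaultNegativePriorityBand is optional. A nil value preserves the
	// old behavior: every priority uses DefaultPriorityBand.
	DefaultNegativePriorityBand *BandTemplate
}

// templateFor picks the template for a priority that was not pre-configured.
// Only priorities strictly below zero use the negative template, and only
// when it has been set.
func (c Config) templateFor(priority int) BandTemplate {
	if priority < 0 && c.DefaultNegativePriorityBand != nil {
		return *c.DefaultNegativePriorityBand
	}
	return c.DefaultPriorityBand
}

func main() {
	cfg := Config{
		DefaultPriorityBand:         BandTemplate{MaxBytes: 1_000_000_000}, // 1 GB
		DefaultNegativePriorityBand: &BandTemplate{MaxBytes: 0},
	}
	fmt.Println(cfg.templateFor(5).MaxBytes)  // 1000000000
	fmt.Println(cfg.templateFor(-1).MaxBytes) // 0

	// An operator who never sets the new field sees the old behavior.
	legacy := Config{DefaultPriorityBand: BandTemplate{MaxBytes: 1_000_000_000}}
	fmt.Println(legacy.templateFor(-1).MaxBytes) // 1000000000
}
```

Making the new field a pointer is what keeps "unset" distinguishable from "set to zero", which matters here because zero is a meaningful budget.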

The code change is small. The interesting part was everything around it: the user-facing API config, the internal config with validation and deep-copy, the runtime provisioning logic, and nine tests covering construction, API translation, clone isolation, and dynamic provisioning for positive, negative, and fallback cases.
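Clone isolation is the least obvious item in that list, so here is a rough sketch of what it guards against. The types and the Clone method below are hypothetical stand-ins, not the project's internal config; the point is only that an optional pointer field must be copied explicitly, or the clone and the original end up sharing one template.

```go
package main

import "fmt"

// BandTemplate and Config are hypothetical stand-ins for the internal config.
type BandTemplate struct {
	MaxBytes uint64
}

type Config struct {
	DefaultPriorityBand         BandTemplate
	DefaultNegativePriorityBand *BandTemplate // optional, so it is a pointer
}

// Clone returns a deep copy. A plain struct copy would duplicate the pointer
// itself, leaving both configs pointing at the same BandTemplate.
func (c *Config) Clone() *Config {
	if c == nil {
		return nil
	}
	out := *c // copies value fields; the pointer field still aliases the original
	if c.DefaultNegativePriorityBand != nil {
		neg := *c.DefaultNegativePriorityBand
		out.DefaultNegativePriorityBand = &neg
	}
	return &out
}

func main() {
	orig := &Config{
		DefaultPriorityBand:         BandTemplate{MaxBytes: 1_000_000_000},
		DefaultNegativePriorityBand: &BandTemplate{MaxBytes: 0},
	}
	clone := orig.Clone()

	// Mutating the clone must not leak back into the original.
	clone.DefaultNegativePriorityBand.MaxBytes = 512
	fmt.Println(orig.DefaultNegativePriorityBand.MaxBytes, clone.DefaultNegativePriorityBand.MaxBytes) // 0 512
}
```

A clone-isolation test is essentially the last three lines of main turned into an assertion: mutate the copy, then check the original did not move.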

That ratio is the lesson. In production infrastructure, the three-line behavior change is the easy part. The API surface, the validation, the backward-compatibility guarantee, and the tests are the work.

The unglamorous PRs matter too

Not every contribution is a feature.

One of my first merged PRs was a CI/CD hardening pass: pinning a mutable action reference to a commit SHA, adding least-privilege permission blocks to nine workflows, adding a timeout to a test job that could otherwise hold a runner for six hours, adding concurrency groups so new pushes cancel stale runs, and adding dependency vulnerability checking.

None of that changes what the router does. All of it changes whether the project is safe and cheap to operate. Supply-chain security and CI cost are production concerns, even when they are invisible in the product.

What production inference infrastructure actually looks like

Three things stood out from the inside.

First, the priority and memory-budget logic is the product. The routing decision is where latency and cost are won or lost, not the model call itself.

Second, backward compatibility is sacred. Every change has to assume an operator is already running the old behavior in production. New fields default to the previous behavior. You earn the right to change defaults slowly.

Third, the test suite is the spec. Reviewers do not grade the feature by reading the description. They read the tests to understand what you actually guaranteed.

Why I am doing this

I build production platform and observability infrastructure for a living. Contributing to llm-d is how I work on the inference layer specifically, in a real codebase that real operators depend on, instead of in a toy.

The merged PRs are public: the flow-control priority band feature, the CI hardening pass, and a second CI reliability pass, with more open across llm-d-kv-cache and llm-d-benchmark.

The model gets the attention. The routing layer decides whether the system survives real traffic.