Building a battery analytics platform for 100k+ IoT devices, alone

A battery prediction problem does not stay a prediction problem.

It starts as a model. When does this thermostat’s battery run out? Then reality arrives. The estimate needs field data, and the field data needs a pipeline. The pipeline needs a schedule. The estimate needs a service so other systems can read it. The service needs a dashboard so humans can trust it. The dashboard needs monitoring so failures are noticed. The whole thing needs to be deployable and easy to roll back.

By the end you do not have a model. You have a platform.

This is how a capacity and runtime estimation problem for 100,000+ IoT thermostats turned into an eight-repository production system that I built and own end to end.

The shape of the system

The platform is not one service. It is a small set of components, each with one job.

field data  -> ingestion (Dagster -> Parquet/MinIO)
            -> estimation library (capacity + runtime, typed, tested, internal PyPI)
            -> services (FastAPI: per-device estimates, fleet queries)
            -> dashboards (Streamlit for analysts, company Vue dashboard for the product)
            -> monitoring (health checks, alerting)

The most important decision was making the estimation logic a library, not a script buried in a pipeline.

The library is the core

The capacity and runtime estimation lives in a single typed Python library, published to an internal PyPI.

It has full test coverage and strict type checking. That sounds like overhead for an internal tool. It is the opposite. The library is consumed by the ingestion pipeline, the per-device diagnostic tool, the fleet reporting, and the evaluation pipeline. If the estimation logic changed behavior silently, four downstream systems would drift at once.

Making it a versioned, tested, installable package meant every consumer pinned a known version, and a change to the algorithm was a deliberate release, not an accident.

The estimation logic is used in four places. So it gets versioned like a dependency, not edited like a notebook.
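
To make "typed and tested" concrete, here is a minimal sketch of the kind of function such a library could expose. The names RuntimeEstimate and estimate_runtime_days are hypothetical, and the constant-drain model is a placeholder assumption, not the actual algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeEstimate:
    """Result of a runtime estimate (hypothetical type, for illustration only)."""

    remaining_mah: float
    days_remaining: float


def estimate_runtime_days(
    capacity_mah: float,
    consumed_mah: float,
    avg_drain_mah_per_day: float,
) -> RuntimeEstimate:
    """Estimate remaining runtime assuming a constant average daily drain.

    The linear model is an assumption made for this sketch.
    """
    if capacity_mah <= 0:
        raise ValueError("capacity_mah must be positive")
    if avg_drain_mah_per_day <= 0:
        raise ValueError("avg_drain_mah_per_day must be positive")
    # Clamp at zero so over-reported consumption never yields negative runtime.
    remaining = max(capacity_mah - consumed_mah, 0.0)
    return RuntimeEstimate(
        remaining_mah=remaining,
        days_remaining=remaining / avg_drain_mah_per_day,
    )
```

Because the return type is frozen and the inputs are validated, every consumer that pins the same version gets the same contract, and a change to the model shows up as a failing test or a new release rather than a silent drift.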

Ingestion is boring on purpose

The ingestion side uses Dagster to pull device telemetry, build runtime and capacity datasets, and store them as partitioned Parquet in MinIO.

Partitioning by time and storing in an object store sounds like over-engineering for battery data. But the fleet is 100,000+ devices producing telemetry continuously. Reprocessing a month should not mean reprocessing everything, and a re-run should be cheap and idempotent. Boring, partitioned, replayable ingestion is what makes the rest of the platform calm.
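
The idea of a cheap, idempotent re-run can be sketched in a few lines. This is not the Dagster pipeline itself: it uses the local filesystem and JSON as stand-ins for MinIO and Parquet so it stays standard-library only, and the function names, the month=YYYY-MM layout, and the device_id and ts row keys are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def partition_path(root: Path, dataset: str, month: str) -> Path:
    """Deterministic location for one monthly partition (hypothetical layout)."""
    return root / dataset / f"month={month}" / "part-0.json"


def write_partition(root: Path, dataset: str, month: str, rows: list[dict]) -> Path:
    """Write one partition; re-running with the same rows leaves identical bytes."""
    target = partition_path(root, dataset, month)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Sort rows and keys so the output does not depend on input order.
    ordered = sorted(rows, key=lambda r: (r["device_id"], r["ts"]))
    # Write to a temp file in the same directory, then swap it into place,
    # so readers never see a half-written partition.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(ordered, handle, sort_keys=True)
    os.replace(tmp_name, target)
    return target
```

Reprocessing one month then means calling write_partition for that month only. Other partitions are never touched, and running the same month twice overwrites the file with the same content instead of appending duplicates.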

Services and dashboards: trust needs a surface

A prediction nobody can inspect is a prediction nobody trusts.

So the platform exposes its estimates three ways. FastAPI services answer machine questions: what is the estimate for this device, this fleet, this customer. A per-device diagnostic surface answers the human question: why did the system predict this for that thermostat. And the fleet-level estimates are integrated into the company’s existing Vue dashboard, where the product team and customers already look.

That last point matters for honesty about scope. I did not build the company dashboard. I built the battery analytics that feed it, and integrated them into a frontend other engineers own. Knowing where your system ends is part of owning it.
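
A fleet query of the kind the services answer can be sketched as a plain function that a FastAPI route would wrap thinly. The DeviceEstimate type, its fields, and devices_running_low are hypothetical names for illustration, not the platform's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceEstimate:
    """One device's current estimate (hypothetical shape)."""

    device_id: str
    customer_id: str
    days_remaining: float


def devices_running_low(
    estimates: list[DeviceEstimate],
    customer_id: str,
    within_days: float,
) -> list[DeviceEstimate]:
    """Return one customer's devices expected to run out within `within_days`.

    Results are ordered soonest first, so the most urgent device leads.
    """
    due = [
        e
        for e in estimates
        if e.customer_id == customer_id and e.days_remaining <= within_days
    ]
    return sorted(due, key=lambda e: e.days_remaining)
```

Keeping the query logic in a plain function like this means the HTTP layer only handles routing and serialization, and the logic can be tested without starting a server.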

Monitoring closes the loop

The platform watches itself. Health checks and alerting catch ingestion failures, stale datasets, and data-quality problems before they reach the people who depend on the numbers.

This is the difference between a project and a system. A project produces an output once. A system keeps producing correct outputs, and tells you when it cannot.
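
A stale-dataset check is one of the simplest of these to sketch. The HealthResult type, the check_freshness and failing helpers, and the idea of a per-dataset maximum age are assumptions made for this example, not the platform's actual monitoring code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HealthResult:
    """Outcome of one health check (hypothetical shape)."""

    name: str
    ok: bool
    detail: str


def check_freshness(
    name: str,
    last_updated: datetime,
    now: datetime,
    max_age: timedelta,
) -> HealthResult:
    """Flag a dataset as stale when its last update is older than `max_age`.

    `last_updated` and `now` must both be naive or both be timezone-aware.
    """
    age = now - last_updated
    if age > max_age:
        return HealthResult(name, False, f"stale: updated {age} ago, limit {max_age}")
    return HealthResult(name, True, "fresh")


def failing(results: list[HealthResult]) -> list[HealthResult]:
    """Keep only the checks that should raise an alert."""
    return [r for r in results if not r.ok]
```

Passing now in as an argument, rather than reading the clock inside the check, keeps the check deterministic and easy to test. Whatever list failing returns is what an alerting step would act on.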

What building it alone taught me

Doing the whole thing solo, from algorithm to UI integration, forced a discipline that a larger team can paper over.

Every component had to be simple enough that one person could hold it in their head and operate it. That pushed me toward a tested core library, boring replayable ingestion, thin services, and monitoring that fails loudly. Not because those are elegant, but because they are the only way one engineer can own eight repositories without the whole thing becoming fragile.

The battery question was the easy part. Turning it into infrastructure a company can depend on was the work.